1 Introduction

Diversification is a concept as old as investing itself, coming into particular focus during crisis periods. The global financial crisis of 2008, for example, inflicted heavy losses on most asset portfolios held by institutional investors, prompting practitioners to question both their portfolio construction methodologies and their understanding of the level of diversification these methodologies actually achieve. This led to increased activity in academia and the financial industry aimed at developing new portfolio construction techniques with the goal of obtaining a well diversified portfolio. Although an overwhelming majority of investors seek to hold a well diversified portfolio, there is still no agreed-upon definition or measure of diversification in the literature or among practitioners. A common understanding is that a diversified portfolio should provide risk dissemination and be protected against large drawdowns. Expressed differently, the risk of the portfolio should not be concentrated in only a few risk factors, and the tail risk of the portfolio should be controlled. The diversification literature initially focused purely on volatility reduction, but this definition cannot yield a useful measure, since hedging reduces volatility without improving risk dissemination. Most diversification measures and construction methods in the literature are based on the covariance matrix, which can be traced back to the seminal paper on mean-variance optimization by Markowitz [55]. Other approaches, with a growing literature and used by many asset managers and institutional investors, include the risk parity approach coined by Qian [59] (see also e.g. Qian [60], Roncalli [63] and Roncalli and Weisang [64]) and the most diversified portfolio of Choueifaty and Coignard [21] (see also Choueifaty et al. [22]). Another covariance-based portfolio diversification measure is introduced in Meucci [58]. There, the asset universe is orthogonalized via Principal Component Analysis of the covariance matrix, leading to a new universe consisting of uncorrelated, so-called principal portfolios. A diversification measure is then defined from the dispersion of the squared weighted volatilities of the portfolio in this new universe. The interpretation put forth by Meucci is that this measure represents the effective number of uncorrelated bets to which the portfolio is exposed.

The primary drawback of existing portfolio diversification approaches is that they are based on the distribution of portfolio volatilities or marginal contributions to volatility; however, as far as we are aware, there is no direct connection from these measures to the distributional or sampling properties of portfolio returns. This leads to a situation where different metrics are shown to behave intuitively in particular cases, while singular counterexamples raise questions about how broadly a technique can be applied. For example, risk parity portfolios behave unintuitively when highly correlated assets are added to the portfolio, leading to over-allocation to them. Unintuitive behaviour of the measure introduced in Meucci [58] led to the introduction of a new technique in Meucci (2013). For this reason we argue, following Fleming and Kroeske [31], that it is helpful to augment the covariance-based frameworks and connect diversification to additional properties of the distribution of portfolio returns, such as higher order moments.

Bringing in higher order moments allows us to move beyond limiting Gaussian assumptions. As a relatively extreme example, consider a two-strategy portfolio combining an equity index exposure and a volatility selling strategy on the same index. The volatility of most volatility selling strategies is lower than that of the equity index, while their negative skewness and kurtosis are more pronounced. A portfolio construction strategy based on volatility would thus put a larger weight on the volatility selling strategy in order to decrease volatility risk, at the expense of being more exposed to tail risk. Such a portfolio would have suffered heavy losses during the VIX spike on 5 February 2018. On that day the VIX index experienced the largest one-day jump in its 25-year history, rising 20 points from the previous day's close of 17.31 to 37.32 at the end of the trading day. This example highlights why we require our diversification measure to have a direct link to the tail properties of the distribution of portfolio returns. Secondly, we want the measure to be leverage invariant, as being 100% exposed to the S&P 500 is as diversified as being 50% exposed to the S&P 500 and leaving the rest in cash. Thus, the diversification measure should not be based on portfolio volatility or Expected Shortfall alone. After all, the best strategy to reduce volatility or Expected Shortfall is to allocate more capital to cash, but that does not increase diversification; it simply represents a reduced exposure to risky assets.
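To make the leverage invariance requirement concrete, the following short simulation (a sketch in Python; the Student-t return stream and all parameter values are illustrative assumptions, not data from the paper) compares volatility, Expected Shortfall and excess kurtosis of the same return stream under 100% and 50% exposure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical daily returns of a risky asset (Student-t, fat tails).
r = stats.t.rvs(df=5, scale=0.01, size=100_000, random_state=rng)

for exposure in (1.0, 0.5):          # 100% invested vs. 50% invested, rest in cash
    p = exposure * r                 # leverage scales every return linearly
    vol = p.std()
    es_95 = -p[p <= np.quantile(p, 0.05)].mean()   # Expected Shortfall at 95%
    kurt = stats.kurtosis(p)         # excess kurtosis (Fisher definition)
    print(f"exposure={exposure:.0%}  vol={vol:.4f}  ES95={es_95:.4f}  ex.kurt={kurt:.2f}")

# Volatility and ES halve with the exposure, while excess kurtosis is unchanged:
# it is a ratio of homogeneous functions of equal degree, hence leverage invariant.
```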

In this paper we introduce a framework that offers a coherent foundation for understanding portfolio diversification by connecting it to the non-Gaussian properties of portfolio returns. The requirement that the diversification measure be leverage invariant naturally leads to measures based on ratios of homogeneous functions of equal degree, such as kurtosis (degree 4) or the square of skewness (degree 6). Even though, for example, the fourth moment and the square of the variance are both convex functions, the ratio that yields kurtosis is not necessarily convex. Optimizing ratios of convex functions is a global optimization problem with potentially several local optima that differ from the global optimum or optima. We therefore develop two methods, one deterministic and one stochastic, for the global optimization of ratios of convex functions and use them to conduct initial numerical experiments.

2 Non-Gaussianity as a measure of diversification

In this section we present a framework which first connects non-Gaussianity and diversification before introducing the notion of portfolio dimensionality. The main goal of the portfolio diversification framework is to manage the distribution of the portfolio returns. One observes, in fact, that the distribution of portfolio volatilities or of marginal contributions to volatility across the assets that make up the portfolio is irrelevant within a mean-variance framework. This is due to the underlying assumption of normally distributed returns: all that matters is portfolio variance. From this simple observation, we argue that meaningful measures of diversification must be related to additional properties of the distribution of portfolio returns. A natural extension of existing frameworks is to connect the concept of diversification to the non-Gaussian properties of the distribution of portfolio returns. Given that the mean-variance framework assumes Gaussianity, diversification can then be seen as an augmentation which addresses the model's limitations.

2.1 A novel approach to portfolio diversification: dimensionality

In an ideal world, one could define portfolio dimensionality as the number of equally sized independent return streams in the portfolio. This definition is intuitively related to risk dissemination, and arguments based on the Central Limit Theorem (CLT) imply that adding independent exposures to the portfolio leads to a portfolio whose distribution is closer to the Gaussian distribution, thereby reducing tail risk. Obviously, financial markets do not obey the idealized assumptions of independent and identically distributed (i.i.d.) returns of the standard CLT (see [10] for some examples of the CLT with relaxed assumptions). The idea behind our diversification measures is to base them on the degree of non-Gaussianity of the portfolio return distribution: a portfolio with a low degree of non-Gaussianity is a well diversified portfolio, and vice versa. Measuring the degree of non-Gaussianity is directly related to the tail properties of the portfolio and naturally leads to measures which are leverage invariant. Measuring and optimizing non-Gaussianity have been thoroughly studied in the Independent Component Analysis (ICA) literature, see e.g. Hyvärinen and Oja [44]. A common measure of non-Gaussianity in the ICA literature is kurtosis; other frequently used measures are based on negentropy or the Kullback–Leibler divergence. Inspired by the ICA literature, we initially link the notion of a well diversified portfolio to a portfolio with low kurtosis, which implies low (symmetric) tail risk. Other attractive aspects of using kurtosis are that we see it as a natural extension of a symmetric risk framework and that it is related to the distribution of the sample variance. In particular, it is known that the variance of the distribution of the sample variance is positively related to kurtosis (see e.g. [75]). Reducing kurtosis therefore increases confidence in estimates of portfolio variance.
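This CLT effect is easy to check numerically. The sketch below (an illustration under the idealized i.i.d. assumption; the Student-t marginal and sample sizes are our own choices) confirms that the excess kurtosis of an equally weighted portfolio of k i.i.d. leptokurtic return streams decays like 1/k, a fact proven formally in Proposition 2.5 below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Excess kurtosis of a single leptokurtic stream (t with 6 dof: theory value 3).
kappa = stats.kurtosis(stats.t.rvs(df=6, size=10**6, random_state=rng))

for k in (1, 2, 4, 8):
    # Equally weighted portfolio of k i.i.d. leptokurtic return streams.
    streams = stats.t.rvs(df=6, size=(10**6, k), random_state=rng)
    port = streams.mean(axis=1)
    print(f"k={k}:  excess kurtosis ~ {stats.kurtosis(port):.3f}  (theory: {kappa / k:.3f})")
```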

That said, asymmetry in the form of skewness is also of interest to investors, where empirical results from the risk premia literature (Lempérière et al. [53]) show that maximizing the Sharpe ratio of a portfolio is strongly linked to maximizing the negative skewness of portfolio returns. There are several ways to incorporate skewness into a portfolio diversification framework. In Lassance and Vrins [51], a portfolio risk measure based on exponential Rényi entropy is used in order to incorporate higher order moments into the portfolio decision framework. Through a truncated Gram–Charlier expansion of Rényi entropy, they demonstrate that their portfolio risk measure can be directly expressed as a function of portfolio skewness and kurtosis. Another approach, see e.g. Jondeau and Rockinger [46], relies on a higher order Taylor expansion of the investor's utility function, which leads to an expression in terms of the non-standardized portfolio moments. This latter approach suffers from the drawback of optimizing an objective function which is not invariant to leverage. In the following, we offer a general framework which allows us to look at various measures including skewness and kurtosis, but for the purposes of these initial numerical investigations we focus on kurtosis.

2.2 From non-Gaussianity to dimensionality: definition and examples

With all the above in mind, we proceed to define a diversification framework which is invariant to leverage and directly linked to the tail properties of the distribution of portfolio returns. It is also flexible enough to allow different objective functions, such as excess kurtosis and the square of skewness (along with any suitable linear or polynomial combination thereof), within a robust setting for measuring, in an appropriate sense, the level of non-Gaussianity of the resulting portfolios. Furthermore, in order to give diversification an intuitive interpretation, we link it to the tail risk of an equally weighted reference portfolio of i.i.d. reference assets. This reference portfolio is representative of the given asset universe, and we proceed to define the notion of portfolio dimensionality relative to the tail risk of the reference asset.

Definition 1

Let p be a positive integer and \({\mathcal {L}}^p\) denote the set of all random variables with finite p-th (absolute) moment. Let also \({\mathcal {X}}\) be a convex subset of \({\mathcal {L}}^p\) and \({\mathcal {N}}\) denote the set of all Gaussian random variables. We define a function \(\nu : {\mathcal {L}}^p \mapsto {\mathbb {R}}_+\) that satisfies the following properties:

(i) \(\nu (tX) = \nu (X)\), for any \(t>0\) and \(X \in {\mathcal {L}}^p\). (leverage invariance)

(ii) Let \(Y \in {\mathcal {X}}\) and, moreover, let \(Y_1, Y_2,\ldots \in {\mathcal {X}}\) be independent, each following \(\mathrm{Law}(Y)\). The function

$$\begin{aligned} \phi _{Y,\nu }(n)= \nu \left( \sum _{i=1}^{n}Y_i\right) \end{aligned}$$
(1)

is strictly decreasing in \(n\in {\mathbb {N}}\) for any \(Y\in {\mathcal {X}}\). (strict monotonicity for i.i.d. data)

(iii) \(\nu (X)\ge 0\) for every \(X \in {\mathcal {X}}\cup {\mathcal {N}}\), with equality holding only if \(X \in {\mathcal {N}}\). (positivity)

(iv) \(\nu (X)= \nu (Y)\) for any \(X, Y\in {\mathcal {L}}^p\) such that \(\mathrm{Law}(X)= \mathrm{Law}(Y)\). (law invariance)

(v) For any constant \(c\in {\mathbb {R}}\), \(\nu (X+c)= \nu (X)\) for every \(X \in {\mathcal {L}}^p\). (cash invariance)

Remark 2.1

One notes that the functional \(\nu \), which is sometimes referred to as a risk measure, without necessarily adhering to the formal definition of a risk measure as in Artzner et al. [4] but rather following the generic usage in asset management, is only required to be nonnegative over \({\mathcal {X}}\cup {\mathcal {N}}\). Moreover, \(\nu \) is used henceforth to measure the level of non-Gaussianity of the resulting portfolios. See also the kurtosis example below along with the related discussion.

We now define a measure of diversification relative to a reference random variable \(Z\in {\mathcal {X}}\). First, let us recall the definition of the \((n-1)\)-dimensional probability simplex, which is given by \(\mathcal {W}_n=\{\varvec{w}\in \mathbb {R}^{n}\text { }\mid \sum _{i=1}^{n} w_i = 1, w_i \ge 0, i=1,\ldots ,n \}\), where \(\varvec{w}=[w_1,\ldots ,w_n]^{\top }\).

Definition 2

Let \(Z, X_1, \ldots , X_n \in {\mathcal {X}}\) and \(\nu \) satisfy the properties (i)–(v) from Definition 1. Let also \(\nu \left( \sum _{i=1}^{n}w_iX_i\right) \) be a continuous function in \(\varvec{w}\in \mathcal {W}_n\). Then, the function \(D_{Z,\,\nu }: \mathcal {W}_n \mapsto {\mathbb {R}}_+\) is defined by

$$\begin{aligned} (\textbf{Diversification measure}): \qquad D_{Z,\,\nu }(\varvec{w}) = \frac{\nu (Z)}{\nu \left( \sum _{i=1}^{n}w_iX_i\right) }. \end{aligned}$$
(2)

Remark 2.2

In the context of asset management, the above setting for the diversification measure is understood as \(X_i\in {\mathcal {X}}\) being the return of the i-th asset of the portfolio and \(w_i\) is the corresponding portfolio weight.

Definition 3

Let \({\mathcal {X}}\) be a convex subset of \({\mathcal {L}}^p\). Let also \(\{Z_i\}_{i\ge 1}\) be a sequence of i.i.d. random variables such that each \(Z_i \in {\mathcal {X}}\) and let \(Z \in {\mathcal {X}}\). Moreover, suppose that \(\mathrm{Law}(Z_1)= \mathrm{Law}(Z)\) and that \(\nu \) satisfies the properties (i)–(v) from Definition 1. We define \({\hat{\eta }}:{\mathbb {N}}\backslash \{0\} \mapsto {\mathbb {R}}_+ \) by

$$\begin{aligned} {\hat{\eta }}(k) = \frac{\nu (Z)}{\nu \left( \frac{1}{k}\sum _{i=1}^{k}Z_i\right) }, \text{ for } \text{ any } k\in {\mathbb {N}}\backslash \{0\}. \end{aligned}$$

We further define the function \(\eta :{\mathbb {R}}_+ \mapsto {\mathbb {R}}_+ \) as the continuous, monotonic (linear) interpolation of \({\hat{\eta }}\) on \({\mathbb {R}}_+\).

Recall here that, given a collection of i.i.d. random variables \(\{Z_i\}_{1\le i\le n}\) and Z, all belonging to \({\mathcal {X}}\) and such that \(\mathrm{Law}(Z_1)= \mathrm{Law}(Z)\), \({\hat{\eta }}(k)\) represents the diversification measure of an equally weighted portfolio in those \(Z_i\)'s. Moreover, due to the leverage invariance of \(\nu \) and the strict monotonicity of the function \(\phi _{Z,\nu }\), see (1), \({\hat{\eta }}\) is strictly increasing, and hence its interpolation \(\eta :{\mathbb {R}}_+ \mapsto {\mathbb {R}}_+\) is a well-defined, strictly increasing function. Hence, the definition of portfolio dimensionality follows naturally by considering a suitable transformation, which involves the inverse of the function \(\eta \).

Definition 4

Let \(D_{Z,\,\nu }(\varvec{w})\) be described by Definition 2 and the function \(\eta \) by Definition 3. We define the function \( d_{Z,\,\nu }: \mathcal {W}_n \mapsto {\mathbb {R}}_+\) by

$$\begin{aligned} (\textbf{Portfolio dimensionality}): \qquad d_{Z,\,\nu }(\varvec{w}) = \eta ^{-1}\left( D_{Z,\,\nu }(\varvec{w})\right) . \end{aligned}$$
(3)

Remark 2.3

One observes that, for any \(k\le n\),

$$\begin{aligned} D_{Z,\,\nu }(\varvec{w}) = \frac{\nu (Z)}{\nu \left( \frac{1}{k}\sum _{i=1}^{k}Z_i\right) } \frac{\nu \left( \sum _{i=1}^{k}Z_i\right) }{\nu \left( \sum _{i=1}^{n}w_iX_i\right) } = \eta (k) \frac{\nu \left( \sum _{i=1}^{k}Z_i\right) }{\nu \left( \sum _{i=1}^{n}w_iX_i\right) }. \end{aligned}$$
(4)

Thus,

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}w_iX_i\right) = \nu \left( \frac{1}{k}\sum _{i=1}^{k}Z_i\right) \qquad \Rightarrow \qquad d_{Z,\,\nu }(\varvec{w})=k, \end{aligned}$$

since, by leverage invariance, the denominator in (4) is equal to \(\phi _{Z,\nu }(k)\), so that the ratio equals one and \(D_{Z,\,\nu }(\varvec{w})= \eta (k)\). Thus, we see that the portfolio dimensionality is exactly the number of independent return streams.

Remark 2.4

In general, a judicious selection of the reference asset as representative of the investment universe will produce values of \(D_{Z,\,\nu } > 1\) as we achieve some relative diversification benefit; however, we note that \(D_{Z,\,\nu }<1\) is also possible if we worsen the relative non-Gaussianity.

Let us now concentrate on the case where \(\nu \) is either excess kurtosis or the square of skewness (which also allows us to consider other suitable linear or polynomial combinations of these two risk measures). Using the leverage invariance property of \(\nu \) and taking into account the findings of Fleming and Kroeske [31], where the notion of the distribution of portfolio variance is used and the effective size of its support is related to the spectrum of Rényi entropies, one identifies \(\eta \) with the identity function, so that \(d_{Z,\,\nu }(\varvec{w}) = D_{Z,\,\nu }(\varvec{w})\). For completeness, we present the relevant derivations here, although they can be found in Fleming and Kroeske [31], albeit with different notation.

Proposition 2.5

Let \({\mathcal {X}}\) be a convex subset of \({\mathcal {L}}^p\). Then

(a) For \(p\ge 3\), the square of skewness, i.e. \(\nu (\cdot )=\left( \frac{{\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^3]}{({\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^2])^{3/2}}\right) ^2\), satisfies properties (i)–(v) from Definition 1, provided that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\). Moreover, \({\hat{\eta }}(k)=k\) for any \(k\in {\mathbb {N}}\).

(b) For \(p\ge 4\), the excess kurtosis, i.e. \(\nu (\cdot )=\frac{{\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^4]}{({\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^2])^2} - 3\), satisfies properties (i)–(v) from Definition 1, provided that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\). Moreover, \({\hat{\eta }}(k)=k\) for any \(k\in {\mathbb {N}}\).

Proof

We start the proof by considering an arbitrary sequence of i.i.d. random variables \(\{Y_i\}_{i\ge 1}\) such that each \(Y_i \in {\mathcal {X}}\).

(a) One observes that properties (i), (iii), (iv) and (v) are satisfied trivially due to the definition of the square of skewness, the linearity of expectation and the assumption that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\). It remains to show property (ii). To this end, one considers an arbitrary convex combination of n i.i.d. random variables, say \(\sum _{i=1}^{n}w_iY_i\), where \(w_i\in [0,1]\) and \(\sum _{i=1}^{n}w_i=1\). Further and without loss of generality, it is assumed that the \(Y_i\)’s have mean zero (as the central second and third moments are considered in the definition of skewness). One further defines \(p_i:=w_i^2/\sum _iw_i^2\), the vector \({\mathbf {p}}_n:=[p_1, \ldots , p_n]^{\top }\) in \({\mathbb {R}}^n\) and the function

$$\begin{aligned}&D_{3/2}({\mathbf {p}}_n):=\left( \sum _{i=1}^{n}p_i^{3/2}\right) ^{-2}, \text{ with } D_{3/2}({\mathbf {p}}_n) =n \text{ when } w_1=\ldots =w_n=1/n. \end{aligned}$$

One then calculates, due to the i.i.d. nature and zero mean property of the random variables under consideration,

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}w_iY_i\right) =&\left( \frac{\sum _{i=1}^{n}w_i^3{\mathbb {E}}[Y_1^3]}{\left( \sum _{i=1}^{n}w_i^2{\mathbb {E}}[Y_1^2]\right) ^{3/2}}\right) ^2 = \nu (Y_1)\left( \frac{\sum _{i=1}^{n}w_i^3}{\left( \sum _{i=1}^{n}w_i^2\right) ^{3/2}}\right) ^2 \\ =&\nu (Y_1)\left( \sum _{i=1}^{n}p_i^{3/2}\right) ^2 = \frac{\nu (Y_1)}{D_{3/2}({\mathbf {p}}_n)}. \end{aligned}$$

Thus, due to property (i), i.e. leverage invariance and the fact that \(D_{3/2}({\mathbf {p}}_n)=n\) when \(w_1=\ldots =w_n=1/n\), one obtains

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}Y_i\right) = \nu \left( \sum _{i=1}^{n}\frac{1}{n}Y_i\right) = \frac{\nu (Y_1)}{n}, \end{aligned}$$

which is strictly decreasing in n. Moreover, for any \(Y \in {\mathcal {X}}\) such that \(\mathrm{Law}(Y_1)= \mathrm{Law}(Y)\),

$$\begin{aligned} {\hat{\eta }}(k) = \frac{\nu (Y)}{\nu \left( \frac{1}{k}\sum _{i=1}^{k}Y_i\right) } = k\frac{\nu (Y)}{\nu (Y_1)}=k, \end{aligned}$$

and thus the desired result is obtained.

(b) Following the approach from (a), one observes that again the properties (i), (iii), (iv) and (v) are satisfied trivially due to the definition of excess kurtosis, the linearity of expectation and the assumption that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\). Also, in a similar approach one considers an arbitrary convex combination of n i.i.d. random variables, which is denoted again by \(\sum _{i=1}^{n}w_iY_i\). One, however, now considers the function

$$\begin{aligned} D_{2}({\mathbf {p}}_n):=\left( \sum _{i=1}^{n}p_i^2\right) ^{-1}, \text{ with } D_{2}({\mathbf {p}}_n)=n \text{ when } w_1=\ldots =w_n=1/n. \end{aligned}$$

Consequently, one calculates by taking into consideration the i.i.d. nature and zero mean property of the random variables under consideration

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}w_iY_i\right) =&\frac{\sum _{i=1}^{n}w_i^4{\mathbb {E}}[Y_1^4]+ 3\sum _{i\ne j}w_i^2w_j^2({\mathbb {E}}[Y_1^2])^2}{\left( \sum _{i=1}^{n}w_i^2{\mathbb {E}}[Y_1^2]\right) ^2} -3\\ =&\left( \frac{{\mathbb {E}}[Y_1^4]}{({\mathbb {E}}[Y_1^2])^2} \frac{\sum _{i=1}^{n}w_i^4}{\left( \sum _{i=1}^{n}w_i^2\right) ^2}+ 3\sum _{i\ne j} \frac{w_i^2}{\sum _{i=1}^{n}w_i^2} \frac{w_j^2}{\sum _{i=1}^{n}w_i^2} \right) -3 \\ =&\left( (\nu (Y_1)+3)\sum _{i=1}^{n}p_i^2+ 3\sum _{i\ne j}p_ip_j\right) -3 \\ =&\left( (\nu (Y_1)+3)\sum _{i=1}^{n}p_i^2 + 3\left( 1-\sum _{i=1}^{n}p_i^2\right) \right) -3 = \nu (Y_1)\sum _{i=1}^{n}p_i^2 = \frac{\nu (Y_1)}{D_{2}({\mathbf {p}}_n)}. \end{aligned}$$

Here, the identity \(\sum _{i\ne j}p_ip_j=1-\sum _{i=1}^{n}p_i^2\) follows from the fact that \(\sum _{i=1}^{n}p_i=1\). Thus,

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}Y_i\right) = \nu \left( \sum _{i=1}^{n}\frac{1}{n}Y_i\right) = \frac{\nu (Y_1)}{D_{2}({\mathbf {p}}_n)} = \frac{\nu (Y_1)}{n}, \end{aligned}$$

which is strictly decreasing in n and hence the desired result is also obtained in this case as in part (a). \(\square \)
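The closed forms derived in the proof give, for weighted portfolios of i.i.d. assets, a simple "effective number of bets" expression for the dimensionality, which the following sketch implements (the function name and example weights are ours, for illustration only):

```python
import numpy as np

def dimensionality_iid(w, measure="kurtosis"):
    """Portfolio dimensionality of a weighted portfolio of i.i.d. assets.

    From the proof of Proposition 2.5: with p_i = w_i^2 / sum(w^2),
    nu(sum w_i Y_i) = nu(Y_1) / D(p), so D_{Z,nu}(w) = D(p) with
    D(p) = (sum p_i^2)^{-1} for excess kurtosis and
    D(p) = (sum p_i^{3/2})^{-2} for squared skewness."""
    w = np.asarray(w, dtype=float)
    p = w**2 / np.sum(w**2)
    if measure == "kurtosis":
        return 1.0 / np.sum(p**2)
    return np.sum(p**1.5) ** -2.0   # squared skewness

print(dimensionality_iid([0.25, 0.25, 0.25, 0.25]))   # 4.0: four equal bets
print(dimensionality_iid([0.7, 0.1, 0.1, 0.1]))       # ~1.12: dominated by one bet
```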

Corollary 2.6

Let \({\mathcal {X}}\) be a convex subset of \({\mathcal {L}}^4\), and let \(\nu _1\) and \(\nu _2\) denote excess kurtosis and squared skewness, respectively, i.e.

$$\begin{aligned} \nu _1(\cdot )=\frac{{\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^4]}{({\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^2])^2} - 3 \qquad \text{ and } \qquad \nu _2(\cdot )=\left( \frac{{\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^3]}{({\mathbb {E}}[(\cdot -{\mathbb {E}}[\cdot ])^2])^{3/2}}\right) ^2. \end{aligned}$$

Then

(a) Any linear combination with positive coefficients of excess kurtosis and squared skewness, i.e.

$$\begin{aligned} \nu (\cdot ) = \lambda _1\nu _1(\cdot )+ \lambda _2\nu _2(\cdot ), \text{ where } \lambda _1, \lambda _2 \ge 0, \end{aligned}$$

satisfies properties (i)–(v) from Definition 1, provided that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\).

(b) The following polynomial combination with positive coefficients of excess kurtosis and squared skewness (which includes the Jarque–Bera goodness-of-fit statistic), namely

$$\begin{aligned} \nu (\cdot ) = \lambda _1\nu _1^2(\cdot )+ \lambda _2\nu _2(\cdot ), \text{ where } \lambda _1, \lambda _2 > 0, \end{aligned}$$

satisfies properties (i)–(v) from Definition 1, provided that \(\nu (X)>0\) for every \(X\in {\mathcal {X}}\).

Thus, we see that our newly proposed diversification measure can be built by using excess kurtosis and/or squared skewness, and that our definition of portfolio dimensionality is well defined for large and rich sets of random variables. Moreover, since \(\phi _{Z,\nu }\) is a monotonically decreasing function, if

$$\begin{aligned} \phi _{Z,\nu }(k+1)< \nu \left( \sum _{i=1}^{n}w_iX_i\right)< \phi _{Z,\nu }(k), \text{ then } k< D_{Z,\,\nu }(\varvec{w}) < k +1, \end{aligned}$$

with \(D_{Z,\,\nu }(\varvec{w})\) taking a non-integer value according to a monotonic interpolation of \(\phi _{Z,\nu }\). As a result, one observes that the higher the value of \(D_{Z,\,\nu }(\varvec{w})\), the closer we are to a tail risk similar to that of a standard Gaussian. To see this, consider a large enough k and \( D_{Z,\,\nu }(\varvec{w}) \ge k \). Consequently, one obtains from (4) that the portfolio behaves as if it contained at least k independent assets:

$$\begin{aligned} \nu \left( \sum _{i=1}^{n}w_iX_i\right) \le \phi _{Z,\nu }(k) = \nu \left( \frac{\sum _{i=1}^{k}Z_i}{\sqrt{k{\mathbb {E}}[Z^2]}}\right) , \qquad \text{(due } \text{ to } \text{ leverage } \text{ invariance) } \end{aligned}$$

and thus due to the CLT and property (ii) from Definition 1, one observes the desired result.

Remark 2.7

We focus henceforth on the case of random variables with skewed and leptokurtic distributions (the latter implying positive excess kurtosis). Such random variables represent the asset returns in a given asset universe under consideration. This choice agrees with the relevant literature in quantitative finance and reflects well-documented stylized facts of asset returns, namely negative skewness and excess kurtosis due to fat tails, see e.g. Cont [25] and references therein. Thus, \({\mathcal {X}}\) represents, henceforth, the convex hull of a fixed number of such random variables which are used to form a portfolio under consideration. One notes, however, that \({\mathcal {X}}\) cannot always be a subset of random variables with leptokurtic distributions, which implies that excess kurtosis may become negative. For example, in the theoretical case where two perfectly negatively correlated assets, say \(X\in {\mathcal {X}}\) and \(-X\in {\mathcal {X}}\), are used to form a portfolio, the convex combination \(1/2X+1/2(-X)\) has this property. Similarly, there may be situations (very unlikely in the real world) where the assets which form a portfolio exhibit very strong negative correlation, which may result in convex combinations of them having (even marginally) negative excess kurtosis. In all such cases, if the measure of non-Gaussianity is simply excess kurtosis (as in our examples below), then one proceeds with the minimization of excess kurtosis, knowing that a negative sign implies tails lighter than those of the Gaussian distribution (which is desirable from the point of view of asset management), although no diversification or portfolio dimensionality number can then be produced.

2.3 Desirable properties of the diversification measure: Toy example

Despite the fact that there is no agreed-upon definition of diversification in the literature, a number of desirable properties of a diversification methodology have been proposed. In Choueifaty et al. [22], the notion of polico invariance is introduced: extending an asset universe by adding a positive linear combination of assets already belonging to the universe should not affect the weights assigned to the original assets when applying the diversification methodology. A special case of polico invariance, denoted duplication invariance, considers the duplication of one of the assets in the universe. This case naturally arises in applications when one of the assets is listed on multiple exchanges. Applying the diversification methodology should produce the same portfolio irrespective of whether any asset in the universe is duplicated. In Koumou [49], further desirable properties of diversification measures are introduced. However, some of the properties presented in Koumou [49] are not consistent with the requirements that we impose on a diversification measure. In Sect. 1, we introduced the requirement that the portfolio diversification measure should be leverage invariant. This contrasts with one of the desired properties presented in Koumou [49]. Furthermore, in Koumou [49], the portfolio diversification measure is required to be concave or quasi-concave. As we have argued, a leverage invariant diversification measure typically leads to a ratio of two convex functions, which in general is neither concave nor quasi-concave.

In the following, a numerical example is used to demonstrate that important desirable properties are satisfied by the newly introduced portfolio diversification measure. The demonstration is based on a toy example with a universe consisting of three assets with the following covariance matrix

$$\begin{aligned} {\mathbf {C}}= \left[ \begin{array}{ccc} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{array} \right] . \end{aligned}$$
(5)

As the correlation \(\rho \) between asset one and asset two approaches one, these two assets behave as one asset, and hence this corresponds to the case when one of the assets in the universe is duplicated. For this case, the weight of asset three should approach \(\tfrac{1}{2}\) as \(\rho \rightarrow 1\). When \(\rho \rightarrow -1\), this corresponds to the case when either asset one or asset two is a perfect hedge of the other. In this case, assuming that \({\mathbf {C}}\) remains positive definite, the volatility of a portfolio given by the weight vector \(\varvec{w}=[0.5, \text { }0.5, \text { }0]^{\top }\) tends to a small value \(c>0\) as \(\rho \rightarrow -1\). In Choueifaty et al. [22], it is demonstrated that risk parity violates duplication invariance. It is well known in the literature that the global minimum variance portfolio tends to be highly concentrated in assets with low volatility, see e.g. Clarke et al. [24]. Thus, for an asset universe where the exposure to some assets has to a large extent been hedged away, the global minimum variance portfolio tends to be highly concentrated in the hedged assets. We call this undesirable property of the global minimum variance portfolio the hedging invariance problem.

Consistency with the duplication invariance and hedging invariance properties for the introduced diversification framework is illustrated in Fig. 1 for the case when the marginal distributions of the assets can be assumed to be approximately symmetric. In this case, we assume that non-Gaussianity is adequately captured by portfolio kurtosis. The consistency with the desired properties is monitored through the weight of asset three for the cases when \(\rho \rightarrow 1\) and \(\rho \rightarrow -1\). The weight of asset three obtained when minimizing portfolio kurtosis is compared to the corresponding weights obtained with risk parity and from maximizing the diversification ratio introduced in Choueifaty and Coignard [21]. Since the volatilities of the three assets are equal, the portfolio obtained from maximizing the diversification ratio coincides with the global minimum variance portfolio, see Choueifaty and Coignard [21]. For risk parity and the most diversified portfolio, the weight of asset three can be derived analytically and is given by

$$\begin{aligned} w_3^{\text {RP}}=\dfrac{2\sqrt{1+\rho }-(1+\rho )}{3-\rho }, \end{aligned}$$
(6)

for the risk parity portfolio, and

$$\begin{aligned} w_3^{\text {DR}}=\dfrac{1+\rho }{3+\rho }, \end{aligned}$$
(7)

for the maximized diversification ratio and the global minimum variance portfolios. Thus, when \(\rho \rightarrow 1\), the weight of the third asset approaches \(\sqrt{2}-1\) for the risk parity portfolio, whereas \(w_3 \rightarrow \tfrac{1}{2}\) for the maximized diversification ratio. From Fig. 1a, one observes that the minimum kurtosis portfolio and the maximized diversification ratio satisfy the duplication invariance property, whereas risk parity does not.
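For reproducibility, the closed-form weights (6) and (7) can be evaluated directly (a small sketch; the function names are ours):

```python
import numpy as np

def w3_risk_parity(rho):
    # Eq. (6): weight of asset three in the risk parity portfolio.
    return (2.0 * np.sqrt(1.0 + rho) - (1.0 + rho)) / (3.0 - rho)

def w3_diversification_ratio(rho):
    # Eq. (7): weight of asset three for the maximized diversification ratio
    # (= global minimum variance portfolio here, since volatilities are equal).
    return (1.0 + rho) / (3.0 + rho)

for rho in (0.0, 0.5, 0.99, 1.0):
    print(f"rho={rho:5.2f}  RP: {w3_risk_parity(rho):.4f}  DR: {w3_diversification_ratio(rho):.4f}")
# As rho -> 1, RP tends to sqrt(2)-1 ~ 0.4142 (duplication problem),
# while DR tends to the duplication-invariant weight 1/2.
```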

Fig. 1: Weight of asset three for the minimum kurtosis, risk parity and maximized diversification ratio portfolios for the cases when (a) \(\rho \in [0,1]\) and (b) \(\rho \in (-1,0]\)

When \(\rho \rightarrow -1\), the volatility of a portfolio with weight vector \(\varvec{w}=[0.5, \text { }0.5, \text { }0]^{\top }\) approaches the small value \(c>0\). All portfolio construction methodologies that are based only on the covariance matrix will approach this solution when \(\rho \rightarrow -1\). The question is at which rate. From Fig. 1b, one observes that \(w_3^{\text {DR}}\) approaches zero at a rate which is close to linear as \(\rho \) varies between 0 and \(-1\). Since this corresponds to the behaviour of the global minimum variance portfolio, which suffers from the hedging invariance problem, this rate is too large when \(\rho \) is not close to \(-1\). Figure 1b reveals that the weight of asset three for both the minimum kurtosis and risk parity portfolios approaches zero at a slower rate than for the diversification ratio portfolio when \(\rho \) is not close to \(-1\). These portfolios are thus not too heavily concentrated in the partially hedged exposure represented by asset one and asset two in our example. We conclude that the minimum kurtosis and risk parity portfolios satisfy the hedging invariance property. Hence, only the minimum kurtosis portfolio satisfies both desired properties when the asset distributions are symmetric.

We finally stress that in this paper we do not attempt to accurately estimate higher order moments or joint distributions of asset returns. The multivariate distribution of the asset returns is modelled with a Gaussian copula and marginal distributions which allow for differing skewness and kurtosis parameters for the individual assets. By modelling the dependence structure with a Gaussian copula, we avoid the notoriously difficult task of estimating a nonlinear dependence structure between the assets. The cost of using a model with less uncertainty in the estimated parameters is that we only take linear dependence between the asset returns into account in this paper. In order to obtain a robust implementation of the framework, we take the approach of assigning representative tail risk parameters to different asset groups. Based on the assigned tail risk parameters, the diversification framework then lets us measure and optimize the portfolio dimensionality for a given asset universe.
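A minimal sketch of this modelling choice is given below (the helper name, the chosen marginals and all parameter values are illustrative assumptions, not the calibrated settings used later): linear dependence enters through the Gaussian copula, while each marginal carries its own assigned skewness and tail parameters.

```python
import numpy as np
from scipy import stats

def sample_gaussian_copula(corr, marginals, n, seed=0):
    """Draw samples whose dependence is a Gaussian copula with correlation
    matrix `corr` and whose marginals are given scipy frozen distributions."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n)
    u = stats.norm.cdf(z)                      # map to uniforms via the copula
    return np.column_stack([m.ppf(u[:, i]) for i, m in enumerate(marginals)])

# Illustrative setup: three assets, linear dependence from a correlation
# matrix, and skewed/fat-tailed marginals with assigned tail parameters.
corr = np.array([[1.0, 0.5, 0.0],
                 [0.5, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
marginals = [stats.skewnorm(a=-4, scale=0.01),   # negatively skewed
             stats.t(df=5, scale=0.01),          # fat tailed
             stats.norm(scale=0.01)]             # Gaussian reference
returns = sample_gaussian_copula(corr, marginals, n=100_000)
```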

3 Deterministic global optimization of ratios of convex functions

There are numerous applications in finance that involve the optimization of ratios, see, e.g., Stoyanov et al. [73]. In the previous two sections we argued that formulating an appropriately defined portfolio diversification measure naturally leads to functions that are ratios of convex functions. In this section we develop a deterministic algorithm for solving such problems to global optimality.

Let \(\mathcal {A}\subseteq \mathbb {R}^n\) be a nonempty compact convex set and consider the maximization problem

$$\begin{aligned} \max \limits _{\varvec{w}\in \mathcal {A}}\, h(\varvec{w}), \qquad \text {where}\qquad h(\varvec{w}) \,=\, \dfrac{f(\varvec{w})}{g(\varvec{w})} \end{aligned}$$
(8)

and \(f,g: \mathcal {A}\rightarrow \mathbb {R}\) are positive and continuous functions. In Avriel et al. [7] it is shown that when f is concave and g is convex, \(h(\varvec{w})\) is a semi-strictly quasi-concave function. Many theoretical results, as well as algorithms of convex programming, apply to the problem of maximizing a quasi-concave function over a convex set (see [27, 66, 67]). In particular, each local maximum is also a global maximum. For the case when f and g are either both convex or both concave, \(h(\varvec{w})\) is in general neither a quasi-concave nor a quasi-convex function, and it may have multiple local optima that are different from the global optimum.
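A one-dimensional toy example (our own, for illustration) makes the difficulty concrete: with \(f(x)=x^2+1\) and \(g(x)=(x-1)^2+1\), both convex and positive, the ratio \(h=f/g\) has two distinct local maxima on \([-3,3]\), so a local solver may terminate at the strictly inferior boundary point.

```python
import numpy as np

f = lambda x: x**2 + 1.0                 # convex, positive
g = lambda x: (x - 1.0)**2 + 1.0         # convex, positive
h = lambda x: f(x) / g(x)                # ratio: neither quasi-concave nor quasi-convex

x = np.linspace(-3.0, 3.0, 600_001)
y = h(x)
# Interior critical points solve x^2 - x - 1 = 0: a local max at (1+sqrt(5))/2
# and a local min at (1-sqrt(5))/2; the boundary point x = -3 is a second,
# strictly worse local maximum.
print("global max :", x[np.argmax(y)], y.max())   # ~1.618, ~2.618
print("h(-3) local max:", h(-3.0))                # ~0.588
```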

3.1 Formulation of the portfolio kurtosis minimization problem

Using the notation for higher order portfolio moments introduced in Appendix A, the portfolio kurtosis as a function of the portfolio weights can be expressed as

$$\begin{aligned} \kappa _p(\varvec{w}) \,=\, \dfrac{\mathbb {E}\left( (\varvec{w}^{\top }(\varvec{r}-{\varvec{\mu }}))^4\right) }{\left( \mathbb {E}\left( (\varvec{w}^{\top }(\varvec{r}-{\varvec{\mu }}))^2\right) \right) ^2} \,=\, \dfrac{\varvec{w}^{\top }\mathbf {M}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})}{\left( \varvec{w}^{\top }\mathbf {M}_2\varvec{w}\right) ^2}, \end{aligned}$$
(9)

where \(\varvec{w}\in \mathbb {R}^{n+1}\) is the vector of relative portfolio weights, \(\varvec{r}\in \mathbb {R}^{n+1}\) denotes the vector of asset returns, \({\varvec{\mu }}=\mathbb {E}(\varvec{r})\), and \(\mathbf {M}_2\in \mathbb {R}^{(n+1) \times (n+1)}\) and \(\mathbf {M}_4\in \mathbb {R}^{(n+1) \times (n+1)^3}\) denote the covariance and fourth co-moment matrices of the asset returns, respectively. We assume that \(\mathbf {M}_2\) is positive definite and hence that \(\varvec{w}^{\top }\mathbf {M}_2\varvec{w}>0\) for all non-zero \(\varvec{w}\). Therefore, the ratio (9) is well defined. By application of Jensen’s inequality we also have that

$$\begin{aligned} \mathbb {E}\left( (\varvec{w}^{\top }(\varvec{r}-{\varvec{\mu }}))^4\right) \ge \left( \mathbb {E}\left( (\varvec{w}^{\top }(\varvec{r}-{\varvec{\mu }}))^2\right) \right) ^2 \quad \text {and thus}\quad \varvec{w}^{\top }\mathbf {M}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})>0 \text { for all non-zero }\varvec{w}. \end{aligned}$$
(10)

The convention for the majority of papers in the fractional programming literature is to formulate the fractional program as a maximization problem. Since \({{\,\mathrm{argmax}\,}}_{\varvec{w}} f(\varvec{w})/g(\varvec{w}) = {{\,\mathrm{argmin}\,}}_{\varvec{w}} g(\varvec{w})/f(\varvec{w})\), for \(f(\varvec{w})>0\) and \(g(\varvec{w})>0\), we formulate the portfolio kurtosis optimization problem as the following maximization problem, which we denote by (P)

$$\begin{aligned} (\text {P})&\quad \max \limits _{\varvec{w}\in {\mathcal {W}}}h(\varvec{w}), \qquad \text {where}\qquad h(\varvec{w})=\dfrac{f(\varvec{w})}{g(\varvec{w})}=\dfrac{\left( \varvec{w}^{\top }\mathbf {M}_2\varvec{w}\right) ^2}{\varvec{w}^{\top }\mathbf {M}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})} \end{aligned}$$
(11)

and \({\mathcal {W}}\) denotes the feasible set for the weights. Since we assume no short selling and a fully invested portfolio, the feasible set is given by \(\mathcal {W}=\{\varvec{w}\in \mathbb {R}^{n+1}\text { }\mid \sum _{i=0}^{n} w_i = 1, w_i \ge 0, i=0,\ldots ,n \}\). Letting \(\varvec{w}^*\) denote the optimal weights, the minimum kurtosis over the feasible set is then given by \(\kappa _p(\varvec{w}^*)=1\big / h(\varvec{w}^*)\). Since \(\mathbf {M}_2\) is positive definite, the numerator in (11) is a convex function. In Athayde and Flores [6] it is shown that the fourth central moment of the portfolio return is a convex function of the weights, and hence (11) is a ratio of two convex functions.
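Equation (9) translates directly into code. The following sketch (our own helper names, with illustrative data) estimates \(\mathbf {M}_2\) and \(\mathbf {M}_4\) from a return sample and evaluates the portfolio kurtosis via Kronecker products, cross-checking against a direct sample estimate:

```python
import numpy as np

def comoment_matrices(returns):
    """Sample covariance M2 (n x n) and fourth co-moment M4 (n x n^3),
    arranged so that w' M4 (w kron w kron w) is the fourth central moment."""
    x = returns - returns.mean(axis=0)           # demeaned returns, shape (T, n)
    T, n = x.shape
    M2 = x.T @ x / T
    # Row t of this product is the flattened outer product x_t kron x_t kron x_t.
    kron3 = np.einsum('ti,tj,tk->tijk', x, x, x).reshape(T, n**3)
    M4 = x.T @ kron3 / T
    return M2, M4

def portfolio_kurtosis(w, M2, M4):
    """Portfolio kurtosis from Eq. (9)."""
    w = np.asarray(w, dtype=float)
    m4 = w @ M4 @ np.kron(w, np.kron(w, w))
    m2 = w @ M2 @ w
    return m4 / m2**2

# Quick check against a direct sample estimate (illustrative data).
rng = np.random.default_rng(2)
R = rng.standard_t(df=6, size=(50_000, 3)) * 0.01
M2, M4 = comoment_matrices(R)
w = np.array([0.5, 0.3, 0.2])
p = (R - R.mean(axis=0)) @ w
print(portfolio_kurtosis(w, M2, M4), (p**4).mean() / (p**2).mean()**2)
```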

3.2 Branch and Bound algorithm for global minimization of portfolio kurtosis

Global optimization of ratios of convex functions is a very difficult optimization problem and has attracted attention in the optimization research community. In this section we present a Branch and Bound (BB) algorithm for global minimization of portfolio kurtosis. The basic idea of BB is to recursively subdivide the solution space geometrically into smaller and smaller subsets, until we can either compute the optimal solution over a subset or rule out that a subset contains the global optimum. A crucial component of the algorithm, and key to its efficiency, is the derivation of tight upper and lower bounds on the objective function value, both globally and locally for each subset. Examples of papers in the literature which develop BB algorithms for the special case of ratios of convex quadratic functions are Gotoh and Konno [39], Benson [13] and Yamamoto and Konno [78]. The first, and to the best of our knowledge only, paper which develops a BB algorithm for global optimization of a single ratio of general convex functions is Benson [14]. The generalized problem of optimizing a sum of ratios of convex functions has also attracted considerable attention in the literature. In Shen et al. [70] a BB algorithm for global optimization of the sum of ratios of convex functions over a convex set is developed, while Shen et al. [69] develop a BB algorithm for the case of optimizing the sum of ratios of convex functions when the feasible set is non-convex. Comprehensive treatments of BB algorithms for global optimization can be found in Horst and Tuy [43] and Floudas [32].

We apply the BB algorithm developed by Benson [14] to the problem of portfolio kurtosis minimization and improve the convergence rate by constructing considerably tighter bounds. In the following we first give an overview of the BB algorithm before we describe the steps of the procedure in more detail. As input to the algorithm, one chooses an error tolerance \(\rho \) which determines the maximum allowed relative distance between the output value of the algorithm and the global optimum. The output of the algorithm is a \(\rho \)-globally optimal solution:

Definition 5

(\(\rho \)-globally optimal solution) A solution \(\varvec{w}^k\in \mathcal {W}\) for problem (P) is called \(\rho \)-globally optimal, if \(h(\varvec{w}^k) \ge (1-\rho )h(\varvec{w}^*)\), where \(\rho \in [0,1)\) and \(\varvec{w}^*\) is an optimal solution for (P).

The basic idea of the BB algorithm is rather simple and consists of the following elements.

Branching process Consists of choosing a subset \(S \subseteq \mathcal {W}\) that is to be subdivided, and then applying a partitioning method for splitting this subset into two smaller subsets.

Upper bounding process Consists of solving a subproblem to obtain an upper bound UB(S) for the maximum of \(h(\varvec{w})\) over each subset \(S \subseteq \mathcal {W}\) created by the branching process. Moreover, the upper bound for each subset is used to update a global upper bound UB for the maximum of \(h(\varvec{w})\) over \(\mathcal {W}\).

Lower bounding process Consists of calculating a lower bound LB(S) for the maximum of \(h(\varvec{w})\) over each subset \(S \subseteq \mathcal {W}\) created by the branching process. Moreover, the lower bound for each subset is used to update the global lower bound LB for the maximum of \(h(\varvec{w})\) over \(\mathcal {W}\).

Fathoming process Deletes each subset \(S \subset \mathcal {W}\) in the partition which satisfies \((1-\rho )UB(S) \le LB\). The algorithm stops when all subsets have been fathomed, i.e., the partition is empty.

Unlike heuristic methods, BB algorithms terminate with the guarantee that the value of the best solution found is \(\rho \)-globally optimal. BB algorithms are, however, often slow, and in many cases they require computational effort that grows exponentially with the problem size. This is due to the fact that the size of the partition grows from iteration to iteration, unless we can fathom subsets. Fathoming subsets, however, depends on the quality of the lower and, especially, the upper bound for a subset. If the upper bound is loose, then a good feasible solution found early in the search may only be certified as near-optimal much later in the partitioning process. In other words, the main computational burden of the BB algorithm typically comes from proving global optimality of a feasible point found at an early stage. Thus, in order for the BB algorithm to be efficient, it is crucial to carefully model the functions used for producing upper bounds for each subset generated by the branching process, so as to be able to fathom subsets as quickly as possible. Compared to the BB algorithm in Benson [14], we develop two extensions which provide much tighter upper bounds and thereby considerably speed up the convergence of the algorithm. Next, we give a more detailed description of the BB algorithm applied to the problem of minimizing portfolio kurtosis.

3.2.1 Branching process

The branching process splits the feasible set into successively finer partitions. We denote by \(\mathcal {Q}_0=\{\mathcal {W}\}\) the initial partition and by \(\mathcal {Q}_k=\{S_i\}_{i \in \mathcal {I}_k}\) the partition in iteration k of the BB algorithm, where \(\mathcal {I}_k\) is a finite index set, \(\mathcal {W}= \bigcup _{i\in \mathcal {I}_k} S_i\), and \(int(S_i) \cap int(S_j) = \emptyset \), for \(i \ne j\). Note that, strictly speaking, once we start fathoming subsets, \(\mathcal {Q}_k\) will no longer form a partition of \(\mathcal {W}\). However, for ease of exposition, we will still call \(\mathcal {Q}_k\) a partition. At the beginning of step \(k\ge 1\), the partition \(\mathcal {Q}_{k-1}\) consists of subsets not yet deleted by the algorithm. To determine the subset of \(\mathcal {Q}_{k-1}\) to be partitioned, we follow the classical best-first rule, which selects the subset \(S^k \in \mathcal {Q}_{k-1}\) with the largest upper bound. The rationale for this rule is to pick a subset which is likely to contain a good feasible solution, which will, hopefully, allow for a quick increase in the global lower bound and thereby speed up the fathoming process. See Locatelli and Schoen [54] for other common rules.

First, we observe that our feasible set \(\mathcal {W}\) is identical to the standard n-simplex. In order to refine a partition \(\mathcal {Q}_{k-1}\), we follow Benson [14] and split the chosen subset \(S^k\) into two halves by simplicial bisection, which is a special case of radial subdivision introduced in Horst [42]:

Definition 6

(Radial subdivision) Let M be an n-simplex with vertex set \(\mathcal {V}(M)=\{\varvec{v}^0, \varvec{v}^1, \ldots , \varvec{v}^n \}\). Choose a point \(\varvec{m}\in M, \varvec{m}\notin \mathcal {V}(M)\) which is uniquely represented by

$$\begin{aligned} \varvec{m}=\sum \limits _{i=0}^n \lambda _i \varvec{v}^i, \text { }\lambda _i \ge 0 \text { }(i=0,\ldots ,n), \text { }\sum \limits _{i=0}^n \lambda _i =1, \end{aligned}$$

and for each i such that \(\lambda _i>0\) form the simplex \(M(i,\varvec{m})\) obtained from M by replacing the vertex \(\varvec{v}^i\) by \(\varvec{m}\), i.e., \(M(i,\varvec{m})=\{\varvec{v}^0, \ldots , \varvec{v}^{i-1}, \varvec{m}, \varvec{v}^{i+1}, \ldots , \varvec{v}^n\}\).

A simplicial bisection is obtained by choosing \(\varvec{m}\) as the midpoint of a longest edge of the simplex M, see Fig. 2 for an example. Horst and Tuy [43] prove that the set of subsets \(M(i,\varvec{m})\) that can be constructed from an n-simplex M by an arbitrary radial subdivision forms a partition of M into n-simplices. Hence, our subsets \(S_i\) are again n-simplices. Let \({\hat{\varvec{v}}}\) denote the midpoint of one of the longest edges of \(S^k\) and \(\varvec{v}^d\), \(\varvec{v}^e\) the corresponding endpoints of this edge. In the branching process, we replace \(S^k\) by the two n-simplices with vertex sets \(S_1^k= M(d,{\hat{\varvec{v}}})\) and \(S_2^k= M(e, {\hat{\varvec{v}}})\) using simplicial bisection to obtain a refined partition \(\mathcal {Q}_k = (\mathcal {Q}_{k-1}\setminus \{S^k\}) \cup \{S_1^k,S_2^k \}\).
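A sketch of this bisection step is given below (assuming vertices are passed as coordinate vectors; the helper name is ours):

```python
import numpy as np
from itertools import combinations

def bisect_simplex(vertices):
    """Split an n-simplex (given as n+1 vertices) into two by simplicial
    bisection: replace each endpoint of a longest edge, in turn, by the
    midpoint of that edge (Definition 6 with m the edge midpoint)."""
    V = [np.asarray(v, dtype=float) for v in vertices]
    # Find a longest edge (d, e).
    d, e = max(combinations(range(len(V)), 2),
               key=lambda ij: np.linalg.norm(V[ij[0]] - V[ij[1]]))
    mid = 0.5 * (V[d] + V[e])
    S1 = [mid if i == d else v for i, v in enumerate(V)]   # M(d, m)
    S2 = [mid if i == e else v for i, v in enumerate(V)]   # M(e, m)
    return S1, S2

# Example: bisect the standard 2-simplex {w >= 0, w0 + w1 + w2 = 1}.
S1, S2 = bisect_simplex(np.eye(3))
```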

Fig. 2: Examples of subdivision of a 2-simplex: radial subdivision (a) and simplicial bisection (b)

3.2.2 Upper bounding process

Let \(S \in \mathcal {Q}_k\) be an n-simplex of the partition with vertices \(\{\varvec{v}^0, \varvec{v}^1, \ldots , \varvec{v}^n \}\). Initially, we follow Benson [14] and overestimate the objective function \(h(\varvec{w})=f(\varvec{w})/g(\varvec{w})\) by the ratio of two affine functions: one that overestimates f and one that underestimates g. We will improve these bounding functions in Sect. 3.2.5 in order to obtain tighter upper bounds and thereby increase the speed of convergence. The function g in the denominator is underestimated by a first order Taylor expansion around the barycenter \({\hat{\varvec{v}}}=1/(n+1)\sum _{i=0}^n \varvec{v}^i\) of the simplex S according to

$$\begin{aligned} g_S(\varvec{w})=g({\hat{\varvec{v}}})+\nabla _{\varvec{w}}g({\hat{\varvec{v}}})(\varvec{w}-{\hat{\varvec{v}}})^{\top }\,. \end{aligned}$$
(12)

As g is a convex function, \(g_S(\varvec{w}) \le g(\varvec{w})\), \(\varvec{w}\in \mathbb {R}^{n+1}\), and, hence, \(g_S\) is an underestimator of g. The gradient of the fourth central moment of the portfolio return is given by (see Appendix A)

$$\begin{aligned} \nabla _{\varvec{w}}g(\varvec{w})=4\mathbf {M}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w}). \end{aligned}$$
(13)

In order to ensure that the approximation is positive, let

$$\begin{aligned} z(\varvec{w})\,=\,\max \{\alpha , g_S(\varvec{w}) \}, \qquad \text {where}\quad \alpha \,=\,\min \limits _{\varvec{w}\in \mathcal {W}} g(\varvec{w})\,. \end{aligned}$$
(14)

With g being convex, the minimization problem on the right-hand side can be solved efficiently.

In order to construct a linear overestimator of the function f in the numerator we need the following definition given in Horst and Tuy [43]:

Definition 7

(Concave envelope) The concave envelope of a function p taken over a nonempty subset M of its domain is the function \(p^M\) that satisfies:

(i) \(p^M\) is a concave function defined over the convex hull of M,

(ii) \(p^M(\varvec{x}) \ge p(\varvec{x})\) for all \(\varvec{x}\in M\), and

(iii) if q is a concave function defined over the convex hull of M that satisfies \(q(\varvec{x}) \ge p(\varvec{x})\) for all \(\varvec{x}\in M\), then \(q(\varvec{x}) \ge p^M(\varvec{x})\) for all \(\varvec{x}\) in the convex hull of M.

Horst [42] shows that when M is an n-simplex and p is a convex function on M, then \(p^M\) is the unique affine function that coincides with p at the vertices of M. Denoting by \(f^S(\varvec{w})\) the concave envelope of f over S, we construct the following upper bound for the maximum of h over S

$$\begin{aligned} UB(S)=\max \limits _{\varvec{w}\in S} \dfrac{f^S(\varvec{w})}{z(\varvec{w})}. \end{aligned}$$
(15)

Since \(z(\varvec{w})\ge \alpha > 0\), \(\varvec{w}\in \mathbb {R}^{n+1}\), and \(f^S(\varvec{w})\ge f(\varvec{w}) > 0\), \(\varvec{w}\in S\), UB(S) is equal to the optimal value of the following problem:

$$\begin{aligned} \text {(P1}\mathrm{(}S\mathrm{))} \quad \max \limits _{t,\varvec{w}\in S} \quad&\dfrac{f^S(\varvec{w})}{t} \nonumber \\ \text {s.t.} \quad&t\ge \alpha , \end{aligned}$$
(16)
$$\begin{aligned}&t-g_S(\varvec{w})\ge 0. \end{aligned}$$
(17)

As \(S \subseteq \mathcal {W}\) is compact and the objective function is continuous, (P1(S)) has an optimal solution. Moreover, as the ratio of two linear functions is quasi-concave, every local optimum over the closed convex set is also a global optimum. Thus, the fractional program can be solved to global optimality with any local solver. However, as P1(S) has to be solved many times during the BB algorithm, we follow Benson [14] and reformulate the problem as follows. Each \(\varvec{w}\in S\) can be written as

$$\begin{aligned} \varvec{w}\,=\, \sum _{i=0}^n \lambda _i \varvec{v}^i, \quad \text {where}\quad \lambda _i \ge 0, i=0,1,\ldots ,n, \text { and }\sum _{i=0}^n \lambda _i =1, \end{aligned}$$
(18)

(see [43]). As \(f^S(\varvec{w})\) is an affine function, we then get \(f^S(\varvec{w}) \,=\, \sum _{i=0}^n \lambda _i f(\varvec{v}^i)\). Substituting \(f^S(\varvec{w})\) and adding the conditions for \(\varvec{w}\) gives the equivalent fractional program

$$\begin{aligned} \text {(P2}\mathrm{(}S\mathrm{))} \quad \max \limits _{t,\lambda ,\varvec{w}} \quad&\dfrac{1}{t}\sum _{i=0}^n \lambda _i f(\varvec{v}^i) \nonumber \\ \text {s.t.} \quad&(16), (17), \nonumber \\&\varvec{w}= \sum _{i=0}^n \lambda _i \varvec{v}^i, \end{aligned}$$
(19)
$$\begin{aligned}&\sum _{i=0}^n \lambda _i =1, \end{aligned}$$
(20)
$$\begin{aligned}&\lambda _i \ge 0, \quad i=0,1,\ldots ,n. \end{aligned}$$
(21)

To linearize the objective function, we apply the Charnes–Cooper transformation [17], performing the following change of variables

$$\begin{aligned} u=\dfrac{1}{t}, \quad b_i=\dfrac{\lambda _i}{t}, \quad \varvec{y}=\dfrac{\varvec{w}}{t}, \end{aligned}$$
(22)

where \(\varvec{y}=[y_0, \ldots , y_n]^{\top }\) and \(\varvec{b}=[b_0, \ldots , b_n]^{\top }\), which results in the equivalent problem

$$\begin{aligned} \text {(P3'(}S\mathrm{))} \quad \max \limits _{u,\varvec{b},\varvec{y}} \quad&\sum \limits _{i=0}^n b_i f(\varvec{v}^i) \nonumber \\ \text {s.t.} \quad&u\le 1/\alpha , \end{aligned}$$
(23)
$$\begin{aligned}&u \cdot g_{S}(\varvec{y}/u)\le 1, \end{aligned}$$
(24)
$$\begin{aligned}&\varvec{y}=\sum \limits _{i=0}^n b_i\varvec{v}^i, \end{aligned}$$
(25)
$$\begin{aligned}&\sum \limits _{i=0}^n b_i-u=0, \end{aligned}$$
(26)
$$\begin{aligned}&b_i \ge 0, \text { }i=0,\ldots ,n, \nonumber \\&u > 0. \end{aligned}$$
(27)

Since \(u\cdot g_{S}(\varvec{y}/u)\) is an affine function, (P3’(S)) is a linear program, except for the domain constraint on u. However, Avriel et al. [7] showed that when a solution to (P2(S)) exists, then the strict inequality can be replaced by \(u\ge 0\), and we obtain the linear program

$$\begin{aligned} \text {(P3(}S\mathrm{))} \quad \max \limits _{u,\varvec{b},\varvec{y}} \quad&\sum \limits _{i=0}^n b_i f(\varvec{v}^i) \nonumber \\ \text {s.t.} \quad&(23)-(27), \nonumber \\&u \ge 0. \end{aligned}$$
(28)

This formulation can now be solved very efficiently using any linear programming solver.

Finally, the upper bound for \(S_l^k\), \(l=1,2\), is now computed as \(\min \{UB(S_l^k), UB(S^k)\}\). Moreover, for each iteration \(k\ge 0\), the upper bounding process also computes an upper bound \(UB_k\) for the global optimal value \(h(\varvec{w}^*)\) of the original problem (P) based on the partition \(\mathcal {Q}_k\):

$$\begin{aligned} UB_k=\max \limits _{S \in \mathcal {Q}_k} UB(S)\,. \end{aligned}$$
(29)

By construction, the upper bound is monotonically decreasing in k, i.e., \(UB_{k+1} \le UB_k\), \(k \ge 0\).

3.2.3 Lower bounding process

Denoting by \(\varvec{w}^k\) the best solution of the problems (P1(S)) encountered up to iteration k, the lower bound \(LB_k\) for the global optimal value \(h(\varvec{w}^*)\) in iteration k is given by \(LB_k=h(\varvec{w}^k)\). The bounds are monotonically increasing in k: \(LB_{k+1}\ge LB_k\), \(k\ge 0\).

3.2.4 Fathoming process

Based on the lower and upper bounds produced by the algorithm, the fathoming process deletes all subsets \(S \in \mathcal {Q}_{k-1}\) from \(\mathcal {Q}_{k-1}\) that are guaranteed not to contain the global optimal solution. At the beginning of each iteration, all \(S \in \mathcal {Q}_{k-1}\) are removed for which \((1-\rho )UB(S) \le LB_{k-1}\). If this results in \(\mathcal {Q}_{k-1}\) being empty, then

$$\begin{aligned} h(\varvec{w}^{k-1}) \,\ge \, (1-\rho )UB_{k-1} \,\ge \, (1-\rho )\max _{\varvec{w}\in \mathcal {W}} h(\varvec{w}) \,=\, (1-\rho )h(\varvec{w}^*), \end{aligned}$$
(30)

which means that \(\varvec{w}^{k-1}\) is a \(\rho \)-globally optimal solution to problem (P). Benson [14] shows that if the BB algorithm does not terminate after a finite number of iterations, it generates two sequences of points whose accumulation points are global optimal solutions \(\varvec{w}^*\) for (P), and

$$\begin{aligned} \lim _{k \rightarrow \infty } LB_k=\lim _{k \rightarrow \infty } UB_k=h(\varvec{w}^*). \end{aligned}$$
(31)

This result implies that whenever \(\rho >0\), the BB algorithm is finite.

The complete BB algorithm is summarized below.

BB algorithm

Input: \(\rho \in [0,1)\), n-simplex \(\mathcal {W}\), functions \(f(\cdot )\) and \(g(\cdot )\).

Output: \(\rho \)-globally optimal solution \({\tilde{\varvec{w}}}\).

Initialization Set \(S^0= \mathcal {W}\) and \(\mathcal {Q}^0=\{S^0\}\). Calculate \(UB_0=UB(S^0)\) and an optimal solution \((\varvec{w}^{0},t_{0})\) for P1(\(S^0\)). Set \(LB_0=h(\varvec{w}^0)\).

If \((1-\rho )UB_0\le LB_0\), then stop; \({\tilde{\varvec{w}}}=\varvec{w}^0\) is \(\rho \)-globally optimal for (P).

Step \({\varvec{k}}\) (\(k=1,2, \ldots \))

   k.1    Delete each n-simplex \(S \in \mathcal {Q}_{k-1}\) from \(\mathcal {Q}_{k-1}\) for which \((1-\rho )UB(S) \le LB_{k-1}\).

   k.2    If \({\mathcal {Q}}_{k-1}=\emptyset \), then stop: \({\tilde{\varvec{w}}}=\varvec{w}^{k-1}\) is \(\rho \)-globally optimal for problem (P).

   k.3    Let \(UB_k=\max \{UB(S) \mid S\in \mathcal {Q}_{k-1} \}\) and choose an n-simplex \(S^k \in \mathcal {Q}_{k-1}\) such that \(UB_k=UB(S^k)\). Subdivide \(S^k\) into two n-simplices \(S_1^k,S_2^k\) via simplicial bisection.

   k.4    For \(S=S_1^k, S_2^k\), find the optimal value UB(S) and an optimal solution \((\varvec{w}^S,t_S)\) for P1(S), and set \(UB(S)=\min \{UB(S), UB(S^k)\}\).

   k.5    Set \(LB_{k}=\max \{LB_{k-1}, h(\varvec{w}^{S_1^k}), h(\varvec{w}^{S_2^k})\}\) and let \(\varvec{w}^{k}\) satisfy \(LB_{k}=h(\varvec{w}^{k})\). Set \(\mathcal {Q}_k=\left({\mathcal {Q}}_{k-1} \setminus \{ S^k\}\right) \cup \{S_1^k,S_2^k \}\) and \(k=k+1\).

Remark

For the case with additional constraints, such as position limits, the feasible set is no longer given by the standard n-simplex. The extension of the algorithm to this more general case is however straightforward (see [14] for details).
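For concreteness, the complete loop can be sketched in a few lines of Python. The bound oracle upper_bound(S), which would solve the LP (P3(S)), and the objective h are assumed to be supplied; simplicial bisection is implemented by splitting the longest edge of the simplex.

import numpy as np

def bisect_longest_edge(S):
    """Split the n-simplex S (rows are vertices) at the midpoint of its longest edge."""
    m = S.shape[0]
    i, j = max(((a, b) for a in range(m) for b in range(a + 1, m)),
               key=lambda e: np.linalg.norm(S[e[0]] - S[e[1]]))
    S1, S2 = S.copy(), S.copy()
    S1[i] = S2[j] = 0.5 * (S[i] + S[j])
    return S1, S2

def branch_and_bound(S0, h, upper_bound, rho=1e-3, max_iter=10**6):
    """Schematic BB loop; upper_bound(S) returns (UB(S), w_S)."""
    ub0, w = upper_bound(S0)
    best, LB = w, h(w)
    queue = [(ub0, S0)]
    for _ in range(max_iter):
        # Fathoming: drop simplices that cannot contain a rho-optimal point.
        queue = [(ub, S) for ub, S in queue if (1 - rho) * ub > LB]
        if not queue:
            return best, LB                       # rho-globally optimal
        # Branch on the simplex with the largest upper bound.
        k = max(range(len(queue)), key=lambda i: queue[i][0])
        UBk, Sk = queue.pop(k)
        for Si in bisect_longest_edge(Sk):
            ub, w = upper_bound(Si)
            ub = min(ub, UBk)                     # the parent bound is inherited
            if h(w) > LB:
                best, LB = w, h(w)
            queue.append((ub, Si))
    return best, LB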

3.2.5 Improving the upper bound

Preliminary computational tests showed that the BB algorithm spends the vast majority of its computing time calculating the upper bound UB(S) over the n-simplex S. Moreover, while it often took only a few iterations to obtain a very good lower bound \(LB_k\) on the optimal value \(h(\varvec{w}^*)\), the upper bound improved only very slowly. In order to achieve faster convergence of the BB algorithm, we present in the following two extensions of the algorithm of Benson [14] that lead to a much faster reduction of the global upper bound \(UB_k\). In the first, the lower bound of the function g in the denominator is improved by adding affine functions to the approximation. In the second, the upper bound of the function f in the numerator is tightened by using a generalization of the concave envelope. This generalization requires the introduction of binary variables, which means that the improved upper bound comes at the cost of having to solve a more difficult combinatorial optimization problem.

To tighten the lower bound for the function g, we extend the linearization technique in (14) by adding first order Taylor expansions of g around p additional points \(R_j\) in S. We then define

$$\begin{aligned} {\tilde{z}}(\varvec{w})\,=\, \max (\alpha ,g_S(\varvec{w}),g_{R_1}(\varvec{w}),\ldots ,g_{R_p}(\varvec{w})), \end{aligned}$$
(32)

where \(g_{R_j}(\varvec{w})=g({\varvec{R}}_j)+\nabla _{\varvec{w}}g({\varvec{R}}_j)^{\top } (\varvec{w}-{\varvec{R}}_j)\), \(j=1, \ldots ,p\). The idea of the improvement is illustrated in Fig. 3 for the case when S is a 1-simplex and \(p=2\).
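As a small illustration of (32), the sketch below assembles the pointwise maximum of first order Taylor expansions of a convex function; the quadratic used here is an arbitrary stand-in for g, not the denominator of the kurtosis objective.

import numpy as np

def tangent_lower_bound(g, grad_g, points, alpha):
    """Return the function w -> max(alpha, g_R1(w), ..., g_Rp(w)) as in (32),
    where each tangent plane g_R(w) = g(R) + grad_g(R)^T (w - R)
    underestimates the convex function g."""
    def z_tilde(w):
        return max([alpha] + [g(R) + grad_g(R) @ (w - R) for R in points])
    return z_tilde

# Stand-in convex function g(w) = ||w||^2 with tangents at two points
g = lambda w: w @ w
grad_g = lambda w: 2.0 * w
z = tangent_lower_bound(g, grad_g,
                        [np.array([0.2, 0.8]), np.array([0.6, 0.4])], alpha=0.0)
w = np.array([0.5, 0.5])
assert z(w) <= g(w)    # tangent planes never exceed a convex function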

Fig. 3  Improvement of the lower bound for g when S is a 1-simplex: original lower bound (a) and improved lower bound for \(p=2\) (b)

For the general case of an n-simplex S, the locations of the points \(\{R_j\}_{j=1}^p\) are chosen so that they are evenly distributed in S, see Sect. 3.3 for more details. The resulting problem is then given as

$$\begin{aligned} \text {(P4(}S\mathrm{))} \quad \max \limits _{u,\varvec{b},\varvec{y}} \quad&\sum \limits _{i=0}^n b_i f(\varvec{v}^i) \nonumber \\ \text {s.t.} \quad&(23)-(27), \nonumber \\&u \ge 0, \nonumber \\&u\cdot g_{R_j}(\varvec{y}/u) \le 1, \; j=1,\ldots ,p. \end{aligned}$$
(33)

Obviously, the accuracy of the approximation increases with p, at the expense of adding more linear constraints to the optimization problem.

Next, we turn our attention to improving the accuracy of the approximation of the numerator of the objective function. We start by subdividing the n-simplex S by radial subdivision according to Definition 6. Let the set of n-simplices created by the radial subdivision be given by \({\mathcal {T}}=\{ S_j\}_{j\in \mathcal {J}}\) and the corresponding set of all vertices by \({\mathcal {V}}({\mathcal {T}})=\{\varvec{v}^i \}_{i\in \mathcal {I}}\). The improved upper bound is then constructed by combining the concave envelopes over the n-simplices in \(\mathcal {T}\). The construction is most easily illustrated by the simplest possible example in one dimension, given in Fig. 4b. For this example, the set of n-simplices and the corresponding vertices after the radial subdivision are given by \(\mathcal {T}=\{S_1,S_2\}\) and \(\mathcal {V}(\mathcal {T})=\{\varvec{v}^1,\varvec{v}^2,\varvec{v}^3 \}\), respectively. The generalized concave envelope over S is constructed from the concave envelopes over \(S_1\) and \(S_2\).

Fig. 4  Improvement of the upper bound for f when S is a 1-simplex: original upper bound (a) and improved upper bound (b)

When calculating the concave envelope, we have to introduce binary variables \(q_j\), \(j \in \mathcal {J}\), in order to keep track of which n-simplex in \(\mathcal {T}\) is active. The function representing the generalized concave envelope over the n-simplex S can now be formulated as

$$\begin{aligned}&\sum \limits _{i\in \mathcal {I}} \lambda _i f(\varvec{v}^i) \nonumber \\&(20), (21), \nonumber \\&\sum \limits _{j\in \mathcal {J}}q_j=1, \end{aligned}$$
(34)
$$\begin{aligned}&\lambda _i \le \sum \limits _{j:\text { }\varvec{v}^i \in S_j}q_j, \text { }i \in \mathcal {I}, \end{aligned}$$
(35)
$$\begin{aligned}&q_j \in \{0,1\},\text { }j \in \mathcal {J}. \end{aligned}$$
(36)

Condition (35) ensures that only those \(\lambda _i\) belonging to vertices of the active n-simplex, i.e. the one for which \(q_j=1\), can be non-zero. Using the improved approximation function, we obtain the optimization problem

$$\begin{aligned} \begin{aligned} \text {(P5(}S\mathrm{))} \quad&\max \limits _{t,\varvec{\lambda },\varvec{q},\varvec{w}}&\dfrac{1}{t}\sum _{i\in \mathcal {I}} \lambda _i f(\varvec{v}^i) \\&\text {s.t.}&(16), (17), (19){-}(21), (34){-}(36). \end{aligned} \end{aligned}$$

As before, we transform this problem into a mixed-integer linear program (MILP) via the Charnes–Cooper transformation, using the variable substitutions in (22). The last set of constraints is transformed into

$$\begin{aligned} b_i -\sum \limits _{j:\text { }\varvec{v}^i \in S_j}q_ju\le 0,\text { }i \in \mathcal {I}, \end{aligned}$$
(37)

in the new variables. The product of variables is linearized by introducing the continuous variables \(z_j=q_j u\), \(j \in \mathcal {J}\), and adding the following constraints for each \(j \in \mathcal {J}\):

$$\begin{aligned}&z_j \le \dfrac{q_j}{\alpha }, \quad z_j \le u, \quad z_j \ge u-(1-q_j)/\alpha , \quad \text {and}\quad z_j \ge 0. \end{aligned}$$
(38)

If \(q_j=0\), then the first and last constraints ensure that \(z_j=0\), while the third merely requires \(z_j\) to be greater than a negative number. If \(q_j=1\), then the first constraint enforces \(z_j \le 1/\alpha \), and the second and third ensure that \(z_j=u\). In summary, we obtain the following equivalent MILP, which we denote by (P6(S))

$$\begin{aligned} \begin{aligned} \text {(P6(}S\mathrm{))}\quad&\max \limits _{u,\varvec{b},\varvec{z},\varvec{q},\varvec{y}}&\sum \limits _{i\in \mathcal {I}} b_i f(\varvec{v}^i)\\&\text {s.t.}&(23){-}(27), (34), (36), (37), (38),\\&u \ge 0. \end{aligned} \end{aligned}$$

This problem can easily be enhanced by adding the linear constraints from (33) in order to obtain a better approximation of the function g in the denominator. The hope is that the improved upper bound in (P6(S)) leads to a sufficiently fast decrease of the global upper bound \(UB_k\) to compensate for the increased computational time induced by solving an MILP instead of an LP in each instance of the upper bounding process.
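The effect of the constraints in (38) can also be checked numerically: for \(q_j \in \{0,1\}\) and any \(u \in [0,1/\alpha ]\), the feasible interval for \(z_j\) collapses to the single point \(q_j u\). A minimal check, with an arbitrary choice of \(\alpha \):

import numpy as np

alpha = 0.5
for q in (0, 1):
    for u in np.linspace(0.0, 1.0 / alpha, 11):
        # Feasible interval for z implied by the four constraints in (38)
        lo = max(0.0, u - (1 - q) / alpha)
        hi = min(q / alpha, u)
        assert abs(hi - lo) < 1e-12 and abs(lo - q * u) < 1e-12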

3.3 Numerical implementation

In this section we demonstrate the convergence properties of the BB algorithm when applied to the problem of minimizing the portfolio kurtosis for an increasing number of assets. As a sample problem we assume that all assets have identical marginal distributions and that all pairwise correlations between different assets are equal. This problem instance represents a non-convex problem with multiple local optima. When the subproblems for the BB algorithm are given by the MILP (P6(S)), the description in Sect. 3.2 needs to be extended in order to produce an efficient algorithm. Numerical experiments show that radial subdivision does not improve the upper bound of f sufficiently to produce an efficient algorithm. In order to further improve the upper bound of f, the n-simplex S is instead subdivided with barycentric subdivision. Roughly speaking, the barycentric subdivision of an n-simplex S is obtained by radial subdivision of all k-faces of dimension \(1\le k \le n\) in decreasing order of dimension. A partial barycentric subdivision of S is also possible, obtained by restricting the radial subdivision to all k-faces of dimension \(l \le k \le n\), with \(l>1\), in decreasing order of dimension (see [2] for a detailed description of barycentric subdivision). The partial barycentric subdivision of a 2-simplex with \(l=2\) corresponds to radial subdivision, as illustrated in Fig. 5a. The full barycentric subdivision of the 2-simplex is illustrated in Fig. 5b. Numerical experimentation reveals that full barycentric subdivision is required in order to produce a sufficiently improved upper bound of f for the MILP formulation. Unfortunately, this means that \((n+1)!\) binary variables need to be introduced when solving the subproblems with the MILP formulation. The formulation of the optimization problem (P6(S)) does not change when subdividing the n-simplex with barycentric instead of radial subdivision; the only things that change are the set of n-simplices created by the subdivision and the corresponding set of vertices.

Fig. 5  Examples of subdivision of a 2-simplex: radial subdivision (a) and barycentric subdivision (b)

In the following we investigate the improvements obtained in terms of iteration count and runtime when using the enhanced LP formulation and the MILP formulation for solving the subproblems of the algorithm. The BB algorithm was implemented in MATLAB. For all LP formulations of the subproblems we use the solver CPLEX, whereas the subproblems arising from the MILP formulation are solved with the built-in solver intlinprog in MATLAB. For all comparisons we set the parameter \(\rho \) to \(10^{-3}\). When using the enhanced LP formulation (P4(S)), a choice has to be made regarding how many extra constraints p are added to the problem. The p points defining the added constraints are distributed evenly over the subsimplex S for which the subproblem is solved. Letting \(n_c\) denote the number of added constraints per asset, one has that \(p=(n+1)n_c\). We choose to distribute the p points evenly between the vertices \(\{\varvec{v}^i \}_{i=0}^n\) and the barycenter \({\hat{\varvec{v}}}\) of S. Thus, for \(n_c=1\) the p points are defined by the vertices. For \(n_c\ge 2\), the p points are defined by the vertices and the \((n+1)(n_c-1)\) points

$$\begin{aligned} \dfrac{j}{n_c}\varvec{v}^i+\left( 1-\dfrac{j}{n_c}\right) {\hat{\varvec{v}}}, \text { }i=0,\ldots ,n; \text { }j=1,\ldots , n_c-1. \end{aligned}$$
(39)
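A short sketch of the construction in (39), where the vertex matrix V is assumed to hold the vertices \(\varvec{v}^0,\ldots ,\varvec{v}^n\) of S as rows:

import numpy as np

def tangent_points(V, n_c):
    """Return the p = (n+1)*n_c points of (39): the vertices of S plus
    evenly spaced points on the segments from each vertex to the barycenter."""
    v_hat = V.mean(axis=0)                       # barycenter of S
    points = list(V)                             # n_c = 1: just the vertices
    for j in range(1, n_c):
        points += [(j / n_c) * v + (1.0 - j / n_c) * v_hat for v in V]
    return np.array(points)

# Example: the standard 2-simplex with n_c = 2 gives p = 6 points
P = tangent_points(np.eye(3), 2)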

We also experimented with distributing the points evenly between the vertices, but that did not bring any noticeable improvement in terms of iteration count. Naturally, adding more constraints in order to obtain a tighter lower bound should decrease the iteration count for the BB algorithm, at the cost of increasing the runtime for each of the subproblems that are solved. This trade-off is investigated next. In the following, we denote the enhanced LP formulation (P4(S)) by LP2 and the LP formulation (P3(S)) as used in Benson [14] by LP1.

Fig. 6  a Evolution of the global lower and upper bounds of the portfolio kurtosis for the original (LP1) and enhanced LP model (LP2) with \(n_c=2\) for the three asset problem. b The fraction of deleted simplices for the original and enhanced LP model for the three asset problem

Figure 6a displays the evolution of the global lower and upper bounds of the portfolio kurtosis over the iterations of the BB algorithm applied to the three asset problem. Note that these are the inverses of the global lower and upper bounds calculated by the BB algorithm, since \(\kappa _p(\varvec{w})=g(\varvec{w})/f(\varvec{w})\). Simulated return data is used to calculate the moment matrices in the objective function (11). The sample moment matrices \(\hat{{\mathbf {M}}}_2\) and \(\hat{{\mathbf {M}}}_4\) are calculated from \(10^7\) simulated asset returns with NIG-distributed margins and dependence structure given by the Gaussian copula with a homogeneous correlation matrix. The problem instance is defined by the homogeneous correlation \(\rho =-0.2\) and marginal kurtosis \(\kappa _m=6\) for all the assets. Appendix B contains a description of the simulation procedure. Figure 6b shows the fraction of deleted simplices over the iterations of the algorithm for the three asset problem. Note that the fraction of deleted simplices decreases if the number of deleted simplices is less than the number of simplices added by the subdivision procedure. From the graphs it is visible that the enhanced LP formulation LP2 with \(n_c=2\) converges faster to the global optimum in terms of number of iterations than the original LP formulation LP1. As can be seen in Table 1, the number of iterations decreases with the number of extra constraints p added to the problem. However, as displayed in Table 2, the decrease in the number of iterations is not large enough to compensate for the increased runtime associated with the larger number of constraints. Thus, the enhanced LP formulation with \(n_c=1\) has a lower runtime than the formulations with \(n_c>1\). For the three asset problem, the runtime of LP2 with \(n_c=1\) is the same as that of the original LP formulation LP1.

Fig. 7  a Evolution of the global lower and upper bounds of the portfolio kurtosis for the original (LP1) and enhanced LP model (LP2) with \(n_c=2\) for the five asset problem. b The fraction of deleted simplices for the original and enhanced LP model for the five asset problem

Figure 7a, b displays the evolution of the global lower and upper bounds of the portfolio kurtosis and the fraction of deleted simplices over the iterations of the BB algorithm applied to the five asset problem. The solid and dotted lines represent the evolution of the bounds and the fraction of deleted simplices for LP2 with \(n_c=2\) and LP1, respectively. The graphs reveal a significant decrease in iteration count when using the enhanced LP formulation LP2 compared to LP1. As in the three asset case, the iteration count for LP2 decreases with the number of added constraints p. However, as can be seen in Table 2, the decrease in iteration count does not compensate for the added computational cost, and hence LP2 with \(n_c=1\) has the lowest runtime. LP2 with \(n_c=1\) also has a lower runtime than LP1 for the five asset case: 175 seconds compared to 193 seconds.

Table 1 Number of iterations for the BB algorithm for different portfolio sizes and solution methods for the subproblems.
Table 2 Runtime for the BB algorithm for different portfolio sizes and solution methods for the subproblems.

We now investigate the performance of the MILP formulation against the LP formulation with the lowest runtime for the five asset problem. Figure 8a, b shows the evolution of the global lower and upper bounds of the portfolio kurtosis and the fraction of deleted simplices for the two cases. The solid and dotted lines represent the evolution of the bounds and the fraction of deleted simplices for the MILP formulation and LP2 with \(n_c=1\), respectively. From Fig. 8, one observes that the MILP formulation improves the global lower bound of the kurtosis, corresponding to the global upper bound for the BB algorithm, much faster than the LP formulation up to around iteration 5000. After that, the global lower bound improves more slowly for the MILP formulation than for the LP formulation. The overall iteration count is lower for the MILP formulation than for LP2. However, the improvement in iteration count does not compensate for the increased computational cost associated with solving an MILP instead of an LP, as can be seen in Table 2. The runtime for the MILP formulation can, however, likely be reduced by using a state-of-the-art solver instead of the built-in solver in MATLAB.

For the six asset problem with homogeneous correlation \(\rho =-0.18\), the number of iterations is 620,000 for the best performing solution method for the subproblems, LP2 with \(n_c=1\). The corresponding runtime for the six asset problem is 49,680 seconds, illustrating the exponential growth in computational effort when the BB algorithm is applied to the portfolio kurtosis minimization problem. The BB algorithm could be enhanced by developing special purpose solvers for the subproblems. Furthermore, the algorithm can be parallelized in order to further reduce the runtime. Moreover, the MILP formulation and LP2 could be combined, starting with the former to quickly tighten the upper bound and then switching to LP2 to save runtime. It is, however, unlikely that any of these will admit solving problems with a significantly larger number of assets than six.

Fig. 8  a Evolution of the global lower and upper bounds of the portfolio kurtosis for the enhanced LP model with \(n_c=1\) and the MILP model for the five asset problem. b The fraction of deleted simplices for the enhanced LP model and the MILP model for the five asset problem

4 Stochastic global optimization

In Sect. 3.2 we developed a deterministic global optimization algorithm for minimizing the inverse of the introduced portfolio diversification measures. However, as is well known and illustrated by the numerical examples in Sect. 3.3, the BB algorithm suffers from the curse of dimensionality and converges too slowly for problems where the number of assets exceeds six. In this section we develop a stochastic optimization algorithm for global optimization of portfolio kurtosis. The BB algorithm has the desirable property that the objective function value at the obtained solution is guaranteed to be arbitrarily close to the global minimum. For the algorithm developed in this section it is not possible to determine whether the solution is a global optimum. However, the algorithm is a special case of stochastic approximation, with a rich and well developed theory for convergence analysis. Since the BB algorithm is limited to problems of moderate size, the algorithm developed in this section complements the BB algorithm in the sense that it allows problems of larger size to be tackled.

4.1 Stochastic algorithms for global optimization—a very brief overview

There is a huge literature on global optimization algorithms, so-called metaheuristic methods, for which it is not possible to guarantee that the obtained solution is a global optimum. These methods iteratively search the feasible set for the global optimum, and without prior knowledge there is always the possibility that the optimal point lies in an unexplored region when the algorithm stops. Important examples of metaheuristic methods are genetic algorithms [41], simulated annealing [48] and tabu search [37]. The interested reader may consult Gendreau and Potvin [36] for an overview of metaheuristic methods. Even though it is not possible to guarantee that a metaheuristic method has found a global optimal point, algorithms based on stochastic approximation have a solid theoretical foundation, and in many cases non-asymptotic convergence results are available with explicit constants, see Dalalyan [26], Durmus and Moulines [28] and Durmus and Moulines [29]. This can be contrasted with many other popular metaheuristic methods, where the theory is often incomplete or even nonexistent, see Spall [72]. A strong aspect of stochastic approximation is the rich convergence theory that has been developed over many years. It has been used to show convergence of many stochastic algorithms, such as neural network backpropagation and simulated annealing. For rigorous examples where stochastic approximation methods are applied to problems in finance, see Laruelle and Pagès [52] and Sabanis and Zhang [65]. The latter, more recent, result offers theoretical guarantees for the discovery of near-optimal solutions of a non-convex optimization problem, namely the optimal allocation of weights for the (unconstrained, via a suitable transformation) minimization of CVaR/Expected Shortfall of a portfolio of assets.

In stochastic approximation one is concerned with finding at least one root \({\varvec{\theta }}^* \in {\varvec{\Theta }}^*\subseteq \mathbb {R}^d\) to \(G({\varvec{\theta }})=0\), based on noisy measurements of \(G({\varvec{\theta }})\). Root finding via stochastic approximation was introduced in Robbins and Monro [62] and important generalizations were made in Kiefer and Wolfowitz [47]. Consider the unconstrained minimization problem

$$\begin{aligned} \min \limits _{{\varvec{\theta }}} L({\varvec{\theta }}), \end{aligned}$$
(40)

where L is a smooth function, which has multiple local minima. For the special case when \(G({\varvec{\theta }})\) is given by \(G({\varvec{\theta }})=\nabla _{{\varvec{\theta }}} L({\varvec{\theta }})\), the stochastic approximation algorithm is given by the following stochastic gradient descent (SGD) algorithm

$$\begin{aligned} {\varvec{\theta }}_{k+1}={\varvec{\theta }}_k-a_k H({\varvec{\theta }}_k,\varvec{X}_{k+1}), \end{aligned}$$
(41)

where \(\{\varvec{X}_k\}_{k\in {\mathbb {Z}}}\) is a sequence of \(\mathbb {R}^m\)-valued i.i.d. data and \(H({\varvec{\theta }}_k,\varvec{X}_{k+1})\) is an unbiased estimate of the gradient, i.e. \(\nabla _{{\varvec{\theta }}} L({\varvec{\theta }})=\mathbb {E}\left( H({\varvec{\theta }},\varvec{X}_{k+1})\right) \). In (41), \(\{a_k\}\) can either be a decreasing positive sequence satisfying appropriate conditions or a fixed small positive value \(a_k=\lambda >0\), for all \(k\ge 0\).

In many estimation problems, a full set of data is collected and G (or L) is chosen by conditioning on the data. This conditioning removes the randomness from the problem and the estimation problem becomes deterministic. In the machine learning literature this is commonly referred to as the batch gradient descent algorithm, which is given by \({\varvec{\theta }}_{k+1}={\varvec{\theta }}_k-a_k{\bar{H}}({\varvec{\theta }}_k)\), where \(\{\varvec{X}_k\}_{k=1}^N\) is the collected data and

$$\begin{aligned} {\bar{H}}({\varvec{\theta }})=\dfrac{1}{N}\sum \limits _{k=1}^NH({\varvec{\theta }},\varvec{X}_k). \end{aligned}$$
(42)

Since L has multiple local minima, applying SGD to (40) may yield convergence to a local minimum of L. Under broad conditions, Kushner and Yin [50] show that (41) converges to one of the local minima of L with probability 1. However, the iterates will often be trapped at a local optimum and will miss the global one. Nevertheless, SGD, or one of its various extensions, is commonly used in machine learning for optimization of Deep Neural Networks, see Goodfellow et al. [38]. When L has a unique minimum, Chau et al. [18] provide convergence results for the case with dependent data, discontinuous L, and fixed step size.

The idea behind simulated annealing is that by adding an additional noise term to the iterations one can avoid getting prematurely trapped in a local minimum of L. In Gelfand and Mitter [35], the following modified SGD algorithm is analyzed

$$\begin{aligned} {\varvec{\theta }}_{k+1}={\varvec{\theta }}_k-a_k H({\varvec{\theta }}_k,\varvec{X}_{k+1})+b_k {\varvec{\epsilon }}_{k+1}, \end{aligned}$$
(43)

where \(\{{\varvec{\epsilon }}_k \}\) is a sequence of standard d-dimensional independent Gaussian random variables, and \(\{a_k\}\) and \(\{b_k\}\) are decreasing sequences of positive numbers tending to zero. They show that under suitable assumptions, \({\varvec{\theta }}_k\) tends to the global minimizer as \(k \rightarrow \infty \) in probability. In the machine learning and Bayesian inference literature, the closely related Stochastic Gradient Langevin Dynamics (SGLD) algorithm has attracted significant interest in the research community in recent years. The SGLD algorithm for global optimization can be formulated as

$$\begin{aligned} {\varvec{\theta }}_{k+1}={\varvec{\theta }}_k-a_k H({\varvec{\theta }}_k,\varvec{X}_{k+1})+\sqrt{2a_k/\beta } {\varvec{\epsilon }}_{k+1}, \end{aligned}$$
(44)

where \(\{{\varvec{\epsilon }}_{k}\}\) is a sequence of standard d-dimensional independent Gaussian variables and \(\beta >0\) is a temperature parameter. The batch version of this algorithm, Gradient Langevin Dynamics (GLD), is correspondingly given by

$$\begin{aligned} {\varvec{\theta }}_{k+1}={\varvec{\theta }}_k-a_k{\bar{H}}({\varvec{\theta }}_k)+\sqrt{2a_k/\beta } {\varvec{\epsilon }}_{k+1}. \end{aligned}$$
(45)

Assuming that the gradient H is Lipschitz continuous, and under further assumptions, Raginsky et al. [61] provide a non-asymptotic analysis of SGLD and GLD applied to non-convex problems for the case when the step size \(a_k\) is a positive constant. The analysis provides non-asymptotic guarantees for SGLD and GLD to find an approximate minimizer. The rate of convergence is further improved for both SGLD and GLD in the recent papers by Xu et al. [77] and Chau et al. [19], the latter even in the presence of dependent data streams.
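To illustrate (45) on a toy problem, the following sketch runs GLD on the one-dimensional double well \(L(\theta )=(\theta ^2-1)^2+0.3\,\theta \), whose global minimizer lies near \(\theta \approx -1.04\); the step size and temperature are illustrative choices, not tuned values.

import numpy as np

rng = np.random.default_rng(0)
L = lambda th: (th**2 - 1.0)**2 + 0.3 * th        # double well, global min near -1.04
grad_L = lambda th: 4.0 * th * (th**2 - 1.0) + 0.3

a, beta = 0.01, 5.0                               # fixed step size and temperature
theta = best = 1.0                                # start in the basin of the local min
for _ in range(50000):
    theta += -a * grad_L(theta) + np.sqrt(2.0 * a / beta) * rng.standard_normal()
    if L(theta) < L(best):
        best = theta
# The injected noise lets the iterates cross the barrier at theta ~ 0.08,
# so `best` typically ends up close to the global minimizer near -1.04.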

4.2 A Gradient Langevin Dynamics algorithm for minimization of kurtosis

Motivated by the enormous progress in the aforementioned optimization algorithms, we develop a GLD algorithm for global minimization of portfolio kurtosis. Since portfolio kurtosis is the ratio of two convex functions, the batch gradient is not simply given by the average as in (42). Given a sample of observed return data for a given asset universe, the sample covariance matrix and sample fourth co-moment matrix can be estimated. The batch version of the portfolio kurtosis is then given by

$$\begin{aligned} {\bar{h}}(\varvec{w})=\dfrac{{\bar{f}}(\varvec{w})}{{\bar{g}}(\varvec{w})}=\dfrac{\varvec{w}^{\top }\hat{{\mathbf {M}}}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})}{(\varvec{w}^{\top }\hat{{\mathbf {M}}}_2\varvec{w})^2}, \end{aligned}$$
(46)

where \(\hat{{\mathbf {M}}}_2\) and \(\hat{{\mathbf {M}}}_4\) denote the sample covariance and fourth co-moment matrices, respectively. Given the complicated form of the approximate bias for sample kurtosis, see Bao [9], global minimization of portfolio kurtosis is not easily adapted to the algorithms in Sect. 4.1 which utilize a stochastic unbiased estimate of the gradient. For this reason we only develop a GLD algorithm for the global minimization problem.

The algorithms in Sect. 4.1 are formulated for unconstrained optimization problems and hence need to be adapted to constrained minimization over the standard n-simplex. The GLD algorithm for the constrained problem is given by the following projected iterations

$$\begin{aligned} \varvec{w}_{k+1}=\Pi _{\mathcal {W}}\left( \varvec{w}_k-\lambda \nabla _{\varvec{w}}{\bar{h}}(\varvec{w}_k)+\sqrt{2\lambda /\beta }{\varvec{\epsilon }}_{k+1}\right) , \end{aligned}$$
(47)

where \(\Pi _{\mathcal {W}}\) denotes the Euclidean projection onto the feasible set, \(\lambda >0\) is the fixed step size and \(\varvec{w}\in \mathbb {R}^{n+1}\). Euclidean projection of a point onto the standard n-simplex is a quadratic program which can be solved very efficiently. See Chen and Ye [20] for a fast and simple algorithm for computing the projection onto the standard n-simplex. The gradient of the batch version of portfolio kurtosis is given by

$$\begin{aligned} \nabla _{\varvec{w}}{\bar{h}}(\varvec{w})=\dfrac{\nabla _{\varvec{w}}{\bar{f}} (\varvec{w})}{{\bar{g}}(\varvec{w})}-\dfrac{ {\bar{f}} (\varvec{w})\nabla _{\varvec{w}}{\bar{g}}(\varvec{w})}{\left( {\bar{g}}(\varvec{w})\right) ^2}, \end{aligned}$$
(48)

where the explicit form of the gradient \(\nabla _{\varvec{w}}{\bar{f}}(\varvec{w})\) is given in Appendix A and

$$\begin{aligned} \nabla _{\varvec{w}}{\bar{g}}(\varvec{w})=4\left( \varvec{w}^{\top }\hat{{\mathbf {M}}}_2\varvec{w}\right) \hat{{\mathbf {M}}}_2 \varvec{w}. \end{aligned}$$
(49)
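The projection \(\Pi _{\mathcal {W}}\) in (47) can be computed exactly in \(O(n\log n)\) time. The sketch below uses a standard sort-and-threshold method; Chen and Ye [20] describe an alternative fast algorithm.

import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = np.sort(v)[::-1]                          # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)                # shift that enforces the constraints
    return np.maximum(v - theta, 0.0)

# Sanity check: a point already on the simplex is its own projection
w = np.array([0.2, 0.3, 0.5])
assert np.allclose(project_simplex(w), w)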

Most convergence results for SGLD and GLD are only applicable to algorithms without projection. A natural way to avoid the projection step in each iteration would be to extend the objective function with a convex function outside of the feasible set. Naturally, the extended objective function needs to be continuous on the boundary of the feasible set and have a continuous gradient there. However, in Tawarmalani and Sahinidis [74] it is shown that a sufficient condition for the existence of a convex extension of a function outside of a convex feasible set is the convexity of the function. Even if the requirement of convexity of the function to be extended is relaxed so that convexity is only required close to the boundary of the feasible set, this does not hold for portfolio kurtosis: it can easily be shown that portfolio kurtosis is in general a non-convex function on the boundary of the feasible set. Hence, it is not possible to find a convex extension of portfolio kurtosis outside of the feasible set \(\mathcal {W}\).

When the objective function is convex, Bubeck et al. [16] provide convergence results for the projected SGLD and GLD algorithms. In the case of a non-convex objective function, no convergence results for projected SGLD and GLD currently exist in the literature, to the best of our knowledge, apart from Sabanis and Zhang [65], where the projection is achieved implicitly, via a transformation, under the assumptions of Lipschitz continuity and dissipativity for the gradient of the objective function. These conditions also hold in our kurtosis minimization problem, see (53) and (57), which provides the theoretical justification for the choice of the proposed projected GLD algorithm as our global optimization approach in higher dimensions. Nevertheless, it is acknowledged that if the feasible region involves further constraints, no such guarantees exist. It should however be mentioned that the analysis of SGLD and GLD algorithms is currently a very active research area.

In order for the iterations (47) to converge, the gradient \(\nabla _{\varvec{w}}{\bar{h}}(\varvec{w})\) needs to be Lipschitz continuous on the domain given by the feasible set \(\mathcal {W}\). The Hessian of \({\bar{h}}\) is given by

$$\begin{aligned} \nabla _{\varvec{w}}^2 {\bar{h}}(\varvec{w})=&\dfrac{1}{\left( {\bar{g}}(\varvec{w})\right) ^3} \left( \left( {\bar{g}}(\varvec{w})\right) ^2\nabla _{\varvec{w}}^2{\bar{f}}(\varvec{w})-{\bar{g}}(\varvec{w})\left( \nabla _{\varvec{w}}{\bar{f}}(\varvec{w})\left( \nabla _{\varvec{w}}{\bar{g}}(\varvec{w}) \right) ^{\top }+\nabla _{\varvec{w}}{\bar{g}}(\varvec{w})\left( \nabla _{\varvec{w}}{\bar{f}}(\varvec{w}) \right) ^{\top } \right) \right. \nonumber \\&- \left. {\bar{g}}(\varvec{w}){\bar{f}}(\varvec{w})\nabla _{\varvec{w}}^2 {\bar{g}}(\varvec{w})+2{\bar{f}}(\varvec{w})\nabla _{\varvec{w}}{\bar{g}}(\varvec{w})\left( \nabla _{\varvec{w}}{\bar{g}}(\varvec{w}) \right) ^{\top } \right) , \end{aligned}$$
(50)

where \(\nabla _{\varvec{w}}{\bar{f}}(\varvec{w})\) and \(\nabla _{\varvec{w}}^2 {\bar{f}}(\varvec{w})\) are given in Appendix A, \(\nabla _{\varvec{w}}{\bar{g}}(\varvec{w})\) is given in (49) and

$$\begin{aligned} \nabla _{\varvec{w}}^2 {\bar{g}}(\varvec{w})=12\left( \varvec{w}^{\top }\hat{{\mathbf {M}}}_2\varvec{w}\right) \hat{{\mathbf {M}}}_2. \end{aligned}$$
(51)

In (50), each component of the numerator is a polynomial of degree 10 and the denominator is a polynomial of degree 12. Since \(\hat{{\mathbf {M}}}_2\) is assumed to be positive definite, the minimum c of \({\bar{g}}(\varvec{w})\) is strictly positive over \(\mathcal {W}\), and one has that \(|{\bar{g}}(\varvec{w})|\ge c>0\) for \(\varvec{w}\in \mathcal {W}\). As each component of \(\nabla _{\varvec{w}}^2 {\bar{h}}(\varvec{w})\) is a continuous function, its value is bounded on the compact set \(\mathcal {W}\), and hence

$$\begin{aligned} \Vert \nabla _{\varvec{w}}^2 {\bar{h}}(\varvec{w})\Vert _2 \le K, \text { for all }\varvec{w}\in \mathcal {W}, \end{aligned}$$
(52)

which, using the mean value theorem, implies

$$\begin{aligned} \Vert \nabla _{\varvec{w}}{\bar{h}}(\varvec{u})-\nabla _{\varvec{w}}{\bar{h}}(\varvec{v}) \Vert _2\le K\Vert \varvec{u}-\varvec{v}\Vert _2, \text { for all } \varvec{u}, \varvec{v}\in \mathcal {W}, \end{aligned}$$
(53)

where the matrix norm in (52) is the Hilbert–Schmidt norm. Thus, the gradient of the portfolio kurtosis is Lipschitz continuous over the feasible set. In both Raginsky et al. [61] and Xu et al. [77] it is required that the objective function is dissipative in order for the convergence results to hold. The objective function \({\bar{h}}\) is dissipative on \(\mathcal {W}\) if there exist constants \(m>0\) and \(b\ge 0\) such that

$$\begin{aligned} \varvec{w}^{\top } \nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \ge m\Vert \varvec{w}\Vert _2^2-b, \text { } \forall \varvec{w}\in \mathcal {W}. \end{aligned}$$
(54)

Since the gradient of \({\bar{h}}(\varvec{w})\) is a continuous function it is bounded over \(\mathcal {W}\):

$$\begin{aligned} \Vert \nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \Vert _2 \le K_2, \text { } \forall \varvec{w}\in \mathcal {W}. \end{aligned}$$
(55)

Over the n-simplex \(\mathcal {W}\), where \(\Vert \varvec{w}\Vert _2\le 1\), the Cauchy–Schwarz inequality implies

$$\begin{aligned} |\varvec{w}^{\top }\nabla _{\varvec{w}}{\bar{h}}(\varvec{w})|\le \Vert \varvec{w}\Vert _2\Vert \nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \Vert _2 \le \Vert \nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \Vert _2 \le K_2, \end{aligned}$$
(56)

and hence \(\varvec{w}^{\top }\nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \ge -K_2\). Furthermore, \(a(\Vert \varvec{w}\Vert _2^2-1) \le 0\), for \(a>0\), implying

$$\begin{aligned} \varvec{w}^{\top }\nabla _{\varvec{w}}{\bar{h}}(\varvec{w}) \ge a(\Vert \varvec{w}\Vert _2^2-1)-K_2=a\Vert \varvec{w}\Vert _2^2-(K_2+a)=a\Vert \varvec{w}\Vert _2^2-b, \end{aligned}$$
(57)

and hence \({\bar{h}}(\varvec{w})\) is dissipative over \(\mathcal {W}\). This means that portfolio kurtosis satisfies the assumptions underlying the convergence results in the non-convex case for GLD and SGLD without projection. Even though we cannot rely on formal convergence results from the literature for GLD with projection, we will in the next section apply the projected GLD algorithm to some example problems with multiple local minima.

4.3 Numerical illustration

In this section we apply the projected GLD algorithm to an artificial kurtosis minimization problem in which all assets have identical marginal distributions and all correlations between different assets are negative and identical. Moreover, it is assumed that the weights are positive and sum to one. This problem provides a test bed for checking whether the projected GLD algorithm finds a point close to the global optimum. The problem has several local optima which, restricted to the non-zero weights, represent equally weighted portfolios with exposure to all or a subset of the assets. To see that this must be the case, consider two assets with non-zero weight at a local optimum. Since the assets are exchangeable, having identical marginal distributions and identical pairwise correlations, the optimization will allocate equal weight to both assets. Hence, every locally optimal portfolio assigns equal weight to all assets with non-zero weight at the respective local optimum.

The sample covariance matrix \(\hat{{\mathbf {M}}}_2\) and the sample fourth co-moment matrix \(\hat{{\mathbf {M}}}_4\) are calculated from \(10^7\) simulated asset returns with NIG-distributed margins and dependence structure given by the Gaussian copula with a homogeneous correlation matrix. The simulation procedure is described in Appendix B. The portfolio kurtosis of equally weighted portfolios for the cases with homogeneous correlation \(\rho =-0.2\) and \(\rho =-0.05\), respectively, and marginal kurtosis \(\kappa _m=6\), is displayed in Fig. 9. Note that for the described experimental setup, assuming no estimation error, the portfolio kurtosis for an equally weighted portfolio with n assets is equal to the portfolio kurtosis for an \(n+1\) asset portfolio with equal weight in n assets and zero weight in the remaining asset. To see this, consider the definition of portfolio kurtosis

$$\begin{aligned} \kappa _p(\varvec{w})=\dfrac{\mathbb {E}\left( \varvec{w}^{\top } \varvec{r}\right) ^4}{\left( \mathbb {E}\left( \varvec{w}^{\top } \varvec{r}\right) ^2\right) ^2}. \end{aligned}$$
(58)

From the definition it is apparent that, in the \(n+1\) asset case, setting one of the weights to zero and the remaining weights to 1/n yields the same kurtosis as the equally weighted portfolio in the n asset case. Inspecting the graph in Fig. 9a, one observes that the global optima for the five asset case are located at points with equal weights in four of the assets and zero weight in the remaining asset. Thus, assuming no estimation error, there are five global optima for the five asset problem. With \(10^7\) simulated sets of asset returns, the estimation error is small but nevertheless non-zero, and hence one of the points represents the unique global optimum with simulated data. Figure 9b displays the portfolio kurtosis for equally weighted portfolios with up to 15 assets for the case when the homogeneous correlation is \(-0.05\). Even though it is not distinguishable in the graph, the portfolio kurtosis of each of the equally weighted 14 asset sub-portfolios is slightly lower than that of the equally weighted 15 asset portfolio. Thus, for the 15 asset problem, the global optimum of the kurtosis minimization problem is given by assigning equal weight to 14 of the assets and zero weight to the remaining asset.

Fig. 9  Portfolio kurtosis for equally weighted portfolios as a function of the number of assets. The asset distributions are generated from a Gaussian copula with homogeneous correlation matrix and NIG-distributed margins with kurtosis \(\kappa _m=6\). (a) Homogeneous correlation \(\rho =-0.2\). (b) Homogeneous correlation \(\rho =-0.05\)

The projected GLD algorithm is applied to the problem of minimizing portfolio kurtosis for the experimental setup described above, with five and 15 assets, respectively. We implement a multistart version of the algorithm, where the iterations (47) are started from points \(\varvec{w}_0\in \mathcal {W}\) uniformly sampled over the feasible set. In order to generate starting points that are evenly distributed over the n-simplex defining the feasible set, the method described in Shaw [68] is used. For each generated path of the projected GLD iterations, the point with the smallest recorded objective function value is stored. The output from the algorithm \({\tilde{\varvec{w}}}\) is the point with the smallest overall recorded objective function value. Finally, the optimal solution is taken to be

$$\begin{aligned} \varvec{w}^{\text {GLD}}={{\,\mathrm{argmin}\,}}\{\kappa _p({\tilde{\varvec{w}}}),\kappa _p(\varvec{w}^{\text {Loc}})\}, \end{aligned}$$
(59)

where \(\varvec{w}^{\text {Loc}}\) denotes the solution from a local solver started at \({\tilde{\varvec{w}}}\). The complete multistart projected GLD algorithm is summarized below. The fixed step size \(\lambda \) is chosen to be 0.01 (based on numerical experimentation) for both the five and 15 asset problems. The temperature parameter \(\beta \) is chosen so that the noise term is large enough for the iterations of the projected GLD algorithm to jump between different local optima. Based on initial experimentation, the following formula for the temperature was chosen

$$\begin{aligned} \beta =\dfrac{2\lambda (n+1)^2}{c^2}, \end{aligned}$$
(60)

where \(n+1\) is the number of assets and c was chosen to be 0.06 for the five asset case and 0.1 for the 15 asset case. For the implementation, the number of paths \(n_{sim}\) was \(10^5\) and the number of iterations per path \(n_{iter}\) was \(10^4\), implying that \(10^9\) points in the search space were visited by the algorithm. Given the multistart design, the algorithm is very easy to parallelize; it was run on a multi-core processor with 24 cores. For the five asset case, the multistart projected GLD algorithm finds a solution with equal weight in four assets and zero weight in one asset. By running the BB algorithm on the same problem, it was confirmed that the GLD algorithm finds the global optimum. The runtime for the parallelized algorithm with 24 cores was 2,476 seconds for the five asset case and 8,322 seconds with 15 assets.

Multistart projected GLD algorithm

Input: \(\lambda \), \(\beta \), \(n_{sim}\), \(n_{iter}\), \(\hat{\mathbf {M}}_2\), \(\hat{\mathbf {M}}_4\).

for \(i=1,2,\ldots ,n_{sim}\) do

   Generate \(\varvec{w}_0\in \mathbb {R}^{n+1}\) uniformly on \(\mathcal {W}\);

   for \(k=0,1,\ldots ,n_{iter}\) do

      Generate \({\varvec{\epsilon }}_{k+1}\sim N(\varvec{0},\mathbf {I})\);

      \(\varvec{w}_{k+1}=\Pi _{\mathcal {W}}\left( \varvec{w}_k-\lambda \nabla _{\varvec{w}}\bar{h}(\varvec{w}_k)+\sqrt{2\lambda /\beta }{\varvec{\epsilon }}_{k+1}\right) \);

   end

   \(\varvec{w}_i^s={{\,\mathrm{argmin}\,}}\{\kappa _p(\varvec{w}_0), \kappa _p(\varvec{w}_1), \ldots , \kappa _p(\varvec{w}_{n_{iter}}) \}\);

end

Output: \(\tilde{\varvec{w}}={{\,\mathrm{argmin}\,}}\{\kappa _p(\varvec{w}_1^s), \kappa _p(\varvec{w}_2^s), \ldots , \kappa _p(\varvec{w}_{n_{sim}}^s) \}\);
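A compact Python sketch of the algorithm above, with project_simplex as in the sketch of Sect. 4.2. The gradient follows (48) and (49); for simplicity, \(\nabla _{\varvec{w}}{\bar{f}}(\varvec{w})\) is taken as \(4\hat{{\mathbf {M}}}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})\), which presumes that \(\hat{{\mathbf {M}}}_4\) is stored in fully symmetric form (the exact expression is given in Appendix A), the uniform starting points are generated by normalizing i.i.d. exponentials rather than by the method of Shaw [68], and the final local polishing step in (59) is omitted.

import numpy as np

def kurtosis(w, M2, M4):
    www = np.kron(np.kron(w, w), w)
    return (w @ M4 @ www) / (w @ M2 @ w)**2

def grad_kurtosis(w, M2, M4):
    """Gradient (48)-(49) of the batch portfolio kurtosis."""
    www = np.kron(np.kron(w, w), w)
    f, g = w @ M4 @ www, (w @ M2 @ w)**2
    grad_f = 4.0 * (M4 @ www)                    # fully symmetric M4 assumed
    grad_g = 4.0 * (w @ M2 @ w) * (M2 @ w)
    return grad_f / g - f * grad_g / g**2

def multistart_gld(M2, M4, lam, beta, n_sim, n_iter, rng):
    best = None
    for _ in range(n_sim):
        w = rng.exponential(size=M2.shape[0])    # uniform start on the simplex
        w /= w.sum()
        path_best = w
        for _ in range(n_iter):
            noise = np.sqrt(2.0 * lam / beta) * rng.standard_normal(w.size)
            w = project_simplex(w - lam * grad_kurtosis(w, M2, M4) + noise)
            if kurtosis(w, M2, M4) < kurtosis(path_best, M2, M4):
                path_best = w
        if best is None or kurtosis(path_best, M2, M4) < kurtosis(best, M2, M4):
            best = path_best
    return best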

Fig. 10  The distribution of the final iterate for the weight of asset 1 in the 15 asset case: the full distribution (a) and the zoomed-in distribution (b)

Fig. 11  One of the paths produced by the projected GLD algorithm for the weight of asset 1 (a) and asset 15 (b)

For the 15 asset case, the output from the GLD algorithm is a portfolio with equal weights in 14 of the assets and zero weight in one asset. As argued above, this represents a global optimum for the 15 asset problem. As a comparison, a local solver was started from the generated starting points \(\varvec{w}_0\) for each of the \(n_{sim}\) outer simulations of the GLD algorithm. The built-in interior point solver in MATLAB was used as the local solver. For all of the generated starting points, the output from the local solver is the equally weighted portfolio with non-zero weights in all of the 15 assets. Thus, the multistart projected GLD algorithm jumps between the local optima and is able to locate the global optimum for the 15 asset problem, whereas a multistart algorithm using a local solver finds a local optimum in all cases. The distribution of the final iterate from the GLD algorithm for one of the assets is illustrated in Fig. 10. Figure 10a displays the full distribution, where many of the final iterates are concentrated around zero, whereas Fig. 10b shows the distribution zoomed in around the non-zero weights. The ability of the GLD algorithm to produce iterates that jump between different local optima is illustrated by the graphs in Fig. 11. In general it is not possible to verify whether the output from the GLD algorithm is a global optimum, but the experiments indicate that the algorithm is a useful tool for locating the global optimum for problems where the number of assets is out of reach for the BB algorithm.

5 Diversification of a typical multi-asset universe–U.S. investor

In this section we apply the introduced portfolio diversification framework to an asset universe which is representative of the constituents of a typical U.S. institutional investor portfolio. The asset universe is multi-asset in nature and consists of exposures to U.S. and international equities in developed and emerging markets, as well as exposures to property, corporate bonds across both investment grade and high yield, government bonds in developed and emerging markets, and inflation protected securities. For the analysis, each of the 12 asset classes is represented by a suitable index which accurately captures the respective asset class characteristics. The chosen indices are listed in Table 3, which also includes a short description of each index. The asset universe is divided into three broader categories: equities including REITs (EQ), higher yielding credit including emerging markets debt (HY), and government and investment grade bonds (BD). The categorisation can be justified by the correlation structure as well as by common practices in the industry. The relative weights of these three categories are in practice likely to be decided by strategic decisions, constraints and risk appetite. It is therefore neither relevant nor realistic to compare portfolio construction methodologies that can allocate freely across these categories. In particular, we believe that empirical studies which allow large overweights in bonds (such as a naïve application of risk parity) are unlikely to be useful for practitioners when making forward-looking decisions, given the unprecedented bull run in bonds over the last 30 years and the historically low level of yields available at this point in time. We therefore chose a hierarchical approach, which assumes fixed weights across the three categories in line with the median asset allocation of large U.S. institutional investors (55% in equities, 20% in higher yielding credit and 25% in bonds). The weights within each category are determined by minimizing portfolio kurtosis and, for comparison, through various other portfolio construction methodologies.

Table 3 The indices that constitute the asset universe of a typical U.S. institutional investor.

Due to the large estimation errors associated with estimates of higher order moments, each index is assigned a representative kurtosis parameter. The parameter values, given in Table 3, were determined from a combination of estimation from historical data and consistency with estimates reported in the literature, see e.g. Xiong and Idzorek [76]. Historical data reveals positive, in many cases strong, correlations between the indices within the equity, higher yielding credit and bond sub-portfolios, whereas the correlations between indices belonging to different sub-portfolios are comparatively small and time varying. The heat map of the correlation matrix for the asset universe as of February 2005 in Fig. 12 illustrates the approximate block diagonal structure. In the heat map, a red colour represents a positive correlation, whereas a negative correlation is shown in blue. The intensity of the colour indicates the magnitude of the correlation. From the heat map it is evident that the intra-block correlations are strongly positive, whereas the inter-block correlations are much weaker, with both positive and negative values.

Fig. 12  Heat map of the approximately block diagonal correlation matrix for the 12 indices as of the start of the backtesting period in February 2005. The three blocks represent the equity (EQ), high yield (HY) and bond (BD) portfolios, respectively. Positive and negative correlations are indicated in red and blue, respectively, whereas the magnitudes of the correlations are indicated by the intensity of the colour

The portfolio diversification framework developed in this paper was applied to the multi-asset universe in an out-of-sample analysis. Historical returns expressed in U.S. dollar terms for the 12 indices over the period from September 2001 to September 2020, obtained from Bloomberg and Thomson Reuters DataStream, were used as input to the analysis. We consider a realistic real-world multi-asset allocation problem with semi-annual rebalancing of the portfolio. At each rebalancing date the covariance matrix was estimated from the previous 180 observations of weekly returns using an EWMA estimator with a half-life of one year. Since the first 180 weeks of the return history were used for estimating the covariance matrix for the first rebalancing date in February 2005, there are 32 semi-annual rebalancing dates over the backtesting period.
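The covariance estimation step can be sketched as follows, assuming a \(180\times n\) array of weekly returns with the most recent observation last; with weekly data, a one-year half-life corresponds to 52 periods. The exact weighting conventions used in the paper may differ in detail.

import numpy as np

def ewma_covariance(returns, half_life=52):
    """EWMA covariance from a (T x n) array of returns, newest row last;
    half_life is expressed in observation periods."""
    T = returns.shape[0]
    lam = 0.5 ** (1.0 / half_life)               # decay implied by the half-life
    weights = lam ** np.arange(T - 1, -1, -1)    # oldest observation weighted least
    weights /= weights.sum()
    X = returns - weights @ returns              # weighted demeaning
    return (weights[:, None] * X).T @ X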

In Sect. 2.2 it was shown that when the measure of non-Gaussianity \(\nu \) is given by excess kurtosis, portfolio dimensionality is defined as

$$\begin{aligned} d_{Z,\nu }(\varvec{w})=\dfrac{\nu (Z)}{\nu \left( \varvec{w}^{\top }\varvec{r}\right) }, \end{aligned}$$
(61)

where \(\varvec{r}\) is the return vector, \(\varvec{w}\) is the corresponding weight vector and Z represents the reference asset. For all problem instances encountered over the backtesting period, the minimum portfolio excess kurtosis is always positive. Thus, in this case maximizing portfolio dimensionality is equivalent to minimizing portfolio kurtosis. In the remainder of this section we therefore refer to the minimum kurtosis portfolio as the optimised dimensionality portfolio. As explained at the beginning of this section, the relative weights across the three sub-portfolios were held constant at each rebalancing date, and hence the following minimum kurtosis problem was solved for each sub-portfolio and rebalancing date

$$\begin{aligned} \min \limits _{\varvec{w}\in {\mathcal {W}}}\dfrac{\varvec{w}^{\top }\mathbf {M}_4(\varvec{w}\otimes \varvec{w}\otimes \varvec{w})}{\left( \varvec{w}^{\top } \mathbf {M}_2\varvec{w}\right) ^2}, \end{aligned}$$
(62)

where the feasible set is given by \({\mathcal {W}}=\left\{ \varvec{w}\in \mathbb {R}^{n_s+1} \text { }| \text { }\sum _{i=0}^{n_s}w_i=1, \text { }w_i \ge 0, \text { }i=0,\ldots ,n_s \right\} \), and \(n_s+1\) is the number of assets in sub-portfolio s. As input to the optimization, the moment matrix \(\mathbf {M}_4\) was estimated by simulating \(10^7\) realisations of the asset returns, whose multivariate distribution was modeled by a Gaussian copula and NIG-distributed margins, see Appendix B. Even though the estimated covariance matrix is an input to the portfolio construction at each rebalancing date, for consistency the moment matrix \(\mathbf {M}_2\) was also estimated from the simulated returns. We chose to model the dependence structure with a Gaussian copula as it allows practitioners to base the model on readily available data, since existing portfolio diversification models are based on estimated covariance matrices. The model can hence be seen as the simplest possible enhancement of existing portfolio diversification models, where the only further model input needed is the set of marginal distributions for the assets. The minimum kurtosis problem was solved for each sub-portfolio and each rebalancing date over the backtesting period. At each rebalancing date the total portfolio was then constructed by allocating 55% of the capital to the equity sub-portfolio, 20% to the high yield sub-portfolio and 25% to the bond sub-portfolio,

$$\begin{aligned} \varvec{w}_{{OD}}^{\top }=\left[ 0.55 \varvec{w}_{{EQ}}^{\top },\text { }0.2 \varvec{w}_{{HY}}^{\top }, \text { }0.25 \varvec{w}_{{BD}}^{\top }\right] , \end{aligned}$$
(63)

where \(\varvec{w}_{{OD}}\), \(\varvec{w}_{{EQ}}\), \(\varvec{w}_{{HY}}\) and \(\varvec{w}_{{BD}}\) denote the weight vectors for the optimised dimensionality portfolio and the equity-, high yield- and bond sub-portfolios, respectively. Obviously, given that the weights across the sub-portfolios are fixed, the total optimised dimensionality portfolio is likely not optimal for the full 12 asset universe.

As exemplified by the numerical example in Sect. 3.3, the computational time to solve the six asset problem with the BB algorithm is around 14 hours. Since the minimum kurtosis problems had to be solved for all 32 rebalancing dates, an alternative strategy was used instead of solving the six asset problem with the BB algorithm. The strategy exploits an empirical observation about the minimum kurtosis problem: in the case of non-negative correlations and positive weights that sum to one, there is just one global minimum, no other local minima, and the global minimum can always be found by a local optimiser. For such problem instances, extensive empirical runs have failed to generate even a single counterexample where the solution from a local solver differs from the solution returned by the BB algorithm. Since the correlation matrices for each sub-portfolio and rebalancing date only contain positive values, the minimum kurtosis problem instances all have this empirically observed property. This observation was also verified for the problem instances at hand by solving a reduced version of the minimum kurtosis problem where one of the assets from the equity sub-portfolio was removed, resulting in a five asset problem. In a first step, the reduced minimum kurtosis problem was solved with a local solver for each sub-portfolio and rebalancing date, using the same seed for the random number generator when generating the scenarios for the calculation of the moment matrices. In the next step, the same sequence of minimum kurtosis problems was solved using the BB algorithm and the same seed for the random number generation. This allowed us to confirm that the local solver found the global optimum for all rebalancing dates and all sub-portfolios, which is a strong indication that the local solver finds the global optimum for all instances of the full problem as well. When solving the minimum kurtosis problems we therefore used the built-in MATLAB interior point solver for the six asset problem.

In the analysis, the optimised dimensionality portfolio is compared to a portfolio which is representative of the median U.S. institutional investor, as well as to the portfolios obtained from applying four commonly used portfolio construction methodologies to the multi-asset universe: risk parity, diversification ratio, minimum variance and equal weights. The portfolio construction methodologies are applied to the three sub-portfolios separately, constraining the weights to be non-negative and to sum to one. The portfolio weights for the median U.S. investor are based on internal estimates from Aberdeen Standard Investments. Naturally, the allocation changes over time, but the weights are a good representation of the typical U.S. institutional investor portfolio. As for the optimised dimensionality portfolio, the total portfolio weights for each methodology are finally determined by assigning 55% of the capital to the equity sub-portfolio, 20% to the high yield sub-portfolio and 25% to the bond sub-portfolio. In Bai et al. [8] it is shown that the risk parity portfolio subject to non-negative weights summing to one can be found by solving a convex optimization problem. The risk parity weights for the sub-portfolios are thus determined as the solution to a convex optimization problem which uses the estimated covariance matrix at each rebalancing date as input. As shown in Choueifaty et al. [22], the problem of maximising the diversification ratio

$$\begin{aligned} \dfrac{\varvec{w}^{\top } \varvec{\sigma }}{\sqrt{\varvec{w}^{\top }\mathbf {M}_2\varvec{w}}}, \end{aligned}$$
(64)

where \(\varvec{\sigma }\) is the vector of asset volatilities, is also a convex optimisation problem when the weights are constrained to be non-negative and to sum to one. Since the problem of minimizing the portfolio variance \(\varvec{w}^{\top } \mathbf {M}_2\varvec{w}\) for a fully invested long-only portfolio is a quadratic program, the portfolios used for comparison with the optimised dimensionality portfolio were either determined by a convex optimisation problem using the estimated covariance matrix as input, or used fixed weights for all rebalancing dates. The fixed weights of the U.S. institutional portfolio and the equally weighted portfolio are displayed in Table 4.
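For illustration, the long-only maximum diversification ratio portfolio can be computed via the standard convex reformulation: minimize \(\varvec{y}^{\top }\mathbf {M}_2\varvec{y}\) subject to \(\varvec{\sigma }^{\top }\varvec{y}=1\) and \(\varvec{y}\ge 0\), and rescale the solution to sum to one. A sketch using the cvxpy modelling package (any QP solver would do):

import cvxpy as cp
import numpy as np

def max_diversification_weights(M2):
    """Long-only portfolio maximizing (64) via the convex reformulation
    min y'M2y  s.t.  sigma'y = 1, y >= 0, followed by rescaling."""
    sigma = np.sqrt(np.diag(M2))                 # asset volatilities
    y = cp.Variable(M2.shape[0])
    cp.Problem(cp.Minimize(cp.quad_form(y, M2)),
               [sigma @ y == 1, y >= 0]).solve()
    return y.value / y.value.sum()               # weights summing to one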

Table 4 Average weights for the six portfolios obtained from the out-of-sample analysis over the backtesting period from 16 February 2005 to 16 September 2020.

We now analyse the portfolios obtained from applying the minimum kurtosis methodology, as well as the described alternative methodologies, to the multi-asset universe in the described out-of-sample time series study. We have chosen a realistic setup with semi-annual rebalancing and fixed weights between the sub-portfolios, reflecting the portfolio allocation process of a typical large U.S. institutional investor. The average weights over time for the six portfolio construction methodologies are given in Table 4 and illustrated in Fig. 13. From the table, one observes that the risk parity portfolio has the most evenly distributed weights out of the dynamic portfolios, whereas the minimum variance methodology produces the most concentrated portfolio. The average weights of the optimised dimensionality portfolio and the diversification ratio portfolio are very similar for the equity and high yield sub-portfolios, whereas the weights differ substantially for the bond portfolio. The weights for these two portfolios are more unevenly distributed than for the risk parity portfolio, but much less concentrated than for the minimum variance portfolio.

Of the dynamic portfolios, the risk parity portfolio has the most stable weights over time, whereas the minimum variance portfolio has the highest turnover. The optimised dimensionality portfolio and the diversification ratio portfolio lie between the risk parity and minimum variance portfolios in terms of weight stability over time, with the optimised dimensionality portfolio showing the lower turnover of the two.

Fig. 13 Average weights for the six portfolios obtained from the out-of-sample analysis over the backtesting period from 16 February 2005 to 16 September 2020

The realized performances of the six portfolios over the backtesting period from 16 February 2005 to 16 September 2020 are shown in Fig. 14. The sample period incorporates the Global Financial Crisis of 2007–2008, the subsequent European debt crisis and the recent sharp decline in asset prices caused by the economic impact of the coronavirus pandemic. From the graph it can be observed that the performances of the six portfolios are very similar. Since we are considering a realistic multi-asset portfolio allocation example, where the weights across sub-portfolios are held constant, we do not expect dramatically different behaviours across portfolios, which the graph confirms. In addition to the portfolio performances, the graph displays the time intervals of seven historical bear market scenarios, indicated by the grey shaded areas. The definitions of the bear market scenarios are given in Table 5, together with a description of each scenario. As can be seen from the graph in Fig. 14, the most dramatic drop in portfolio value for all of the portfolios was caused by the coronavirus-induced market decline, during which asset prices fell sharply and liquidity evaporated.

Fig. 14 Realized performance of the six portfolios obtained from the out-of-sample analysis over the backtesting period from 16 February 2005 to 16 September 2020. The grey shaded areas represent the time intervals for the bear market scenarios defined in Table 5

In this paper we have argued strongly that measures of diversification should be related to the tail properties of portfolio returns, and have thus introduced the notion of dimensionality. In practice, tail risk is perceived differently across investor types, depending on institutional mandate, regulatory restrictions or simply risk appetite. We therefore compare portfolio construction techniques not by Sharpe ratio, as is commonly done, but across a variety of commonly used measures of tail risk: statistical measures, such as skewness and kurtosis, in addition to measures used more in practice, such as maximum drawdown and Expected Shortfall at different confidence levels scaled by the volatility. Instead of using Expected Shortfall directly as a tail risk measure, we consider Expected Shortfall divided by the portfolio volatility

$$\begin{aligned} \dfrac{ES_{\alpha }(\varvec{w})}{\sqrt{\varvec{w}^{\top }\mathbf {M}_2\varvec{w}}}, \end{aligned}$$
(65)

for three different confidence levels \(\alpha \). The ratio in (65) satisfies our requirement of being a leverage invariant tail risk measure: since Expected Shortfall and portfolio volatility are both homogeneous functions of degree one in the weights, their ratio is homogeneous of degree zero and hence unchanged by leverage. For a chosen time interval, the maximum drawdown (MDD) of a portfolio is defined as the maximum observed loss from a peak to a trough, before a new peak is attained. The drawdown (DD) of a portfolio with value W(t) at time \(t\ge 0\) is defined as

$$\begin{aligned} DD(t)=\dfrac{W(t)-W_{{peak}}(t)}{W_{{peak}}(t)}, \quad \text { where } \quad W_{{peak}}(t)=\max \limits _{\tau \in [0,t]}W(\tau ). \end{aligned}$$
(66)

The MDD over the time interval [0, T] is then formally defined as

$$\begin{aligned} MDD_{[0,T]}=\min \limits _{t\in [0,T]} DD(t). \end{aligned}$$
(67)
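Both the scaled Expected Shortfall in (65) and the MDD in (66)–(67) are straightforward to estimate from samples; the sketch below is a minimal numpy implementation using standard empirical estimators (the function names and the default confidence level are our own choices, not taken from the paper).

```python
import numpy as np

def scaled_expected_shortfall(returns, alpha=0.975):
    # Empirical estimator of (65): average loss beyond the alpha-quantile
    # of the loss distribution, divided by the sample volatility.
    losses = -np.asarray(returns)
    var_alpha = np.quantile(losses, alpha)         # empirical Value-at-Risk
    es_alpha = losses[losses >= var_alpha].mean()  # Expected Shortfall
    return es_alpha / np.std(returns, ddof=1)

def max_drawdown(values):
    # Drawdown per (66): distance of W(t) from its running peak, relative to
    # the peak; the MDD in (67) is the most negative drawdown over the sample.
    values = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(values)
    drawdowns = (values - running_peak) / running_peak
    return drawdowns.min()
```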
Table 5 Definitions and descriptions of the bear market scenarios

The realized tail risk measures, together with the realized mean return, volatility and Sharpe ratio, of the six portfolios over the backtesting period are given in Table 6. For each measure, the value of the best performing portfolio is displayed in bold font. Even though the portfolio weights of most of the dynamic construction methodologies differ substantially, the realized volatilities of the aggregated portfolios are of approximately the same magnitude for all six portfolios, which can be attributed to the fixed weights across the sub-portfolios. Since the realized mean returns of the portfolios are also of the same magnitude, the Sharpe ratios are very similar across methodologies. As we have argued, the diversification properties of the portfolio construction methodologies should instead be evaluated through realisations of commonly used tail risk measures over the sample period.

Table 6 Out-of-sample realized mean return, volatility, Sharpe ratio and six measures of tail risk for the six portfolios over the backtesting period from 16 February 2005 to 16 September 2020.
Table 7 Tail risk rankings of the six portfolios for each of the six tail risk measures

In a realistic setting, where rebalancing cannot happen very often (we assume semi-annual rebalancing, as is common practice) and fixed weights are used for the different sub-portfolios, one is unlikely to observe dramatically different behaviour across portfolios, and it is also not to be expected that one technique would dominate the others on all measures of tail risk. We therefore rank the portfolio construction methodologies for each measure of tail risk, as demonstrated in Table 7. From the table we conclude that our proposed portfolio construction methodology based on dimensionality achieved the best tail properties on average. The table also reveals that the optimised dimensionality portfolio has the fourth highest out-of-sample kurtosis. This can be explained by two main factors. First, the optimised dimensionality portfolio is found by minimizing portfolio kurtosis given the chosen kurtosis parameters for the individual assets in the portfolio, and the realized values of kurtosis for the individual assets over the backtesting period were in many cases far from the chosen parameters. Second, even if the optimised dimensionality portfolio attains lower kurtosis within each sub-portfolio, the fixed weights between the sub-portfolios can produce an aggregate portfolio that is sub-optimal in terms of kurtosis. The small difference in average tail risk ranking between the optimised dimensionality and the diversification ratio portfolios can be attributed to the fact that the two methodologies produce very similar weights for the equity and high yield sub-portfolios. If more precise estimates of tail properties are known and used as input, the results of the optimised dimensionality portfolio can be refined.
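The ranking step itself is mechanical. As a hypothetical illustration (the numbers below are made up, and we assume each measure has been signed so that smaller values indicate better tail behaviour, e.g. skewness enters with a flipped sign):

```python
import pandas as pd

# Hypothetical input: rows = portfolios, columns = tail risk measures,
# each oriented so that smaller means better. Values are illustrative only.
tail_risk = pd.DataFrame(
    {"kurtosis": [5.1, 4.8, 6.0],
     "MDD": [0.31, 0.28, 0.35],
     "ES 97.5%/vol": [2.9, 2.7, 3.1]},
    index=["risk parity", "opt. dimensionality", "min variance"],
)
ranks = tail_risk.rank(axis=0, ascending=True)  # rank 1 = best per measure
print(ranks.mean(axis=1).sort_values())         # average rank, best first
```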

In Table 8 we complement the analysis by investigating the performance of the portfolio construction methodologies over the bear market scenarios defined in Table 5. As can be observed in the table, the minimum variance portfolio realized losses of the smallest magnitude in all but one of the bear market scenarios. This indicates that, with perfect foresight, one should concentrate the allocation in the assets with the lowest volatility during periods of extreme market stress. In practice, of course, the timing of these periods cannot be perfectly forecast. As demonstrated through the realized tail risk measures, the best overall tail risk properties over the full backtesting period, which includes several periods of extreme market stress, are obtained via the optimised dimensionality methodology.

Table 8 Returns for the six portfolios over the bear market scenarios with start and end dates as defined in Table 5

The dimensionalities of the aggregated portfolio, as well as of the three sub-portfolios, were measured for all portfolio construction methodologies at each rebalancing date over the backtesting period. The excess kurtosis \(\nu (Z)\) of the reference asset in the portfolio dimensionality expression (61) was chosen to be representative of the respective sub-portfolio. Hence, the excess kurtosis of the reference asset when measuring portfolio dimensionality for the equity, high yield and bond sub-portfolios was chosen to be 3, 9 and 1, respectively. For the aggregated portfolio we chose an excess kurtosis of 3 for the reference asset, allowing us to interpret the dimensionalities of the aggregated portfolios as the equivalent number of independent equity exposures. In Figs. 15 and 16, the measured dimensionalities of the aggregated portfolio, as well as of the three sub-portfolios, are displayed for all six portfolio construction methodologies. From the graphs in Figs. 15b and 16a, one can observe that, for the equity and high yield sub-portfolios, the dimensionalities of the optimised dimensionality and diversification ratio portfolios are very close, and higher than for the other portfolios, for all dates over the backtesting period. For the bond sub-portfolio, the optimised dimensionality portfolio dominates the other portfolios over the full sample period in terms of measured dimensionality.
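The interpretation as an equivalent number of independent exposures is motivated by the fact that a sum of \(n\) i.i.d. variables with excess kurtosis \(\nu \) has excess kurtosis \(\nu /n\). The sketch below assumes that the dimensionality expression (61) reduces to the simple ratio \(\nu (Z)/\nu (R_{\varvec{w}})\) in this i.i.d.-motivated special case; the full expression (61) is defined earlier in the paper and may differ, so this is an illustrative assumption rather than the authors' formula.

```python
import numpy as np
from scipy.stats import kurtosis

def dimensionality(portfolio_returns: np.ndarray, nu_reference: float) -> float:
    # I.i.d.-motivated reading of (61): a sum of n independent copies of the
    # reference asset has excess kurtosis nu_reference / n, so the ratio below
    # counts the equivalent number of independent bets (assumption, see text).
    nu_portfolio = kurtosis(portfolio_returns, fisher=True)  # excess kurtosis
    return nu_reference / nu_portfolio
```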

Fig. 15 a Dimensionality of the six aggregated 55-20-25 portfolios over the backtesting period from 16 February 2005 to 16 September 2020. b Dimensionality of the six equity portfolios over the same period

Fig. 16 a Dimensionality of the six high yield portfolios over the backtesting period from 16 February 2005 to 16 September 2020. b Dimensionality of the six bond portfolios over the same period

Since the weights across sub-portfolios are fixed, the dimensionality obtained with the optimised dimensionality methodology is not guaranteed to be optimal for the aggregate portfolio. This can be observed in Fig. 15a, where the dimensionality of the diversification ratio portfolio dominates all other portfolios for some dates at the beginning of the backtesting period. The graph in Fig. 15a also reveals that the dimensionalities of five of the aggregate portfolios are very close during bear market periods, when correlations between asset classes increase, indicating that opportunities for diversification diminish in such circumstances.

To conclude, using a realistic setup with fixed weights across sub-portfolios and semi-annual rebalancing, we have demonstrated the usefulness of our proposed methodology based on dimensionality. Compared to four commonly used portfolio construction methodologies and the U.S. institutional median portfolio, the optimised dimensionality portfolio showed the best overall tail risk properties over the sample period. We also observe that an easily interpretable statistic such as portfolio dimensionality has the additional advantage of being readily explainable and of being a relevant statistic in itself. As a final remark, we mention that extending the investment universe can be very beneficial if dimensionality optimization or tail risk mitigation is the objective. The universe could be extended with strategies that are not driven by traditional risk premia but instead exploit structural or behavioural effects in the markets. Examples of such strategies are momentum and low beta strategies, which show low correlations to traditional asset classes; see e.g. Jegadeesh and Titman [45], Asness et al. [5], and Frazzini and Pedersen [34].

6 Conclusions

In this paper we have introduced a portfolio diversification framework based on a novel measure called portfolio dimensionality. This measure is directly related to the tail risk of the portfolio and is leverage invariant, which means that it can typically be expressed as a ratio of convex functions. In order to solve the global optimization problem that arises when minimizing portfolio kurtosis, two complementary global optimization algorithms have been formulated: a deterministic BB algorithm and a stochastic GLD algorithm. Solving the problem with the BB algorithm, one can guarantee that the global optimum has been found. However, the BB algorithm suffers from the curse of dimensionality, which limits the size of the problems it can handle. A complementary stochastic optimization algorithm for the global optimization problem has therefore been formulated. As illustrated in Sect. 4.3, the multistart projected GLD algorithm can find the global optimum in cases where a multistart local solver does not. The projected GLD algorithm therefore complements the BB algorithm and allows for solving problems in higher dimensions, albeit without the guarantee that the global optimum will be found. An alternative for larger problems is to run the BB algorithm for a fixed number of iterations: empirically, we observed that the vast majority of the BB solution time is spent proving optimality of a point found early on, so this heuristic may be used instead of the projected GLD algorithm. Furthermore, we observed empirically that for problem instances where all correlations are positive, a local solver finds the global optimum, as verified by the BB algorithm.

Our framework extends the diversification frameworks in the literature that are based only on the covariance matrix. Through numerical experiments we have illustrated that our framework possesses the desirable properties put forward in the portfolio diversification literature, in contrast to commonly used frameworks such as risk parity and the most diversified portfolio. In order to avoid the problem of obtaining robust estimates of asymmetric tail dependencies between asset returns (see [33]), we have in this paper chosen to model the dependence structure with a Gaussian copula. It is possible to extend the framework to take dynamic volatilities and correlations, as well as non-linear dependence, into account. The model can be extended by using a dynamic GARCH model with skewed and leptokurtic innovations for the marginal distributions, together with a dynamic conditional correlation model for the copula correlations, see Engle [30]. Furthermore, in order to capture the asymmetric tail dependence observed in financial markets, a skewed t copula can be used as in Christoffersen et al. [23]. Alternatively, a non-linear dependence structure can be modelled with regime shifts as in Ang and Bekaert [3].