Analytic approach to variance optimization under an ℓ1 constraint

The optimization of the variance of a portfolio of N independent but not identically distributed assets, supplemented by a budget constraint and an asymmetric ℓ1 regularizer, is carried out analytically by the replica method borrowed from the theory of disordered systems. The asymmetric regularizer allows us to penalize short and long positions differently, so the present treatment includes the no-short-constrained portfolio optimization problem as a special case. Results are presented for the out-of-sample and the in-sample estimator of the regularized variance, the relative estimation error, the density of the assets eliminated from the portfolio by the regularizer, and the distribution of the optimal portfolio weights. We have studied the dependence of these quantities on the ratio r of the portfolio's dimension N to the sample size T, and on the strength of the regularizer. We have checked the analytic results by numerical simulations, and found general agreement. Regularization extends the interval where the optimization can be carried out and suppresses the large sample fluctuations, but the performance of ℓ1 regularization is rather disappointing: if the sample size is large relative to the dimension, i.e. r is small, the regularizer does not play any role, while for r's where the regularizer starts to be felt the estimation error is already so large as to make the whole optimization exercise pointless. We find that ℓ1 regularization can eliminate at most half the assets from the portfolio (by setting their weights to exactly zero); correspondingly, there is a critical ratio r = 2 beyond which the ℓ1-regularized variance cannot be optimized: the regularized variance becomes constant over the simplex. These facts do not seem to have been noticed in the literature.


Introduction
In this paper, we present analytic results for a simple quadratic optimization problem with a linear constraint plus an ℓ1 regularizer. Although we are going to speak in terms of portfolio optimization, it is important to emphasize that the problem we address is not specific to portfolios, but is a generic feature of quadratic optimization if the dimension is high and the objective function is estimated on the basis of a limited number of observations. We will assume that there is no additional information (like prior knowledge or sparsity) available besides the observations and wish to find out how much can be learned from the limited data. Our objective function will be the portfolio variance. In order to find the optimum of the variance over the portfolio weights, one has to invert the estimated covariance matrix, which is possible only if its dimension N is not larger than the number of observations T. The ratio r = N/T is a fundamentally important control parameter of the problem. If the number of observations is much larger than the dimension, classical statistics works and the estimated optimum will be very close to the true optimum, which can be obtained in the limit T → ∞. If T is not very large relative to N, we are in the high-dimensional regime where sample fluctuations can be large and regularizers have to be introduced to rein them in. Regularizers suppress large excursions, and unavoidably introduce bias, but the hope is that a reasonable trade-off can be achieved between the bias and the sample fluctuations with a proper choice of the strength of the regularizer. To see whether this hope is fulfilled is one of the aims of this paper.
A common regularizer is ℓ2 (shrinkage or ridge regression), whose effect has been studied by a number of authors, see [1][2][3][4][5][6][7] among many others. In its most recent, nonlinear form, shrinkage can produce estimates of very good quality [8][9][10]. Another popular regularizer is based on the ℓ1 norm (lasso) [11]. Lasso is known to lead to sparse estimates, reducing the effective dimension of the problem and stabilizing the estimator. Jagannathan and Ma [12] considered portfolio optimization under a constraint excluding short positions. Although they did not speak about regularization, a no-short constraint is, in fact, a special case of an asymmetric ℓ1 regularizer.
Brodie et al. [13] and DeMiguel et al. [14] studied the effect of ℓ1 regularization on the performance and stability of portfolio selection. Subsequently, a number of groups investigated various aspects of the application of ℓ1 and related regularizers in portfolio optimization, e.g. [15][16][17][18][19].
The problem of optimizing the variance under an ℓ1 constraint is a quadratic programming task which can be solved numerically. Our purpose here is to solve this problem analytically, which, to the best of our knowledge, has not been done before. The method that enables us to do this is borrowed from the theory of disordered systems and goes by the name of the method of replicas [20]. It assumes that the underlying distribution is Gaussian, and it works in the Kolmogorov limit, where both N and T go to infinity but their ratio r = N/T is kept finite.
We will show that the ℓ1 regularizer does not eliminate the instability, only shifts the critical value of r at which it occurs. (A similar effect was observed in the case of the Expected Shortfall risk measure in [21].) The new critical value turns out to be r = 2, corresponding to the fact that ℓ1 sets at most half the portfolio weights to strictly zero. (Of course, in a strongly heterogeneous portfolio the high volatility elements can be suppressed, with their weights rendered very small in the optimal portfolio. We distinguish this from the elimination of some of the dimensions by ℓ1, with the corresponding weights becoming exactly zero, even if there is little or no difference between their true volatilities.) There is an important difference between our analytic approach and the standard statistical estimation procedure, which analyzes a given sample and tests it by cross validation [22]. Instead, our method allows us to average over the whole ensemble of samples. Correspondingly, the step-like effect of ℓ1, eliminating the dimensions one by one, is replaced upon averaging over the samples by a smooth, monotonically increasing density of the zero weights.
In order to make contact with a previous work in which we treated the case of excluded short positions [23], we are going to consider an asymmetric ℓ1 regularizer here, with different slopes for positive, resp. negative weights. We find that only the right hand side slope appears in the most important results.
Many dimensionality reduction or cleaning methods focus on the covariance matrix, especially on its spectrum. In contrast, the special version of the replica method we use allows us to derive the distribution of optimal portfolio weights directly.
The plan of the paper is as follows. In Section 2, we set up the problem and present some preliminary results. In Section 3, we recall some results from [23] where the task of optimizing over the N portfolio weights has been reduced to the optimization of an effective objective function depending on five order parameters. We also spell out the first order conditions (or saddle point conditions) that determine the stationary point of the objective function. The solution to the saddle point equations is analyzed in a number of subsections and the results for various special cases are displayed graphically. Section 4 is a summary of the results, while a sketch of the derivation of the effective objective function is provided in the Appendix.

Preliminaries
In this section, we set up the optimization problem, fix notation and present some preliminary results that will be useful as checks on the replica calculation later.
We consider a portfolio of N assets with random returns x_i, i = 1, . . . , N. For simplicity, we assume that the returns are independent Gaussian random variables with zero expectation value and variance σ_i², which may be different for each asset i. For the time being we assume that we have complete knowledge of the distribution of the returns. If we denote the portfolio weights by w_i, the return on the portfolio is Σ_{i=1}^N w_i x_i, and under the assumption above the variance of the portfolio will be
\sigma_P^2 = \sum_{i=1}^{N} w_i^2 \sigma_i^2 . (1)
This is to be minimized subject to the budget constraint
\sum_{i=1}^{N} w_i = N , (2)
where we set the budget to be N instead of the usual 1, to have O(1) weights in the limit of large N. Then, with the Lagrange multiplier associated with the budget constraint denoted by λ, we would have to find the minimum of
\sum_{i=1}^{N} w_i^2 \sigma_i^2 - \lambda \Big( \sum_{i=1}^{N} w_i - N \Big) (3)
over the weights w_i, a trivial task. So far, the distribution of the returns (in particular, the variances σ_i² of the assets) has been assumed to be known. In real life this is never the case; instead, we have to estimate the optimal weights and the portfolio variance on the basis of finite samples. Let us assume that we draw these samples from a multivariate distribution of independent Gaussian variables with individual standard deviations σ_i. These samples consist of T observations for each asset: x_it, i = 1, 2, . . . , N; t = 1, 2, . . . , T. We wish to learn to what extent it is possible to recover the true optimum of the variance and the optimal weights by averaging over a large number of such samples.
Thus, we have the optimization problem
\min_{\vec w} \sum_{i,j} w_i C_{ij} w_j , \qquad \sum_{i=1}^{N} w_i = N , (4)
where C_ij is the estimated covariance matrix
C_{ij} = \frac{1}{T} \sum_{t=1}^{T} x_{it} x_{jt} . (5)
Substituting (5) into (4), the optimization problem becomes
\min_{\vec w} \frac{1}{T} \sum_{t=1}^{T} \Big( \sum_{i=1}^{N} w_i x_{it} \Big)^2 , \qquad \sum_{i=1}^{N} w_i = N . (6)
This is a quadratic optimization problem which can be solved numerically, as long as the covariance matrix is positive definite, which holds with probability one for T ≥ N, that is for r ≤ 1.
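As a concrete illustration of this setup, here is a minimal NumPy sketch (all parameter values are our illustrative choices): it generates a sample, builds the estimated covariance matrix (5), and solves (4) in closed form via the Lagrangian of the budget-constrained problem, which gives w* = N C⁻¹1/(1ᵀC⁻¹1).

import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300                          # r = N/T = 1/3
sigma = np.ones(N)                       # true volatilities, uniform for simplicity

X = rng.normal(0.0, sigma, size=(T, N))  # T observations of N independent returns
C = X.T @ X / T                          # estimated covariance matrix, Eq. (5)

ones = np.ones(N)
y = np.linalg.solve(C, ones)             # C^{-1} 1, well defined for r < 1
w_est = N * y / (ones @ y)               # optimal weights under sum(w) = N
print(w_est @ C @ w_est)                 # in-sample variance of the estimated optimum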
As r approaches 1 from below, sample fluctuations become larger and larger, until at r = 1 the estimation error diverges, and for r > 1 the optimization becomes meaningless. In order to tame the large sample fluctuations, it is a standard procedure to introduce regularizers that suppress large excursions of the estimated weights (at the price of introducing bias).
The regularizer we wish to use here is based on the ℓ1 norm (lasso) [11]. It is known to result in sparse estimates, which in the present context means eliminating a part of the assets from the optimal portfolio, thereby reducing its effective dimension. Lasso is extensively used in a variety of problems in high-dimensional statistics and machine learning [22,24]. Its first application to portfolio optimization is due to Brodie et al. [13] and DeMiguel et al. [14]. Owing to the non-analytic character of lasso, a full analytic treatment has, to our knowledge, not been attempted. An analytic approach valid in the large N limit will be presented in the next section.
Let us spell out the ℓ1 regularizer we are going to apply:
\ell_1(\eta_1, \eta_2) = \sum_{i=1}^{N} \big( \eta_1 w_i \theta(w_i) - \eta_2 w_i \theta(-w_i) \big) , (7)
where η1 and η2 are positive coefficients and θ(x) is the Heaviside function. The regularizer so defined is asymmetric, having different slopes for positive and negative weights. The special case η1 = η2 = η corresponds to the usual expression η Σ_i |w_i|. Keeping the two slopes different allows us to penalize long and short positions differently.
Our regularized objective function is then
F(\vec w) = \frac{1}{2T} \sum_{t=1}^{T} \Big( \sum_{i=1}^{N} w_i x_{it} \Big)^2 + \sum_{i=1}^{N} \big( \eta_1 w_i \theta(w_i) - \eta_2 w_i \theta(-w_i) \big) + \lambda \Big( N - \sum_{i=1}^{N} w_i \Big) . (8)
As the first term is non-negative and the last term vanishes for w_i's satisfying the budget constraint, F is larger than or equal to the minimum of ℓ1(η1, η2), which is Nη1. Therefore F ≥ Nη1, where the equality holds when the variance vanishes and the weights minimize the regularizer ℓ1(η1, η2) (which requires that they lie on the simplex w_i ≥ 0 ∀i, Σ_i w_i = N). Alternatively, for the value of the objective function per asset we have the inequality
f = F/N \geq \eta_1 . (9)
This will prove important later. As it stands, (8) is amenable to numerical work, with the returns drawn from a suitable distribution. When the returns are independent Gaussians and N and T are large, one can derive the analytic results displayed in the next section.
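For numerical experiments, the regularized problem (8) can be posed as a convex program; the sketch below uses the cvxpy modeling library (our choice of tool, not prescribed by the text), expresses the asymmetric penalty through the positive and negative parts of w, and checks the bound (9). The variance normalization follows the 1/(2T) convention adopted above.

import cvxpy as cp
import numpy as np

def optimize_l1(X, eta1, eta2):
    """Minimize the sampled variance plus the asymmetric l1 penalty (7)
    under the budget constraint sum(w) = N."""
    T, N = X.shape
    w = cp.Variable(N)
    variance = cp.sum_squares(X @ w) / (2 * T)
    penalty = eta1 * cp.sum(cp.pos(w)) + eta2 * cp.sum(cp.neg(w))
    problem = cp.Problem(cp.Minimize(variance + penalty), [cp.sum(w) == N])
    problem.solve()
    return w.value, problem.value

rng = np.random.default_rng(1)
N, T = 100, 300
X = rng.standard_normal((T, N))
w_opt, F_opt = optimize_l1(X, eta1=0.1, eta2=0.1)
print(F_opt / N >= 0.1)                  # the bound f = F/N >= eta1 of Eq. (9)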
A special limit of the above optimization problem is already worth considering at this point, because it provides an important consistency check on the results to be presented later. Let us assume that we have very large samples compared with the number of assets in the portfolio, i.e. T ≫ N, or r = N/T → 0. This means we have complete information about the distribution of returns. Then, for the independent random returns considered here, the covariance matrix C_ij becomes diagonal, with diagonal elements σ_i², and the optimization problem becomes
\min_{\vec w} \Big[ \frac{1}{2} \sum_{i} w_i^2 \sigma_i^2 + \sum_{i} \big( \eta_1 w_i \theta(w_i) - \eta_2 w_i \theta(-w_i) \big) - \lambda \Big( \sum_{i} w_i - N \Big) \Big] . (10)
A little reflection shows that the solution of this optimization problem can satisfy the budget constraint Σ_i w_i = N for a positive N only if the Lagrange multiplier λ is larger than the right slope η1 of the regularizer: λ > η1. Then the Lagrange multiplier works out to be
\lambda = \eta_1 + \frac{N}{\sum_{j=1}^{N} 1/\sigma_j^2} , (11)
the optimal weights are
w_i^* = \frac{\lambda - \eta_1}{\sigma_i^2} , (12)
the minimal value of the objective function F obtains as
F^* = \frac{N^2}{2 \sum_{j=1}^{N} 1/\sigma_j^2} + N \eta_1 , (13)
while the minimal value of the objective function per asset is
f^* = \frac{N}{2 \sum_{j=1}^{N} 1/\sigma_j^2} + \eta_1 . (14)
Note the order of magnitudes in the above formulae: λ, η1,2 and w_i^* are of O(1), while the sum Σ_{j=1}^N 1/σ_j² and the objective function are of O(N). (Here and in most of the following, we are assuming that the variances σ_j² are of order one. When one of the assets is riskless, the corresponding σ_j² goes to zero, and one has to pay special attention to this exceptional asset.) We also have to point out that there is a difference in notation relative to our earlier papers, especially [23], where we absorbed a factor 1/2r into the definition of the objective function f* and the Lagrange multiplier λ. This did not change any of the results there, except for sending λ to infinity in the limit r → 0, which resulted in some convenience. In contrast to that paper, instead of considering the special limit η1 = 0 and η2 → ∞, here we are going to keep the coefficients of the regularizer finite, so the convention of absorbing 1/2r into the objective function would dictate its absorption into η1 and η2 as well. This would distort some of the figures and would make the message of the paper harder to grasp. Therefore, in the present paper we write the factor 1/2r out explicitly and keep it throughout.
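The full-information formulas (11)-(14) are straightforward to evaluate; the following sketch assumes the normalization conventions fixed above.

import numpy as np

def full_information_solution(sigma, eta1):
    # lambda, optimal weights and objective per asset in the r -> 0 limit,
    # Eqs. (11), (12) and (14)
    N = len(sigma)
    s = np.sum(1.0 / sigma**2)
    lam = eta1 + N / s
    w_star = (lam - eta1) / sigma**2     # satisfies sum(w_star) = N by construction
    f_star = N / (2 * s) + eta1
    return lam, w_star, f_star

sigma = np.array([1.0] * 50 + [np.sqrt(10.0)] * 50)
lam, w_star, f_star = full_information_solution(sigma, eta1=0.1)
print(lam, w_star.sum(), f_star)         # w_star.sum() = 100 = N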
The results obtained above for the Lagrange multiplier, the optimal weights and the optimal value of the objective function in the limit r → 0 are the true values of these quantities that would be obtained over an infinitely long observation time, when sample fluctuations become irrelevant. Likewise, in the same limit the distribution p(w) of the optimal portfolio weights would be a series of sharp spikes,
p(w) = \frac{1}{N} \sum_{i=1}^{N} \delta(w - w_i^*) , (15)
where δ(x) is the Dirac delta distribution.
Results for the variance optimized under an ℓ1 constraint

Our task is to find the optimum of the objective function in (8), where the returns x_it are assumed to be drawn from the joint probability density of N independent Gaussian variables with zero mean and variance σ_i². Following the special version of the replica method laid out in [21], in [23] we showed how the optimization of (8) could be reduced to that of an effective objective function depending on five "order parameters". The method we applied to achieve this was the method of replicas, borrowed from the statistical mechanics of disordered systems [20]. (We will denote this effective objective function by the same symbol f as its full-information counterpart in the preceding section, and will omit the adjective "effective" in the following.) The derivation has been presented in [21] and also in the appendices of [23], and is sketched in the Appendix to this paper for easier reference. In the present section, we can start from the expression (16) for the effective objective function f(λ, q0, ∆, q̂0, ∆̂), depending on the order parameters λ, q0, ∆, q̂0 and ∆̂, as given in the Appendix. Its last term is the double average ⟨· · ·⟩_{z,σ}, over the standard Gaussian variable z and over the assets, of the minimum of a single-asset "potential" V, defined in (17); the averaging recipe is spelled out in (18), with σ_i the standard deviation of the distribution of returns on asset i. The minimum of the potential V is located at the weight w* given in (19). Substituting this back into (17) and performing the averaging according to the recipe (18), we find the explicit form (20) of the last term in (16), which thus becomes (21). The function W appearing there is the third integral of the standard normal Gaussian density; its precise definition will be given shortly, together with two more functions that appear frequently in the following.
Stationarity of (21) with respect to the order parameters gives the first order conditions (22)-(26); here r = N/T, as before. The functions Φ, Ψ and W are the successive integrals of the Gaussian density:
\Phi(x) = \int_{-\infty}^{x} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} , (27)
\Psi(x) = \int_{-\infty}^{x} \Phi(t)\, dt , (28)
W(x) = \int_{-\infty}^{x} \Psi(t)\, dt . (29)
In the first order conditions the notations w_1^{(i)}, w_2^{(i)} and σ_w^{(i)} have been introduced; they are defined in (30), (31) and (32), where use has been made of (22) and (23). With this we can eliminate q̂0 and ∆̂ from our equations. Proceeding similarly in (21) and using (26), we find the expression for the objective function in terms of the remaining three order parameters, equation (33). According to the derivation of the objective function in the Appendix, when (24), (25) and (26) are solved and λ, q0 and ∆ are obtained as functions of the control parameters r, η1 and η2, equation (33) gives the in-sample estimate of the objective function. With (19), the distribution of weights obtains from p(w) = ⟨δ(w − w*)⟩_{z,σ} as equation (34), where δ is the Dirac delta function. The first term in this formula shows that the ℓ1 regularizer eliminates some of the assets from the portfolio by setting their weights to zero; the density of these assets, n0, is given by (35). The two sums in (34) are made up of truncated Gaussians, the first corresponding to the weight distribution of positive (long) positions, and the second to that of negative (short) ones. We see then that the series of discrete, sharp spikes in (15) is broadened by sample fluctuations and, in addition to the positive weights, negative ones also appear. Equation (34) reveals the meaning of the symbols introduced in (30), (31) and (32): w_1^{(i)} and w_2^{(i)} are the centers of the estimated positive resp. negative weight distributions of asset i, and σ_w^{(i)} is the width of these distributions. Note how the distribution of optimal weights has been obtained directly from our formalism, without having to go through the calculation of the estimated covariance matrix.
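In numerical work, Φ, Ψ and W can be expressed in closed form through the Gaussian density and distribution function; the sketch below (using scipy) also verifies the identity W(x) = (x/2)Ψ(x) + (1/2)Φ(x) invoked in the contour-map subsection.

import numpy as np
from scipy.stats import norm

def Phi(x):
    return norm.cdf(x)                   # first integral of the Gaussian density

def Psi(x):
    return x * norm.cdf(x) + norm.pdf(x)          # second integral, closed form

def W(x):
    return 0.5 * (x**2 + 1) * norm.cdf(x) + 0.5 * x * norm.pdf(x)  # third integral

x = 1.3
print(np.isclose(W(x), 0.5 * x * Psi(x) + 0.5 * Phi(x)))   # the identity holds
print(Psi(10.0), W(10.0))                # large-x behavior: Psi(x) ~ x, W(x) ~ x^2/2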
The order parameter q0 will be of central importance for us. In [25] we showed that q0 is proportional to the out-of-sample estimate of the variance, Σ_ij w_i^est C_ij^true w_j^est, as expressed by (36). Its normalized version is
\tilde q_0 = \frac{\sum_{ij} w_i^{\rm est} C_{ij}^{\rm true} w_j^{\rm est}}{\sum_{ij} w_i^{\rm true} C_{ij}^{\rm true} w_j^{\rm true}} , (37)
where C_ij^true is the true covariance matrix, w_i^true are the corresponding optimal portfolio weights, and w_i^est are the optimal weights corresponding to the estimated covariance matrix. The denominator in (37) serves just to normalize q̃0. From the definition it is clear that q̃0 ≥ 1 and that
\sqrt{\tilde q_0} - 1 (38)
is the relative estimation error.
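The meaning of q̃0 can be checked by direct simulation; the sketch below estimates the out-of-sample ratio (37) for the unregularized problem and compares it with the value 1/(1 − r) quoted later in the text for that case (sample sizes are illustrative).

import numpy as np

rng = np.random.default_rng(2)
N, T, n_samples = 50, 200, 100           # r = 0.25
sigma2 = np.ones(N)                      # true variances; the true optimum is w_i = 1
ones = np.ones(N)

ratios = []
for _ in range(n_samples):
    X = rng.standard_normal((T, N)) * np.sqrt(sigma2)
    C = X.T @ X / T
    y = np.linalg.solve(C, ones)
    w_est = N * y / (ones @ y)                   # weights estimated from the sample
    out_of_sample = w_est @ (sigma2 * w_est)     # w' C_true w, with diagonal C_true
    ratios.append(out_of_sample / (N**2 / np.sum(1.0 / sigma2)))

print(np.mean(ratios), 1.0 / (1.0 - N / T))      # both close to 4/3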
Solution for complete information: r → 0

The limit r → 0 corresponds to T ≫ N. This means we have much more data than the dimension, so in this limit we have to recover the results of Section 2.
From (30), (31), and (32) we see that the width σ_w^{(i)} vanishes in this limit. (In the small r limit λ and q0 will be seen shortly to be of O(1), while ∆ is of O(r).) Then, in the limit r → 0 the arguments of the Ψ functions in (24) go to +∞ and −∞, respectively. For large x, Ψ(x) ∼ x, and Ψ(−x) is exponentially small, so (24) yields (39), which, by (30) and (32), leads to
\lambda = \eta_1 + \frac{N}{\sum_{j=1}^{N} 1/\sigma_j^2} ,
in accordance with (11).
Finally, from (26) we obtain q0 by noting that, for large x, W(x) ∼ x²/2 and W(−x) is exponentially small; the result is equation (40). Equation (40) then implies that for r → 0, q̃0 → 1, which means that the relative estimation error vanishes, a natural result in the limit T/N → ∞.
The value of the objective function at the stationary point is obtained by substituting the above results into equation (33):
f = \frac{N}{2 \sum_{j=1}^{N} 1/\sigma_j^2} + \eta_1 ,
in agreement with (14). Let us turn to the distribution of weights now. As the arguments of the Φs in (35) go to infinity for r → 0, the Φs themselves go to 1, so n0 vanishes in this limit.
From (30) and (31) we see that for r → 0 both sets of weights, w_1^{(i)} and w_2^{(i)}, tend to finite limits; in particular, w_1^{(i)} goes over into the optimal weights w_i^* found in (12).
In the same limit the standard deviations given in (32) vanish, so the Gaussians in (34) go over into Dirac delta functions. Since in the third term in (34) the delta spikes are multiplied by θ(−w), they do not contribute, so the distribution of weights in the r → 0 limit becomes
p(w) = \frac{1}{N} \sum_{i=1}^{N} \delta(w - w_i^*) , (43)
where the w_i^* are the true optimal weights given in (12). We see then that in the limit r → 0 our results derived via the replica method perfectly coincide with the results found in Section 2, thereby providing an important consistency check.

Including a riskless asset
If one of the assets, say the first, is riskless, σ1 → 0, then it must take on the full weight of the portfolio. (Remember that we have no constraint on the expected return of the portfolio, and we are looking for the global minimum of the risk functional. Therefore, if there is a riskless asset in the portfolio, the total wealth must be invested in this asset.) Let us see how our equations lead to such a result.
As we will see later, above r = 1 zero modes (eigenvectors belonging to the proliferating zero eigenvalues of the estimated covariance matrix) appear in the system, and they start competing with the riskless asset. We will study these zero modes later; in the present subsection we restrict the discussion to the range r < 1, to avoid the complications related to the zero modes. In order to ensure that our formulae remain meaningful, we assume that the product Nσ1² stays finite as σ1 → 0 and N → ∞. Now, if σ1 is much smaller than the other variances, a single term dominates in (26) and, using the asymptotic form of W, we obtain (44). Similarly, from (24) we get (45). Equations (44) and (45) imply (46). Then, by (36), the quantity q̃0 given in (37) works out to be
\tilde q_0 = 1 . (47)
As stated in (38), √q̃0 − 1 is the relative estimation error, so (47) means that the portfolio concentrated on the single riskless asset i = 1 is error free, an obvious result.
From (30), the weight w_1^{(1)} of the riskless asset can be worked out (equation (48)), which, by (45), leads to
w_1^{(1)} \to N , (49)
so the riskless asset carries the total weight, indeed. The only other weight that could compete with this is w_2^{(1)}, but it is positive and is multiplied by θ(−w) in the weight distribution, so it does not contribute, while all other weights are negligible in the σ1 → 0 limit.
Although according to (49) the riskless asset carries all the weight in the limit σ1 → 0, some small fluctuations still remain. The standard deviation given in (32) works out to be small but nonzero (equation (50)), corresponding to Gaussian fluctuations about the average (49). The above consideration is easily extended to the case when a few assets are much more stable than the others, in that their true volatilities are much smaller than those of the rest. Then this group of low volatility assets will take on almost all of the total weight N and share it among themselves in inverse proportion to their true variances, while the rest will receive only a small fraction of the weight. This distinction is encoded in the probability distribution of returns, and would show up in the estimated weights even in the absence of the regularizer. The regularizer's role becomes important when there is no such strong separation between the assets in the portfolio, and this is the case we are turning to now.
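The concentration of the optimal portfolio on a (nearly) riskless asset is easy to see in a small simulation; the following sketch uses the unregularized estimator, with all parameter values being our illustrative choices.

import numpy as np

# One nearly riskless asset among N: the estimated minimum-variance portfolio
# should put essentially the whole budget N on it.
rng = np.random.default_rng(5)
N, T = 20, 100                       # r = 0.2, safely below the zero-mode region r > 1
sigma = np.ones(N)
sigma[0] = 1e-3                      # asset 1 is nearly riskless

X = rng.normal(0.0, sigma, size=(T, N))
C = X.T @ X / T
ones = np.ones(N)
y = np.linalg.solve(C, ones)
w_est = N * y / (ones @ y)           # minimum-variance weights under sum(w) = N
print(w_est[0] / N)                  # close to 1: asset 1 carries almost all weight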

Elimination of assets by lasso
Before proceeding, we wish to emphasize again that we are calculating averages over the random samples, rather than trying to infer the behavior of the whole ensemble from studying a single sample. The difference is perhaps most clearly seen in the case of the distribution of optimal portfolio weights. The lasso is known to eliminate some of the variables (setting their weights to zero). For a given sample with a given ratio r = N/T this happens step-wise, i.e. as we increase the strength of the regularizer η (setting η1 = η2 = η for simplicity) first one, then two, three, etc. weights will be rendered zero, in descending order of the corresponding variances. In contrast, the averaging over the samples in our formalism results in a density n0 of zero weights that increases continuously with η and, according to (35), receives contributions from each asset i.
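This single-sample staircase is easy to reproduce numerically; the sketch below (our construction, using the cvxpy convex-optimization library) sweeps the symmetric regularizer strength on one fixed sample and counts the weights driven to zero, with 1e-6 as an arbitrary numerical tolerance.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
N, T = 30, 90
sigma = np.linspace(0.5, 2.0, N)                # heterogeneous true volatilities
X = rng.normal(0.0, sigma, size=(T, N))

w = cp.Variable(N)
eta = cp.Parameter(nonneg=True)                 # symmetric regularizer, eta1 = eta2 = eta
objective = cp.sum_squares(X @ w) / (2 * T) + eta * cp.norm1(w)
problem = cp.Problem(cp.Minimize(objective), [cp.sum(w) == N])

for value in [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]:
    eta.value = value
    problem.solve()
    n_zero = int(np.sum(np.abs(w.value) < 1e-6))
    print(f"eta = {value:4.2f}: {n_zero} weights eliminated")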
In Figure 1, we show numerical results for a two-variance portfolio and compare them to the results of the replica calculation. The numerical model is constructed from N = 100 assets, each having T = 300 data points (r = 1/3) drawn from a normal distribution. The variance of the returns is set to be σ2² = 1 for half of the assets, while the other half has σ1² = 10. As expected, the ℓ1 regularizer mostly eliminates the weights associated with the higher variance group: in the right hand side figure, the contribution of the high-variance assets to n0 is seen to dominate. Let us now see what our theory has to say about the probability of the elimination of an asset depending on its variance. In line with what is suggested by the above measurement, one expects that more volatile assets will be removed with larger probability than less volatile ones, that is, the contribution to n0 from asset i will be larger than that from asset j if σi > σj. Thus, we have to show inequality (51). If we introduce the notations (52), then from (30)-(32) we obtain the relations (53) and (54). (The constant a is obviously positive, and it follows from (9) and (33) that λ ≥ η1, so b is non-negative.) If we call z_i = z, the other three variables are simply proportional to it: y_i = bz, z_j = az, y_j = abz. The inequality (51) can then be written as (55). The definition of Φ, (27), then requires that
f(z) = \int_{bz}^{z} dt\, e^{-t^2/2}
be a decreasing function of z. But
\frac{df}{dz} = e^{-z^2/2} - e^{-b^2 z^2/2} < 0 ,
indeed, because b < 1. Thus, we have shown that more volatile assets are eliminated from the portfolio by ℓ1 with higher probability.

Resolution of portfolio weights
Turning now to the distribution of non-zero weights, we see from (34) that the discrete spikes in (43) split into two and get broadened by averaging over the samples. Figure 2 is an illustration of p(w) in the special case when all the standard deviations are the same, σi = 1 for all i, and η1 = η2 = η.
As we can see, with increasing r the Gaussians making up the distribution of weights become broader and broader, and the original sharp structure of p(w) becomes washed away.
The question arises how small r must be in order to make it possible to resolve the structure of the weight distribution of a portfolio consisting of, say, just two classes of assets, N/2 assets with volatility σi and N/2 with volatility σj. A glance at Figure 3 shows that this is possible as long as the distance between the centers of the two Gaussians is larger than the mean of their standard deviations. From Figure 3 it is also clear that it is sufficient to consider the positive-weight side of the distributions, so the requirement for resolvability is inequality (56), where we have assumed σj < σi. When we have a large number of observations, i.e. r ≪ 1, then λ − η1, q0, and ∆ can be replaced by their r = 0 values, as given in (39), (40), and by ∆ = 0, respectively. Then the resolvability of the two peaks will only depend on r and the two volatilities, and the criterion of resolvability becomes (57), where we have substituted σi for half of the assets and σj for the other half. It is then clear that for small r's the inequality (57) is easily satisfied for σ's sufficiently far apart. However, as will be seen shortly, with increasing r the coefficient (λ − η1)(1 + ∆) on the left hand side of (57) decreases rapidly, and the inequality gets violated: sample fluctuations wash the structure away. When one tries to estimate the portfolio weights from a single sample of empirical data, one is effectively picking the weights from the multimodal distribution p(w) (like the distribution in Fig. 4, but with many more peaks). If the peaks are well separated and narrow, the estimates so obtained will be close to the true weights, but this assumes small values of r, that is, a large number of observations T. If T is not very large compared to N, the distribution of weights will lose its discrete structure, and the estimated weights may have very little to do with their true values. Figure 4 shows how much of the discrete structure is already lost for r = 0.3, while for r = 0.9 there is no way to resolve the structure.

Results in the high-dimensional regime
In this subsection, we present results for the range of N and T values where their ratio is neither very small nor very close to r = 2. While at the two extremes it is easy to get analytic results by hand, in the intermediate r range one has to solve the first order conditions with the help of a computer. The results will be displayed below in a few figures. For comparison, the results for η1 = η2 = 0 (no regularization) and η1 = 0, η2 → ∞ (no short positions allowed) are also shown. When the full regularizer is applied, we set η1 = η2 = η, for simplicity. Also, since we have already displayed the results that depend on the heterogeneity of the portfolio (dominance of the riskless asset, preferential elimination of the large volatility items, and the condition for the resolvability of nearby volatilities), we can henceforth set σi = σ = 1 for all i, again for simplicity.
Without regularization, the optimization of variance does not have a meaningful solution beyond r = 1, where the first zero eigenvalues of the covariance matrix appear. Then q0 and ∆ diverge in the limit r → 1 − 0, while λ and the in-sample estimate for the objective function vanish at r = 1. In the absence of regularization, the density n0 of zero weights is identically zero. Regularization extends the region where the optimization can be carried out from 0 ≤ r < 1 to 0 ≤ r < 2. We can see from Figure 5 that for small values of the coefficient η of the regularizer, n0 is very small for r < 1, but starts increasing fast above r = 1, ultimately going to 1/2. Note that n0 can be directly measured by numerical simulations; the agreement between the replica calculation and numerical simulation has already been shown in Figure 1.
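The growth of the density n0 with r can be measured by averaging over samples; the sketch below uses deliberately modest sample counts (so the output is noisy) and the same conventions as the snippets above.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
N, eta, n_samples = 60, 0.1, 20

for T in [120, 60, 40, 32]:                     # r = 0.5, 1.0, 1.5, 1.875
    n0 = 0.0
    for _ in range(n_samples):
        X = rng.standard_normal((T, N))
        w = cp.Variable(N)
        obj = cp.sum_squares(X @ w) / (2 * T) + eta * cp.norm1(w)
        cp.Problem(cp.Minimize(obj), [cp.sum(w) == N]).solve()
        n0 += np.mean(np.abs(w.value) < 1e-6) / n_samples
    print(f"r = {N / T:5.3f}: n0 = {n0:.2f}")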
There is a simple relationship between the density n0 of zero weights and the order parameter ∆. From equations (25) and (35) one can see that
\Delta = \frac{r(1 - n_0)}{1 - r(1 - n_0)} . (59)
The rapid growth of n0 above r = 1 translates into a strong increase of ∆. (Without regularization ∆ would diverge at r = 1.) With the regularizer on and n0 going to 1/2 as r → 2 − 0, ∆ ultimately diverges at r = 2. Equation (59) can serve as a recipe for the numerical determination of ∆ through n0. Figure 6 shows the behavior of the order parameter q0, related to the estimation error and the out-of-sample estimate of the objective function; q0 is a quantity that can be obtained directly from simulations, and the analytical and numerical results are compared in Figure 6 for various values of η. Without the regularizer, q0 would diverge at r = 1, similarly to ∆. As a vestige of this, for small values of the regularizer's coefficient η, q0 shows a strong "resonance" around r = 1, but remains finite, and decreases above r = 1 to a finite limit. For larger η's the resonance is suppressed; in particular, in the no-short-selling limit (η2 → ∞) q0 is monotonically increasing over the entire interval 0 ≤ r < 2.
Finally, the in-sample estimator for the objective function f can be obtained from (33) by calculating λ from the stationarity conditions. The results for λ are exhibited in Figure 7. We shall see shortly that f goes to η1 as r → 2 − 0, implying that the variance vanishes in this limit.

Contour maps of estimation error
In order to assess the performance of the regularization, we have to construct the contour lines of the estimation error. For simplicity we consider here a uniform portfolio with all the true variances σi² = 1, and for a first orientation we let the left hand side slope η2 of the regularizer go to infinity while keeping the right hand side slope η1 finite. The advantage of such an arrangement is that it excludes all negative weights: w_2^{(i)} defined in (31) goes to infinity, and the corresponding terms drop out of (24), (25) and (26). This leads to the much simplified set of equations (60)-(62). Applying the identity
W(x) = \frac{x}{2}\,\Psi(x) + \frac{1}{2}\,\Phi(x)
in the last equation and using the previous two, after some simple manipulations one is led to the result that the arguments of the functions Ψ, Φ and W above are all equal to (λ − η1)/2r. Then the equations themselves become (63)-(65). The last equation gives the solution for λ as (66), where W^{(−1)} is the inverse of W. With r increasing, λ decreases and goes to η1 for r → 2. As we have seen earlier, λ cannot be smaller than η1, so the square root in (66) remains real, and r cannot grow beyond 2. Substituting (66) into (64) and (65), respectively, we get the other two order parameters as functions of r. At first it may seem surprising that they depend only on r and do not depend on η1 at all. (A little reflection shows that this is due to the combined effect of the exclusion of short positions and the budget constraint.) In particular, the order parameter q0, which determines the out-of-sample estimator for the objective function and the estimation error, works out as (67). This is independent of η1, but, of course, not independent of the regularization. Without regularization (and ours is a very strong one at that; remember that we let η2 → ∞ at the beginning of this subsection) we would have
q_0 = \frac{1}{1 - r} ,
which blows up at r = 1, whereas (67) smoothly increases from 1 to π as r goes from zero to 2. Because q0 is independent of η1, if we constructed the contour lines of q0, i.e. the lines of fixed q0 on the r–η1 plane, we would get a series of horizontal lines stacked above each other.
(The above solution taken at η1 = 0 is the optimization of the variance with a no-short constraint that we studied in [23].) When η2 is finite, we have to resort to a computer to construct the contour lines of q0. Now we set η1 = η2 = η, that is, we consider a symmetric regularizer. The resulting q0 contour lines are depicted in Figure 8a. (This figure contains the same information as Fig. 6: the difference is that there q0 was shown as a function of r, with the value of η as the parameter of the curves, while here we are showing the constant-q0 lines on the r–η plane, with q0 as the parameter.) We recognize the nearly horizontal contour lines immediately: in the lower regions of the figure (below r = 0.3, say) the lines of fixed q0 are nearly independent of the strength of the regularizer. As we go higher, the effect of the regularizer starts to be felt more and more. The estimation error (√q0 − 1) on the first five contour lines, from bottom up, is 5%, 10%, 20%, 30%, and 40%, respectively. These lines are nearly horizontal, which means that if we have enough data the strength of the regularizer hardly matters at all; the error would be almost the same even for η = 0. The first line where we can see a definite increase of r with η is the one corresponding to the relative estimation error 0.4. Beyond this point the regularizer takes over, and the estimation error for a large enough η is determined more by the regularizer than by the size of the sample. We see then that either we have a sufficient amount of data, and then the regularizer does not play a very important role, or it does, but by then the error is so large as to make the whole optimization pointless. In the higher regions of the contour map, the optimization is completely determined by the regularizer. The highest contour line corresponds to q0 = π, with r hitting its critical value of 2. For q0 increasing further, r must decrease (see Fig. 6). With q0 going to infinity the contour lines shrink to the point η = 0, r = 1, corresponding to the singularity of the unregularized problem.

The critical behavior at r = 2
Let us start the analysis of the critical point with equation (26) and consider the general case where η1 is different from η2; we will see that η2 does not appear in the results around r = 2. From equations (26), (30), (31) and (32) it is clear that the limiting behavior of the various quantities depends on λ and ∆, because q0 remains finite here. The order parameter ∆ diverges for r → 2 − 0, but we will verify later that λ − η1 goes to zero so fast that the product (λ − η1)(1 + ∆) still vanishes at r = 2. At the same time, the arguments of the functions Φ, Ψ and W all go to zero in this limit. Then equation (26) becomes (70) or, to leading order in ε = 2 − r, (71), which shows that the product (λ − η1)∆ vanishes like ∼ ε, indeed. Similarly, from equation (24), with Ψ(0) = 1/√(2π), we get (72) near r = 2. Accordingly, the r → 2 − 0 limit of the relative estimation error given in (36) is
\tilde q_0 \to \pi \Big( \frac{1}{N} \sum_{i} \sigma_i^2 \Big) \Big( \frac{1}{N} \sum_{j} \frac{1}{\sigma_j^2} \Big) ,
where the expression multiplying π is, by force of the Cauchy inequality, larger than or equal to 1 for any distribution of the true volatilities σi; therefore q̃0 is larger than 1 for any distribution of the volatilities, as it should be. The asymptotic behavior of the order parameter ∆ in (25) can be worked out similarly to obtain (74), where use has been made of the expansion Φ(x) = 1/2 + x/√(2π) + . . . , valid for small x.
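For reference, the small-x expansions that drive these asymptotics follow directly from the definitions (27)-(29); we quote them to the order needed here:

\Phi(x) = \frac{1}{2} + \frac{x}{\sqrt{2\pi}} + O(x^3), \qquad
\Psi(x) = \frac{1}{\sqrt{2\pi}} + \frac{x}{2} + O(x^2), \qquad
W(x) = \frac{1}{4} + \frac{x}{\sqrt{2\pi}} + O(x^2).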
Going back to (71) and using (72) and (74), we find equation (75) for λ − η1, which vanishes quadratically, λ − η1 ∼ ε², for r → 2 − 0. For the density of the zero weights we find that n0 goes to 1/2 in the same limit (equation (76)).
Turning to the distribution of weights, we see that w_1^{(i)} → 0 for all i, so, in addition to the δ-peak at the origin, all the positive weights collapse to zero, but with a finite standard deviation, given in equation (77). As for the weights w_2^{(i)}, they all go to infinity, so the corresponding contributions to (34) vanish exponentially.
Finally, the objective function can be obtained from (33). Here the second term vanishes because of the divergence of ∆, while according to (75) the first term goes to η1, so
f \to \eta_1 , \qquad r \to 2 - 0 . (78)
As we see, η2 does not appear in any of the results near the critical point, but it is important to realize that its nonzero value ensures the vanishing of all the contributions involving w_2^{(i)}. What is happening at the transition at r = 2? To find the answer, we have to go back to the discussion around equation (9), where we found that the objective function satisfies f ≥ η1, with equality holding only when the empirical variance vanishes and the optimal weight vector lies on the simplex. Equation (78) then implies that it is precisely this that is happening at the critical point. According to equation (6), the variance is a sum of T squares. This vanishes only if each of the T terms vanishes separately. So we need to find a weight vector that points to the simplex and is orthogonal to the T random return vectors. This is exactly the same random geometry problem that we encountered in the case of the no-short-constrained optimization [23]. There we displayed a closed formula, valid for any N and T, for the probability of finding such a vector:
p(N, T) = \frac{1}{2^{N-1}} \sum_{k=T}^{N-1} \binom{N-1}{k} . (79)
This formula depends only on the symmetry, and not on the concrete form, of the return distribution, and as such it is universal. For N ≤ T the probability of finding such a solution is zero. For N exceeding T the probability starts to increase, becomes 1/2 at N = 2T, and goes to one as N increases further. If N and T go to infinity with their ratio r = N/T held fixed, the function p(N, T) goes over into a step function: the probability that the variance vanishes becomes zero for 0 < r < 2 and 1 for r > 2. Thus, the critical point at r = 2 corresponds to a sudden transition between a situation where the variance is positive and one where it is zero with probability one, while the objective function becomes identically equal to η1, corresponding to a flat optimization landscape. This transition is similar to the large number of phase transitions in random high-dimensional geometry studied in [26] and [27].
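The quoted properties of (79) are easy to check numerically; the sketch below evaluates the finite sum directly (assuming the Wendel-type form of the formula given above).

from math import comb

def p_vanish(N, T):
    # probability that a weight vector on the simplex is orthogonal
    # to all T random return vectors, Eq. (79)
    return sum(comb(N - 1, k) for k in range(T, N)) / 2 ** (N - 1) if N > T else 0.0

print(p_vanish(100, 200))   # N <= T: exactly zero
print(p_vanish(200, 100))   # N = 2T: exactly 1/2
print(p_vanish(400, 100))   # N >> T: approaches one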

Summary
We have considered the optimization of variance supplemented by a budget constraint and an asymmetric ℓ1 regularizer. The present treatment includes as a special case the no-short-constrained portfolio optimization problem [23]. We have presented analytical results for the order parameter q0, directly related to the out-of-sample estimator of the objective function and the relative estimation error; for the in-sample estimator of the objective function; for the density of the assets eliminated from the portfolio by the ℓ1 regularizer; and for the distribution of portfolio weights. We have studied the dependence of these quantities on the ratio r of the portfolio's dimension N to the sample size T, and on the strength of the regularizer. We have checked these analytic results by numerical simulations, and found general agreement.
As the most conspicuous property of ℓ1 is the step-like, one by one, elimination of the dimensions, we also ran numerical experiments on single samples to reproduce this phenomenon. We have confirmed the appearance of the steps, and checked that the overall trend of the numerical results by and large follows the theoretical curve, which is remarkable, since the measurement was carried out on a single sample of finite size, whereas the theory is meant to work in the limit where both N and T go to infinity and the results are averaged over the whole ensemble of random samples. We have also seen that averaging over merely ten numerical curves is already enough to remove most of the fluctuations. We have repeatedly emphasized that the replica theory we applied in the analytic work is designed to average over infinitely many samples, and thus the results reflect the typical properties of the ensemble. Empirical work, in contrast, usually deals with a single sample, or a small number of samples, and tries to infer the properties of the ensemble from the information contained therein. Considering the rapid broadening with r of the Gaussians making up the distribution of weights, one can immediately see how misleading a small number of samples can be. As portfolio optimization is just a simple representative example of quadratic optimization, our results carry a message for this kind of optimization problem at large.
The extension of the interval where the optimization can be carried out, the maximal proportion of one half of the dimensions eliminated by ℓ1, and the "resonance" of the estimation error around the unregularized critical point at r = 1 are important findings, as is the disappointing performance of ℓ1 in the given context. The poor performance should not be a surprise: in the given problem we were trying to rein in large fluctuations of a quadratic objective function by a regularizer which increases only linearly. The phase transition taking place at r = 2 belongs to the large family of transitions in random geometrical problems studied in [26] and [27], where they were shown to be universal in the sense that the critical point is independent of the distribution of the data. As a manifestation of this universality, the critical value r = 2 does not depend on the Gaussian nature of the returns that we assumed here for the sake of easy application of the replica method.
We wish to point out that the present problem can be viewed as a statistical field theory model of a rather simple kind: an N-component φ² model in zero spatial dimensions with random masses. Because of the lack of a higher order term (such as a φ⁴ term) in the model, there is no restoring force beyond the phase transition; instead we have a flat energy landscape. The budget constraint is a somewhat unusual feature of the model. If we replaced it by a spherical constraint, for example, we would immediately recognize the zero modes of the system as the Goldstone modes. With the budget constraint combined with the ℓ1 regularizer, the sphere is replaced by the simplex, the Goldstone modes become the zero modes freely roaming about this invariant set, and the order parameter ∆, blowing up at the r = 2 transition with an exponent −1, can be regarded as a kind of susceptibility associated with the transition. The extension of the concept of universality beyond the universality of the critical exponent to the universality of the critical point itself is due to the purely geometric nature of the transition.
To conclude, we would like to call attention to the fact that the transition at r = 2 is very easy to overlook in empirical work. Upon approaching this critical point, the solution of the optimization problem as posed here would become unstable against "transverse" fluctuations which would leave the length of the weight vector approximately constant, but would result in large fluctuations in its direction. This corresponds to the weight vector freely roaming over the simplex. In finance terms, it would mean the optimal portfolio ending up with a different composition in each sample. It is clear that such a situation is undesirable (such a frequent rebalancing of the portfolio would be technically difficult and would result in high transaction costs), so the investor should keep well away from the point of instability. In numerical work, however, one may use, even inadvertently, some of the standard solvers that often contain a built-in ℓ2 regularizer without a clear warning about it. The presence of such "hidden" ℓ2 regularizers in standard quadratic solvers has been pointed out in [23]. Such a regularizer will stabilize the solution and drive it toward the naive portfolio with all the weights equal. In a situation where the original problem is unstable, even a very small ℓ2 regularizer will suffice to do the job, thereby creating the illusion that a stable solution can be obtained on the basis of a small number of observations.
In this paper, we have discussed the problem of optimizing a portfolio of independent assets, which is the worst possible case from the point of view of the sparsity of the underlying process that generates returns. Some of our findings, e.g. the value of the critical point or the number of eliminated dimensions, will certainly be modified when correlations between the assets are taken into account. Indeed, we expect the critical point to be shifted towards values of r larger than 2, and the maximum number of items that can be set to zero to increase beyond N/2 if the variance of returns is dominated by a few principal components. The study of the effect of correlations is however left for a later work.

Author contribution statement
All three authors contributed to the project design. F. Caccioli did most of the derivation of the replica free energy, I. Kondor did most of the solution of the saddle point equations, and F. Caccioli and G. Papp did the numerical calculations. I. Kondor wrote up most of the manuscript.
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Appendix A: Derivation of the free energy with the replica method

As stated in the main text, we have to find the optimum of the objective function (8),
F(\vec w) = \frac{1}{2T} \sum_{t=1}^{T} \Big( \sum_{i=1}^{N} w_i x_{it} \Big)^2 + \sum_{i=1}^{N} \big( \eta_1 w_i \theta(w_i) - \eta_2 w_i \theta(-w_i) \big) + \lambda \Big( N - \sum_{i=1}^{N} w_i \Big) , (A.1)
where the returns x_it are drawn from the joint probability density of independent Gaussian variables with zero mean and variance σi². Any optimization problem can be embedded into the formalism of statistical physics by regarding the objective function F as the "energy functional" of a fictitious system, introducing a fictitious inverse temperature γ, and integrating the Boltzmann factor e^{−γF} over the weights w_i, for a given sample, to get the "partition function" Z. The logarithm of the partition function, ln Z, is essentially a cumulant generating function from which all the quantities of interest can be obtained; in particular, the optimal value of the objective function emerges in the "zero temperature" limit γ → ∞. The effectiveness of this procedure rests on the fact that we work in the limit of large N, where the Boltzmann distribution over the weights is extremely sharp around its maximum. The procedure just described gives us the optimal weights in a given sample of size T. However, if T is not much larger than the dimension N of the portfolio, we are in the realm of high-dimensional statistics, where sample fluctuations are large, and optimizing our portfolio over a single sample can be very misleading. Therefore, in order to capture the typical properties, we have to average over the full ensemble of samples. This is analogous to averaging over the "quenched" random samples in the statistical physics of disordered systems [20], which explains why the methods developed in that theory can be successfully applied in the portfolio optimization context.
In order to average over the samples, we have to average the logarithm of the partition function, which is a random variable fluctuating from sample to sample. Averaging the logarithm of a random variable is hard, while calculating the integer moments \overline{Z^n} may be feasible. Now Z^n is just the partition function of n independent copies or replicas of the system (hence the name of the method). Assuming that we can analytically continue \overline{Z^n} from the integers to real n, we can make use of the identity
\overline{\ln Z} = \lim_{n \to 0} \frac{\overline{Z^n} - 1}{n} . (A.2)
Of course, the analytic continuation of a function from the integers to the reals is not necessarily unique. It is plausible, however, to assume that in the case of a convex objective function like that in (A.1), in the limit of large N all the replicas will go to the same minimum, and the simplest analytic continuation will do the job. Because we cannot provide a rigorous proof of this claim, we should regard the results of the replica calculation as heuristic. This is why we performed extensive numerical simulations to back up the analytic results in this paper. The general agreement we found is clear evidence for the correctness of the results. On the other hand, deducing these nontrivial results from a purely numerical approach alone would have been very hard, if not impossible.
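Explicitly, the identity follows by expanding the replicated partition function for small n and averaging term by term:

\overline{Z^n} = \overline{e^{\,n \ln Z}} = 1 + n\,\overline{\ln Z} + O(n^2)
\quad \Longrightarrow \quad
\overline{\ln Z} = \lim_{n \to 0} \frac{\overline{Z^n} - 1}{n} .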
Let us now carry out the program sketched above. The replicated partition function, given in (A.3), involves an average ⟨· · ·⟩ over the probability distribution of returns. It refers to a system of n replicas of the original system, and the index a is introduced to label the different replicas, so that w_i^a represents the ith weight of the ath replica. Introducing an integral representation for the delta function and using the properties