A simulation-based methodology for evaluating hedge fund investments

This article introduces a large-scale simulation framework for evaluating hedge fund investments subject to the realistic constraints of institutional investors. The method is customizable to the preferences and constraints of individual investors, including investment objectives, performance benchmarks, rebalancing period and the desired number of funds in a portfolio, and can incorporate a large number of portfolio construction and fund selection approaches. To illustrate the methodology, we apply the framework to a subset of hedge funds in the managed futures space that contains 604 live and 1323 defunct funds over the period 1993–2014. We then measure the out-of-sample performance of three hypothetical risk-parity (RP) portfolios and two hypothetical minimum risk portfolios and their marginal contributions to a typical 60–40 portfolio of stocks and bonds. We find that an investment in managed futures improves an investor’s performance regardless of portfolio construction methodology and that equal risk approaches are superior to minimum risk portfolios across all performance metrics considered in the study. Our article is relevant for institutional investors in that it provides a robust and flexible framework for evaluating hedge fund investments given the specific preferences and constraints of individual investors.


INTRODUCTION
The hedge fund industry represented about US $3 trillion in assets under management (AUM) during the first quarter of 2015 according to the BarclayHedge Group. Therefore, hedge funds represent a significant portion of the portfolios of institutional investors with direct investments of $2.5 trillion and an additional $500 billion allocated through funds of funds. While there is a rich literature on quantitative approaches to portfolio construction, most studies fail to appropriately account for investment practices and, therefore, are not directly applicable for institutional investors who have their own unique set of investment constraints and preferences. Molyboga et al (2015), henceforth MBB, suggest that institutional investors cannot directly benefit from academic studies on hedge fund performance persistence because the studies: (i) ignore the delay in hedge fund reporting, thus relying on information that is not available at the time of investment decisions, 1 (ii) consider funds that have AUM that are too small for institutional investors, (iii) include funds with very short track records, and (iv) assess portfolios with too many constituent funds to be practical. 2 MBB apply a one-month lag parameter to account for the reporting delay, impose AUM and track record length requirements, and introduce a simulation framework that limits the number of funds within a portfolio and then evaluate out-of-sample performance using a stochastic dominance framework.
This article uses a simulation framework similar to MBB but applies it to the evaluation of portfolio construction approaches subject to real life constraints and uses additional performance metrics that evaluate the marginal portfolio contribution of a hedge fund portfolio to an investor's original portfolio. This study intentionally uses a few commonly used measures of performance to illustrate that the framework is not limited to a single measure. The framework is customizable to the preferences and constraints of individual investors regarding rebalancing periods and the desired number of funds in a portfolio and can incorporate a large number of portfolio construction and fund selection approaches. The methodology produces implementable results because it explicitly accounts for the hedge fund reporting delay reported in MBB and applies an in-sample/out-of-sample framework that incorporates common investment constraints when creating and rebalancing portfolios. The framework imposes the standard requirements of institutional investors regarding track record length and the amount of AUM. It also limits the number and turnover of funds in the portfolio by assuming that the institutional investor selects a discrete number of funds that stay in the portfolio until they no longer satisfy selection criteria. 3 The methodology utilizes a simulation framework to account for a large number of feasible portfolio constituents in each period.
We evaluate out-of-sample performance with several commonly used measures of standalone performance and marginal portfolio contribution. 4 Standalone performance measures include annualized return, Sharpe and Calmar ratios, maximum drawdown 5 and the t-statistic of α with respect to the Fung-Hsieh (2001) five-factor model. We measure marginal portfolio contribution by evaluating the improvement in Sharpe and Calmar ratios 6 by replacing a modest 10 per cent of the original investor's portfolio with a 10 per cent allocation to a simulated hedge fund portfolio. In this article, we consider a standard 60-40 portfolio of stocks and bonds as the original portfolio, but the framework is flexible to the choice of investor benchmark.
Standard statistical techniques are inappropriate for the evaluation of out-of-sample performance because simulation results are not independent: portfolio constituents overlap across simulations. We apply the bootstrapping methodology of Efron (1979) and Efron and Gong (1983) to estimate the sampling properties of the test results and draw statistical inferences about the relative performance of portfolio methodologies. Opdyke (2007) introduces an analytic formula for the asymptotic distribution of Sharpe ratios under very general conditions that include non-normal distributions, time-varying volatility and serial correlations. This approach is particularly powerful when applied to a single return series such as in asset allocation studies that use single indices for each asset class. For example, studies by Kat (2004), Lintner (1996), Abrams et al (2009) and Chen et al (2005) that demonstrate positive contribution of managed futures to traditional portfolios can directly benefit from utilizing the methodology introduced in Opdyke (2007). By contrast, the simulation methodology of this article produces many time series, and we select a bootstrapping approach because it can be applied to any performance measure while accounting for the lack of independence in simulation results.
We apply the framework with 10 000 simulations to a data set of 604 live and 1323 defunct Commodity Trading Advisors (CTAs) over the period 1993-2014. CTAs, a subset of hedge funds that has grown exponentially over the past 35 years, 7 are known for their historically strong performance during times of market crisis, notably the Financial Crisis of 2008, and, therefore, serve as a particularly interesting subset of hedge funds from a portfolio diversification perspective. We evaluate several popular risk-based approaches that include two minimum risk and three RP methods. While the approaches we consider are commonly used by both practitioners and academics, they are only a few of the portfolio construction approaches that can be evaluated within the framework. The methodology can be extended to a large number of quantitative portfolio construction approaches.
Our results are striking because an investment in CTAs improves performance regardless of the choice of the portfolio construction approach. For the out-of-sample period between January 1999 and December 2014, a 10 per cent allocation to managed futures improves the Sharpe ratio of the original 60-40 portfolio of stocks and bonds from 0.376 to 0.399-0.416 on average, depending on the portfolio construction methodology employed. Similarly, the Calmar ratio improves from 0.092 to 0.100-0.108 on average. Blended portfolios have higher Sharpe ratios in at least 89 per cent of simulations and higher Calmar ratios in at least 89.5 per cent of simulations. Our findings are consistent with Kat (2004), Lintner (1996), Abrams et al (2009) and Chen et al (2005), which report a positive contribution of managed futures to traditional portfolios.
Minimum risk portfolios perform the worst for all performance metrics. For example, their average Sharpe ratios are between 0.299 and 0.304, significantly lower from both an economic and statistical perspective than the 0.319 average Sharpe ratio of the random portfolios. By contrast, equal risk methodologies deliver superior average Sharpe ratios of 0.342-0.362. Our results are consistent with DeMiguel et al (2009), who find that an equal notional allocation (EN), which we consider an equal-risk approach, is superior to a minimum variance allocation (MV), which we consider a minimum risk approach.
We have performed a sub-sample analysis to evaluate the marginal contribution of a 10 per cent allocation to managed futures during periods of relatively poor and relatively good performance. A 60-40 portfolio produced a Sharpe ratio of −0.09 and a Calmar ratio of −0.025 during the relatively poor period between 1999 and 2008. During the exceptional period between 2009 and 2014, the benchmark portfolio delivered a Sharpe ratio of 1.12 and a Calmar ratio of 0.77. 8 During the 10-year period between 1999 and 2008, all portfolio construction approaches would have added value as measured in terms of average Sharpe and Calmar ratios with the equal-risk portfolios producing the best results.
During the 6-year period between 2009 and 2014, the blended portfolios have approximately the same average Sharpe ratios and slightly better Calmar ratios than the benchmark portfolio with the minimum risk approaches delivering the best results. The sensitivity of relative results to an evaluation window is common. For example, Anderson et al (2012) compare four investment strategies and find that the specific start and end dates of a backtest can have a material impact on the results. This study introduces a methodology that can be used for the evaluation of portfolio construction approaches with real-life constraints and is strengthened by the sub-sample analysis.
Our findings and methodology are relevant for institutional investors who might consider investing, or who are already currently invested, in hedge funds and managed futures because the framework can be customized to the specific preferences and constraints of investors to maximize the benefits of hedge fund portfolios.
The remainder of the article is organized as follows: The next section describes the data and accounts for biases; the subsequent section discusses the risk-based approaches and introduces the large-scale simulation framework; the section after that presents empirical out-of-sample results; and the final section concludes.

DATA
There are several commonly used CTA databases: BarclayHedge; CISDM (formerly the MAR database); Lipper (formerly TASS); and Eurekahedge. Joenvaara et al (2012) perform a comprehensive study of publicly available databases of hedge fund returns and report that BarclayHedge provides the highest quality data out of the databases considered. Moreover, the BarclayHedge database is the largest publicly available database of CTAs with 1013 active and 3660 defunct funds over the period from December 1993 to December 2014. Therefore, we use BarclayHedge for this study as it is the most comprehensive and highest quality publicly available database of CTA returns.
We perform a number of filtering steps to ensure data quality and limit the scope of the study to the funds that would be appropriate for institutional investors who are interested in making direct investments. We explicitly account for the survivorship, backfill, incubation and liquidation biases that are common within CTA and hedge fund databases. 9 We include the graveyard database that contains defunct funds to account for the survivorship bias. The backfill and incubation biases arise because of the voluntary nature of self-reporting. 10 We use a combination of two approaches to mitigate these biases. The first methodology, suggested by Fama and French (2010), limits the tests to those funds that managed at least $10 million in AUM normalized to December 2014 values. Once a fund reaches the AUM minimum, it is included in all subsequent tests to avoid creating selection bias. Unfortunately, many CTAs, including very successful and established ones, originally reported only net returns for an extended period of time before their initial inclusion of AUM data. Using Fama and French (2010) methodology exclusively would completely eliminate large portions of valuable data for such funds. To include this data, we apply the technique suggested by Kosowski et al (2007), which eliminates only the first 24 months of data for such funds. We use the liquidation bias estimate of 1 per cent as suggested in Ackermann et al (1999). After accounting for the biases, our data set includes 604 live and 1323 defunct funds for the period between December 1995 and December 2014.
We use the Fung-Hsieh five factor model of primitive trend following systems, introduced in Fung and Hsieh (2001), as benchmarks in measuring the performance of CTA portfolios. The factors include PTFSBD (bonds), PTFSFX (foreign exchange), PTFSCOM (commodities), PTFSIR (interest rates) and PTFSSTK (stocks) while the 3-month Treasury bill (secondary market rate) series with ID TB3MS from the Board of Governors of the Federal Reserve System serves as a proxy for the risk-free rate. Table 1 reports summary statistics and tests of normality, heteroscedasticity and serial correlations in CTA returns by strategy and current status. Anson (2011) suggests that the 60-40 portfolio of stocks and bonds represents a typical starting point for a US institutional investor. In this article, this blend is constructed using the S&P 500 Total Return index and the JPM Global Government Bond Index. Table 2 reports the annualized excess return, standard deviation, maximum drawdown, Sharpe ratio and Calmar ratio of the 60-40 portfolio for 1999-2014. Over this time period, the portfolio delivered a Sharpe ratio of 0.376 and a Calmar ratio of 0.092. Figure 1 shows the performance of the portfolio from January 1999 to December 2014.
Although the 60-40 portfolio of stocks and bonds has been used extensively in the literature as a benchmark portfolio, the framework is flexible and can incorporate any investor-specific portfolio as a benchmark.

METHODOLOGY
In this section, we define the risk-based approaches considered in this study. Then we introduce a large-scale simulation framework with real-life constraints used to generate out-of-sample portfolio returns. Finally, we describe the performance metrics used to compare out-of-sample results.

Review of risk-based approaches
In this article, we evaluate two minimum risk and three equal-risk (or RP) approaches. 11 While the approaches we consider are commonly used by practitioners and academics, they are used merely as examples of portfolio construction approaches that can be evaluated within the framework. The methodology can be extended to a large number of quantitative portfolio construction approaches. Minimum risk portfolios include the MV approach with non-negativity constraints documented in Jagannathan and Ma (2003) and a minimum semi-standard deviation (MDEV) approach that is similar to the MV approach but only considers negative returns. Equal-risk, or RP, approaches include an EN approach, which is a naïve diversification 1/N method praised in DeMiguel et al (2009) and criticized in Kritzman et al (2010), an EVA approach highlighted in c (2012) and the classical risk parity (RP) approach extensively discussed in Maillard et al (2010), Clarke et al (2013) and Qian (2013). We apply a random portfolio selection approach (Random) that serves as a benchmark in evaluating the RP approaches. The approaches are evaluated using a large-scale simulation framework with real-life constraints.
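To make the equal-risk family concrete, the following is a minimal sketch of three weighting rules computed from a covariance matrix. It is an illustration under stated assumptions, not the article's implementation: the function names and the example covariance matrix are hypothetical, the inverse-volatility rule is a common equal-risk heuristic that may or may not coincide with the EVA approach, and the RP weights are obtained with a simple cyclical coordinate descent on the equal-risk-contribution condition. The MV and MDEV portfolios require a constrained optimizer and are not shown.

```python
import numpy as np

def equal_notional(cov):
    """EN: naive 1/N weights, independent of the covariance estimate."""
    n = cov.shape[0]
    return np.full(n, 1.0 / n)

def inverse_volatility(cov):
    """Weights proportional to 1/volatility (assumed heuristic, ignores correlations)."""
    inv_vol = 1.0 / np.sqrt(np.diag(cov))
    return inv_vol / inv_vol.sum()

def risk_parity(cov, n_iter=500, tol=1e-12):
    """Equal-risk-contribution weights via cyclical coordinate descent.

    Each pass solves, for one fund i at a time, the condition
    w_i * (cov @ w)_i = 1/n, which is a quadratic in w_i.
    """
    n = cov.shape[0]
    b = 1.0 / n
    w = 1.0 / np.sqrt(np.diag(cov))          # starting point
    for _ in range(n_iter):
        w_old = w.copy()
        for i in range(n):
            c = cov[i] @ w - cov[i, i] * w[i]  # covariance with the other funds
            w[i] = (-c + np.sqrt(c * c + 4.0 * cov[i, i] * b)) / (2.0 * cov[i, i])
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w / w.sum()                        # rescale to a fully invested portfolio

def risk_contributions(w, cov):
    """Share of portfolio variance attributable to each fund."""
    rc = w * (cov @ w)
    return rc / rc.sum()

# Hypothetical three-fund covariance matrix (annualized volatilities 10%, 15%, 20%)
vols = np.array([0.10, 0.15, 0.20])
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])
cov = np.outer(vols, vols) * corr

w_rp = risk_parity(cov)
print(w_rp, risk_contributions(w_rp, cov))
```

With the RP weights, each fund's share of portfolio variance is the same, whereas EN equalizes notional exposure and the inverse-volatility rule equalizes standalone volatility only.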

Large scale simulation framework
In this article, we utilize a modification of the large-scale simulation framework with real-life constraints introduced in MBB. MBB apply the framework to evaluate persistence in hedge fund managers' performance and compare equally weighted portfolios of funds that rank in the top quintile based on the t-statistic of α with respect to a CTA benchmark (restrictive fund selection) against those of all available funds (random fund selection). By contrast, this article does not impose any ranking but rather focuses on the impact of choice of portfolio construction methodology on performance. The out-of-sample period is between January 1999 and December 2014, the longest out-of-sample backtesting period in CTA empirical research. The framework uses 10 000 simulations and a lag of one month to account for the delay in the performance reporting of CTAs. 12 Below we describe a single run of the simulation framework and then show how simulation results are evaluated.
Table 1: This table reports statistical properties of fund returns and residuals by strategy and current status. Column one presents the number of funds in each category. Columns two and three report the cross-sectional mean of the Fung-Hsieh (2001) five-factor model monthly α and the t-statistic of α. Columns four and five display the cross-sectional mean of kurtosis and skewness of fund residuals. Column six reports the percentage of funds for which the null hypothesis of normal distribution is rejected by the Jarque-Bera test. Column seven reports the percentage of funds for which the null hypothesis of homoskedasticity is rejected by the Breusch-Pagan test. Column eight reports the percentage of funds for which the null hypothesis of zero first-order autocorrelation is rejected by the Ljung-Box test. All tests are applied to fund residuals and the significance level is set at 10%.
A single run of the simulation framework
The in-sample/out-of-sample framework mimics the actions of an institutional investor who makes allocation decisions at the end of each month. The first decision is made in December 1998. Owing to the delay in CTA reporting, the investor has return information only through November 1998; thus, the investor considers all funds that have a complete set of monthly returns between December 1995 and November 1998. First, the investor eliminates all funds in the bottom quintile of AUM among the funds considered. This relative AUM threshold is more appropriate than the fixed AUM approach commonly used in the literature (for example, Kosowski et al (2007) use a fixed AUM level of $20 million) because the average level of AUM has increased substantially over the last 20 years. Then the investor randomly chooses five funds from the remaining pool of CTAs and allocates to them using the five risk-based approaches and a random portfolio allocation. Monthly returns are recorded for each portfolio construction approach for January 1999 using the liquidation bias adjustment for funds that liquidate during the month. At the end of January 1999, the pool of CTAs is updated and defunct constituents of the original portfolio are randomly replaced with funds from the new pool. Each portfolio is then rebalanced again using the original portfolio construction methodologies. 13 The process is repeated until the end of the out-of-sample period of December 2014. A single simulation results in six out-of-sample return streams between January 1999 and December 2014, one for each of the portfolio construction approaches.
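The steps of a single run can be sketched in a few lines. The code below is a simplified illustration, not the authors' implementation: the function names, the array layout (months in rows, funds in columns, NaN when a fund does not report), the synthetic data and the use of equal notional weights only are all assumptions made for brevity. It also assumes that the 1 per cent liquidation adjustment is applied as a -1 per cent return in the month a held fund stops reporting, and that the investor knows at the decision date which funds have already closed.

```python
import numpy as np

def eligible_funds(returns, aum, t, window=36, lag=1, aum_quantile=0.2):
    """Funds investable for month t, given a decision at the end of month t-1."""
    last = t - 1 - lag                 # last month with reported returns
    first = last - window + 1          # start of the in-sample window
    if first < 0:
        return np.array([], dtype=int)
    complete = ~np.isnan(returns[first:last + 1]).any(axis=0)
    idx = np.flatnonzero(complete)
    if idx.size == 0:
        return idx
    # Relative AUM threshold: drop the bottom quintile of the funds considered
    cutoff = np.quantile(aum[last, idx], aum_quantile)
    idx = idx[aum[last, idx] > cutoff]
    # Assumption: funds already closed at the decision date cannot be bought or kept
    return idx[~np.isnan(returns[t - 1, idx])]

def single_run(returns, aum, start, rng, n_funds=5, liquidation_return=-0.01, **kw):
    """One out-of-sample return stream for an equal notional (EN) portfolio."""
    n_months = returns.shape[0]
    held = np.array([], dtype=int)
    out = np.full(n_months, np.nan)
    for t in range(start, n_months):
        pool = eligible_funds(returns, aum, t, **kw)
        held = held[np.isin(held, pool)]           # keep funds that still qualify
        candidates = np.setdiff1d(pool, held)
        n_new = min(n_funds - held.size, candidates.size)
        if n_new > 0:                              # random replacement of dropped funds
            picks = rng.choice(candidates, size=n_new, replace=False)
            held = np.concatenate([held, picks])
        if held.size == 0:
            continue
        weights = np.full(held.size, 1.0 / held.size)
        r = returns[t, held]
        r = np.where(np.isnan(r), liquidation_return, r)  # fund liquidates during month t
        out[t] = weights @ r
    return out

# Synthetic demonstration data (not from any database)
rng = np.random.default_rng(0)
n_months, n_all = 60, 30
demo_returns = rng.normal(0.005, 0.03, size=(n_months, n_all))
demo_returns[50:, :10] = np.nan                    # ten funds stop reporting at month 50
demo_aum = np.tile(rng.uniform(1e6, 1e8, size=n_all), (n_months, 1))
oos = single_run(demo_returns, demo_aum, start=37, rng=rng)
```

Repeating `single_run` with different random draws, and with each weighting rule in place of the EN weights, would give the distribution of outcomes that the framework evaluates.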

Performance evaluation of out-of-sample results
Out-of-sample performance is evaluated using both standalone performance metrics and measures that consider portfolio contribution benefits. Standalone performance metrics include annualized return, maximum drawdown, Sharpe ratio, Calmar ratio, 14 Fung-Hsieh α and t-statistic of α. Performance contribution is measured as the resultant difference in Sharpe ratio and Calmar ratio from replacing 10 per cent of the original portfolio of stocks and bonds with portfolios of CTA funds constructed within the simulation framework. Since each performance measure is represented by a distribution that contains 10 000 values, distributions are compared using means and medians for all measures and the percentage of positive values for Fung-Hsieh α and the percentage of positive marginal Sharpe and Calmar ratios in the performance contribution measures. Since simulations are not independent, we apply a bootstrapping procedure to draw statistical inference.
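The standalone metrics can be computed from a monthly return series along the following lines. This is a sketch under assumptions: the article's exact definitions are not reproduced in this section, so the Calmar ratio is taken to be the annualized return divided by the maximum drawdown, the Sharpe ratio is annualized with the square root of 12, and the t-statistic of α uses plain OLS standard errors. The factor data in the example are synthetic, not the Fung-Hsieh series.

```python
import numpy as np

def annualized_return(r, periods=12):
    """Geometric annualized return from periodic returns."""
    return np.prod(1.0 + r) ** (periods / len(r)) - 1.0

def max_drawdown(r):
    """Largest peak-to-trough loss of the cumulative wealth path, as a positive number."""
    wealth = np.cumprod(1.0 + r)
    peak = np.maximum.accumulate(np.concatenate([[1.0], wealth]))[1:]
    return np.max(1.0 - wealth / peak)

def sharpe_ratio(excess, periods=12):
    """Annualized Sharpe ratio of a series of excess returns."""
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)

def calmar_ratio(r, periods=12):
    """Annualized return per unit of maximum drawdown (assumed definition)."""
    return annualized_return(r, periods) / max_drawdown(r)

def alpha_tstat(excess, factors):
    """OLS intercept and its t-statistic from a linear factor regression."""
    X = np.column_stack([np.ones(len(excess)), factors])
    beta, *_ = np.linalg.lstsq(X, excess, rcond=None)
    resid = excess - X @ beta
    dof = len(excess) - X.shape[1]
    cov_beta = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return beta[0], beta[0] / np.sqrt(cov_beta[0, 0])

# Small worked example: wealth path 1.10, 0.88, 0.924, 1.0164
r = np.array([0.10, -0.20, 0.05, 0.10])
print(max_drawdown(r), calmar_ratio(r))

# Synthetic five-factor example with a true monthly alpha of 1%
rng = np.random.default_rng(3)
factors = rng.normal(0.0, 0.05, size=(200, 5))
loadings = np.array([0.2, -0.1, 0.3, 0.0, 0.1])
excess = 0.01 + factors @ loadings + rng.normal(0.0, 0.01, size=200)
alpha, t_alpha = alpha_tstat(excess, factors)
```

Applying these functions to each of the 10 000 simulated return streams yields one distribution per metric and per portfolio construction approach.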

Bootstrapping procedure
The bootstrapping procedure follows each step of the simulation framework but limits the set of portfolio construction approaches to the Random portfolio methodology to which we choose to compare all other approaches. 15 Each simulation set consists of 10 000 simulations. The bootstrapping procedure includes 400 sets of simulations, a sufficient number to estimate P-values with high precision. A comparison of the performance metrics of the original simulation to the bootstrapped sets of simulations gives the P-values reported in the empirical results section.
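One way to read the comparison step is as an empirical P-value: the share of bootstrapped RANDOM sets whose summary statistic is at least as extreme as the value observed for a given approach. The sketch below illustrates that reading only; the function name is hypothetical and the numbers are synthetic. In particular, the synthetic sets are drawn independently for brevity, whereas in the procedure described above each set would come from rerunning the full simulation.

```python
import numpy as np

def bootstrap_p_value(observed, random_set_stats, tail="upper"):
    """Share of bootstrapped RANDOM sets at least as extreme as the observed statistic."""
    stats = np.asarray(random_set_stats, dtype=float)
    if tail == "upper":        # is the approach better than RANDOM?
        return float(np.mean(stats >= observed))
    if tail == "lower":        # is the approach worse than RANDOM?
        return float(np.mean(stats <= observed))
    raise ValueError("tail must be 'upper' or 'lower'")

# Synthetic illustration: 400 sets, each summarized by its mean Sharpe ratio
rng = np.random.default_rng(1)
n_sets, n_sims = 400, 1000
random_sets = rng.normal(0.30, 0.15, size=(n_sets, n_sims))
set_means = random_sets.mean(axis=1)

p_high = bootstrap_p_value(0.35, set_means, tail="upper")
p_low = bootstrap_p_value(0.25, set_means, tail="lower")
print(p_high, p_low)
```

With 400 sets, the smallest non-zero P-value that can be resolved is 1/400, so a value of zero means that no bootstrapped set reached the observed statistic.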

EMPIRICAL OUT-OF-SAMPLE RESULTS
In this section, we present information about the data set used in the simulation and out-of-sample results for the period between January 1999 and December 2014 generated by the large-scale simulation framework. Table 3 reports the average AUM threshold level for each year and the average number of funds meeting that threshold. The AUM threshold represents the 20th percentile of AUM among all active fund managers with a track record of at least 36 months.
There is a significant variation in the values of the AUM threshold over time which primarily reflects changes in AUM driven by industry growth and recent performance. The 2010 threshold value of $13.97 million is almost three times as high as the $5 million threshold value in 2001. The number of funds has nearly doubled over this time period representing substantial growth in the industry.
Analysis of out-of-sample performance of CTA portfolios as standalone investments
We analyze distributions of out-of-sample returns over the complete data period using means and medians of several performance metrics. Since simulations are not independent, we use a bootstrapping procedure to draw statistical inference.
Distributions of out-of-sample performance
Table 4 reports means and medians for the distributions of returns, volatilities, Sharpe and Calmar ratios and maximum drawdowns for each portfolio construction approach. The P-values are estimated using the bootstrap methodology. The superscript star indicates that the performance measure of a given portfolio approach exceeds that of the RANDOM portfolio at the 99 per cent confidence level. The subscript star shows that the performance measure of a given portfolio approach is lower than that of the RANDOM portfolio at the 99 per cent confidence level. The minimum risk approaches tend to have the lowest volatilities of the portfolio methodologies considered in the study. MV and MDEV have mean volatilities of around 6.8 per cent, whereas EVA and RP have volatilities of around 8.21 and 8.66 per cent, respectively, followed by EN and RANDOM with volatilities that exceed 11 per cent. However, the lower levels of volatility are not necessarily associated with lower drawdowns. For example, EVA has a maximum drawdown of 19.12 per cent, slightly lower than the 19.9 per cent maximum drawdown values of the minimum risk portfolios. Moreover, the minimum volatility approaches deliver low returns and risk-adjusted returns that are inferior to those of the other approaches. This finding is consistent with DeMiguel et al (2009), which documents the superior out-of-sample performance of the naïve 1/N (EN) approach relative to that of several extensions of mean-variance optimization including the MV approach. Jensen's inequality suggests the EN approach should dominate the RANDOM methodology in terms of Sharpe ratio because of the concavity of the Sharpe ratio. 16 The three equal-risk approaches have risk-adjusted performance that is superior to that of the RANDOM approach. In contrast, minimum risk approaches yield inferior results on average. Median values reported in Panel B show similar results.
While Table 4 presents mean and median values of several performance metrics, a complete evaluation of the portfolio construction methodologies should also consider distributions of out-of-sample performance. Figure 2 shows the distributions of Sharpe ratios generated by the large-scale simulation framework for each portfolio methodology. Each distribution is visualized using a standard box and whisker plot with the box containing the middle two quartiles, the thick line inside the box representing the median of the distribution and the whiskers displayed at the top and bottom 5 per cent of the distribution. The breadth of each distribution demonstrates the key benefit of using a large-scale simulation framework. Failing to account for the role of chance and evaluating portfolio construction techniques using a single stream, which represents a single draw of the distribution, can mislead investors about the relative performance of portfolio management techniques. Since the distributions are so wide, it might seem impossible to compare them with each other. Fortunately, this is not a new problem in quantitative finance and decision theory, where expected utility and stochastic dominance methodologies are applied to compare distributions. The framework is flexible and can employ utility functions and stochastic dominance to evaluate results; however, this article only considers means and medians for the sake of brevity. 17 The minimum risk approaches, MV and MDEV, have the lowest median Sharpe ratios and exhibit relatively large left tails. The equal risk approaches seem to perform better on average than the random portfolio methodology, but it is difficult to determine whether that relative performance is statistically significant, particularly since the standard statistical techniques are inappropriate due to dependence across simulation results. Therefore, we apply a bootstrapping procedure to estimate sampling distributions of the performance measures.
The P-values suggest that equal risk approaches (EN, EVA, RP) dominate RANDOM portfolios based on average Sharpe and Calmar ratios at a confidence level greater than 99 per cent (in fact, none of the 400 bootstrap simulations of RANDOM portfolios deliver superior average Sharpe and Calmar ratios). By contrast, minimum risk approaches (MV, MDEV) are inferior to RANDOM portfolios in terms of average Sharpe and Calmar ratios (all 400 bootstrap simulations of RANDOM portfolios yield superior average Sharpe and Calmar ratios). 18  The minimum risk approaches, MV and MDEV, underperform on average whereas the equal risk approaches, EN, EVA and RP, seem to outperform the RANDOM portfolio.
We utilize the Fung-Hsieh factor model introduced in Fung and Hsieh (2001) to account for the systematic risk exposures of hypothetical portfolios that might drive the above results. Table 5 reports mean and median values of Fung-Hsieh α and t-statistic of α and the percentage of positive αs for each portfolio methodology. The P-values are estimated using the bootstrap methodology. The superscript star indicates that the t-statistic of α of a given portfolio approach exceeds that of the RANDOM portfolio at 99 per cent confidence level. The subscript star shows that the t-statistic of α of a given portfolio approach is lower than that of the RANDOM portfolio at 99 per cent confidence level.
The minimum risk approaches, MV and MDEV, have mean t-statistics of α of around 1.59, which is lower than 2.26, the mean t-statistic of α of the RANDOM portfolio. [...] Fung-Hsieh t-statistic of α at the 99 per cent confidence level. Figure 4 shows the distributions of the Fung-Hsieh t-statistic of α for each portfolio methodology.
The minimum risk approaches have heavy left tails and underperform the other methodologies on average. Therefore, the three key metrics of risk-adjusted performance, whether Sharpe, Calmar or the Fung-Hsieh t-statistic of α, suggest that the minimum risk portfolios are inferior and the equal risk approaches outperform the RANDOM portfolio on average.

Analysis of the marginal performance contribution of CTA portfolios to the investor's original portfolio
In this section, we evaluate the marginal impact of an investment in CTA portfolios for investors who hold a benchmark 60-40 portfolio of stocks and bonds. The comparison is done using Sharpe and Calmar ratios calculated for blended portfolios against the investor's original portfolio. First, we consider marginal contribution by comparing the marginal change in performance of a 90-10 blended portfolio that replaces 10 per cent of the original portfolio allocation with the CTA portfolios from the simulation using Sharpe and Calmar ratios. Then, we investigate the impact of the allocation to the CTA portfolios on the performance of the blended portfolios.
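The marginal contribution calculation can be sketched as follows. This is an illustration under assumptions rather than the article's code: the function names are hypothetical, the blend is assumed to be rebalanced to its target weights every month, returns are treated as excess returns, and the benchmark and CTA series are synthetic placeholders for the 60-40 portfolio and the simulated CTA portfolios.

```python
import numpy as np

def sharpe(excess, periods=12):
    """Annualized Sharpe ratio of a series of excess returns."""
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)

def blend(benchmark, cta, cta_weight=0.10):
    """Replace cta_weight of the benchmark with the CTA portfolio (monthly rebalancing)."""
    return (1.0 - cta_weight) * benchmark + cta_weight * cta

def marginal_sharpe(benchmark, cta_sims, cta_weight=0.10):
    """Change in Sharpe ratio versus the benchmark, one value per simulated CTA portfolio."""
    base = sharpe(benchmark)
    return np.array([sharpe(blend(benchmark, c, cta_weight)) - base for c in cta_sims])

# Synthetic placeholder data: 192 months corresponds to January 1999 to December 2014
rng = np.random.default_rng(2)
months, n_sims = 192, 200
bench = rng.normal(0.003, 0.03, size=months)
sims = rng.normal(0.004, 0.025, size=(n_sims, months))

delta = marginal_sharpe(bench, sims)
share_improved = float(np.mean(delta > 0))   # share of simulations that improve the Sharpe ratio
print(delta.mean(), share_improved)
```

Calling `marginal_sharpe` over a grid of `cta_weight` values gives the kind of allocation sensitivity analysis discussed later in this section; the same pattern applies to the Calmar ratio.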
Relative performance of a 90-10 blended portfolio
Table 6 reports the average Sharpe and Calmar ratios of the blended portfolios and the percentage of simulations of blended portfolios that result in Sharpe and Calmar ratios that are superior to those of the original 60-40 portfolio.
The robustness of portfolio benefits stemming from an investment in CTAs is striking. Blended portfolios have higher Sharpe and Calmar ratios in at least 89 per cent of the scenarios among the worst performing minimum risk portfolios. Equal-risk portfolios have higher Sharpe and Calmar ratios in over 97 per cent of scenarios, and the improvement in average Sharpe ratios is as high as 10 per cent, with the original Sharpe improving from 0.376 to 0.41. Similarly, the equal-risk methodologies improve the average Calmar ratio by 10 per cent from 0.092 to over 0.1. Interestingly, a naïve diversification EN approach performs slightly better in terms of marginal performance contribution even though it marginally underperforms as a standalone investment. MBB perform analysis by market environment that can potentially give additional insight into the robustness of performance across market regimes. For brevity it is excluded here. 19 Analysis of marginal performance contribution is important, particularly when an investor already has exposure to a large number of systematic sources of return in his or her well-diversified portfolio. In that situation, strategies that harvest the same sources of return can look very attractive as standalone investments but do not improve the risk-adjusted return of the investor's portfolio. The framework employed here is flexible and can utilize an investor's existing portfolio as a benchmark against which the marginal contribution of hedge fund portfolios can be measured.
The impact of the size of the allocation to CTA portfolios on the performance of blended portfolios
By evaluating the impact of allocation weights on performance, the framework can be used to optimally allocate to hedge fund portfolios given an investor's specific preferences and constraints. This study considers the performance of blended portfolios that have allocations between 5 per cent and 60 per cent to CTA investments. Table 7 reports the performance of blended portfolios stated in terms of Sharpe ratio. Panel A reports the percentage of simulations that improve the Sharpe ratio over the original 60-40 portfolio of stocks and bonds. Panel B reports mean Sharpe ratios and Panel C reports median Sharpe ratios of the blended portfolios.
Average Sharpe ratios increase until the allocation to CTA portfolios reaches 40-50 per cent and decline thereafter. However, the improvement that comes with a higher allocation to CTA portfolios also comes with a higher risk. While an MV portfolio improves the Sharpe ratio of the investor portfolio in 89.6 per cent of scenarios with a 5 per cent allocation to CTA portfolios, that number declines to 74 per cent at a 60 per cent allocation level. Similarly, the percentage of positive contribution scenarios declines from 98.7 to 81.6 per cent for the EN approach as the allocation to CTA investments grows from 5 to 60 per cent. Figure 5 shows the distribution of the out-of-sample Sharpe ratios of the blended portfolios.
It is important to note that the framework implicitly assumes that the performance of the investor's original portfolio can be expressed by a single time series or a single outcome, completely ignoring the role of luck arising from active management decisions in the investor's portfolio. 20 A joint simulation of the investor's portfolio management techniques applied to the original portfolio constituents and the hedge fund portfolios has the potential to better account for luck in both types of investments but requires additional assumptions that are outside the scope of this article.
Table 8 reports the performance of the blended portfolios stated in terms of Calmar ratio. Panel A reports the percentage of simulations that improve the Calmar ratio over the original 60-40 portfolio of stocks and bonds. Panel B reports the mean Calmar ratios and Panel C reports the median Calmar ratios of the blended portfolios. The average Calmar ratio grows monotonically with additional allocation to CTA investments without reaching an intermediate peak as in the case of Sharpe ratios. However, the improvement comes with higher risk, as indicated by the declining percentages of scenarios with superior Calmar ratios. Figure 6 shows the distribution of the out-of-sample Calmar ratios of the blended portfolios.
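For readers who wish to reproduce the Calmar ratio, defined in the notes as the ratio of annualized excess return to maximum drawdown, a minimal Python sketch follows. Compounding returns to build the wealth path, annualizing the mean arithmetically and returning infinity when there is no drawdown are assumptions made here for illustration; the sample returns are hypothetical.

```python
import statistics


def max_drawdown(returns):
    """Largest peak-to-valley loss of the compounded path, as a positive fraction."""
    wealth = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst


def calmar_ratio(excess_returns, periods_per_year=12):
    """Annualized mean excess return divided by maximum drawdown."""
    annualized = statistics.fmean(excess_returns) * periods_per_year
    drawdown = max_drawdown(excess_returns)
    # Assumed convention: a path with no drawdown has an unbounded ratio.
    return annualized / drawdown if drawdown > 0 else float("inf")


# Hypothetical monthly excess returns, for illustration only.
sample = [0.02, -0.03, 0.015, 0.01, -0.02, 0.025]
print(max_drawdown(sample), calmar_ratio(sample))
```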
The optimal allocation choice depends on the specific preferences of individual investors and their aversion to risk. Investors who value average performance will tend to pay more attention to the means and medians of the performance distributions of the blended portfolios. By contrast, investors who are very risk averse will put more weight on the characteristics of the left tails.
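One way to compare allocation choices under different preferences is to summarize each distribution of out-of-sample outcomes by its mean, median and a left-tail statistic. The sketch below uses the 5th percentile as the left-tail measure; that choice, the function name and the sample values are assumptions rather than part of the original study.

```python
import statistics


def summarize(values):
    """Mean, median and 5th percentile of a list of simulated performance outcomes."""
    # quantiles(n=20) returns 19 cut points; the first one is the 5th percentile.
    cuts = statistics.quantiles(values, n=20)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "left_tail_5pct": cuts[0],
    }


# Hypothetical out-of-sample Sharpe ratios of blended portfolios, for illustration only.
simulated_sharpes = [0.31, 0.35, 0.38, 0.40, 0.41, 0.42, 0.43, 0.45, 0.47, 0.52]
print(summarize(simulated_sharpes))
```

An investor focused on average performance would rank allocations by the first two entries, whereas a very risk-averse investor would give more weight to the third.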

CONCLUDING REMARKS
This article introduces a quantitative large-scale simulation framework for the robust and reliable evaluation of hedge fund investments subject to the real-life constraints of institutional investors. The methodology is implementable and incorporates common investment constraints when creating and rebalancing portfolios. The framework is customizable to the preferences and constraints of individual investors, including investment objectives, rebalancing periods and the desired number of funds in a portfolio, and can include a large number of portfolio construction approaches. Thus, the methodology can benefit portfolio managers, investment officers, board members and consultants who make hedge fund investment decisions.
As an illustration of the framework, we applied it to a subset of hedge funds in managed futures, revealing a strikingly significant portfolio contribution of CTA investments to a typical 60-40 portfolio of stocks and bonds over the period from 1999 to 2014, though this contribution is much less significant during the exceptional period between 2009 and 2014, when the benchmark portfolio delivered a Sharpe ratio of 1.12. This finding is robust across a large set of parameters and all portfolio construction methodologies considered in the study. The empirical results suggest that equal-risk portfolios of CTAs outperform minimum risk approaches out-of-sample whether as standalone investments or as diversifiers to the investor's benchmark portfolio.
While the empirical findings can immediately benefit institutional investors who seek to enhance performance through better diversification and portfolio construction, this analysis is merely one application of the flexible large-scale simulation methodology that can be utilized more broadly to examine a large number of portfolio management techniques subject to real-life constraints.

evaluate performance using second order stochastic dominance which is particularly relevant because investors are often unaware of their own utility functions as reported in Elton and Gruber (1987). Levy and Sarnat (1970) and Fischmar and Peters (2006) suggest using stochastic dominance as an alternative to mean-variance analysis.
5. See Chekhlov et al (2005) for a formal definition of the maximum drawdown. It is typically defined as the largest peak-to-valley loss and represents a risk measure that is commonly used by practitioners. Calmar ratio is defined as the ratio of annualized excess return to the maximum drawdown.
6. Though in this paper marginal portfolio contribution is measured using Sharpe and Calmar ratios, in general it should be evaluated relative to the specific investment objectives of the investor. For example, a university endowment may target returns that exceed the university's spending rate over a market cycle. The framework can incorporate investor-specific performance metrics of marginal portfolio contribution.
7. According to the BarclayHedge Group which monitors assets under management, CTAs were managing $310 million in 1980, $10.5 billion in 1990 and $330 billion in the first quarter of 2015.
8. The results of the sub-sample analysis are available upon request.
9. For details, see Appendix A: Data cleaning.
10. Typically funds go through an incubation period during which they build a track record using proprietary capital. Fund managers choose to start reporting to a CTA database to raise capital from outside investors only if the track record is attractive and they are allowed to 'backfill' the returns generated before their inclusion in the database. Since funds with poor performance are unlikely to report returns to the database, incubation/backfill bias results.
11. See Appendix B for technical definitions of the risk-based approaches.
12. See MBB for a detailed description of the hedge fund reporting delay.
13. The framework is flexible: the number of funds in a portfolio, rebalancing frequency, AUM threshold levels and other parameters can be customized to reflect each investor's preferences and constraints.
14. Calmar ratio is defined as the ratio of the annualized excess return to the maximum historical drawdown.
15. The framework is flexible in comparing any two approaches to each other but requires performing additional bootstrapping simulations based on an investor's particular areas of interest.
16. Jensen's inequality states that E[g(X)] ≤ g(E[X]) for any concave function g such as the Sharpe ratio. See Rudin (1986) for a detailed explanation of Jensen's inequality.
17. See MBB for detailed examples of employing first and second order stochastic dominance to evaluate distributions of out-of-sample performance within a large-scale simulation framework.
18. The P-value is estimated by calculating the percentage of bootstrapped simulations of RANDOM portfolios that outperform the other portfolio methodologies for a given performance metric. For example, the P-value of 16 for EN in the Return category suggests that 16 per cent of bootstrapped simulations have a mean return that is higher than that of EN. Therefore, we fail to reject the hypothesis of RANDOM portfolios having a mean return that is lower than that of EN. That intuitively makes sense because random portfolios should have the same return as equal portfolios on average. We compare RANDOM portfolios to bootstrapped RANDOM portfolios for robustness. The P-values indicate that we cannot reject the hypothesis that the RANDOM portfolio is better or worse than the bootstrap RANDOM portfolios at any reasonable confidence level.
19. We have performed a sub-sample analysis to evaluate the marginal contribution of a 10 per cent allocation to managed futures during periods of relatively poor and relatively good performance. A 60-40 portfolio produced a Sharpe ratio of −0.09 and a Calmar ratio of −0.025 during the relatively poor period between 1999 and 2008. During the exceptional period between 2009 and 2014, the benchmark portfolio delivered a Sharpe ratio of 1.12 and a Calmar ratio of 0.77. During the 10-year period between 1999 and 2008, all portfolio construction approaches would have added value as measured in terms of average Sharpe and Calmar ratios with the equal-risk portfolios producing the best results. During the 6-year period between 2009 and 2014, the blended portfolios have approximately the same average Sharpe ratios and slightly better Calmar ratios than the benchmark portfolio with the minimum risk approaches delivering the best results. The sensitivity of relative results to an evaluation window is common. For example, Anderson et al (2012) compare four investment strategies and find that the specific start and end dates of a backtest can have a material impact on the results. This study introduces a methodology that can be used for the evaluation of portfolio construction approaches with real-life constraints and is strengthened by the sub-sample analysis. Results of the sub-sample analysis are available upon request.
20. Since we evaluate the role of luck in active management decisions, we consider that a passive 60-40 portfolio of stocks and bonds that utilizes the S&P 500 Total Return index and the JPM Global Government Bond Index has no luck associated with it.
where $N$ is the number of funds in the portfolio and $w_i$ is the weight of fund $i$.
2. EVA allocation is similar to the EN approach but exposure to each fund is adjusted for the fund's volatility which is estimated using the standard deviation of its in-sample excess returns:
3. Classic RP is the solution to the following optimization problem: such that $\sum_{i=1}^{N} w_i = 1$, $w_i \geq 0$, and $\sigma = \sqrt{w' \Sigma w}$ represents portfolio volatility with $\Sigma$, the sample covariance matrix, calculated using the in-sample excess returns.
4. MV is the solution to the following optimization problem: $\min_{w} \sigma$ such that $\sum_{i=1}^{N} w_i = 1$, $w_i \geq 0$.
5. MDEV is the solution to the following optimization problem: , and $x_j$ are the fund's monthly returns during the $N$-month in-sample period with $j = 1, \ldots, N$.
6. Random portfolio (RANDOM) is used as a benchmark approach to portfolio allocation. First, a random number $x_i$ between 0 and 1 is generated. Then random portfolio weights are normalized by setting $w_i =$

This work is licensed under a Creative Commons Attribution 3.0 Unported License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/
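Three of the simpler weighting schemes described in the appendix can be sketched in Python as follows. The formulas are assumptions inferred from the verbal descriptions rather than the article's own equations: EN is taken to assign equal weights, EVA is taken to weight each fund by the inverse of its in-sample volatility, and RANDOM is taken to divide each random draw by the sum of all draws so that weights sum to one. All function names and inputs are hypothetical.

```python
import random


def equal_notional(n_funds):
    """Assumed EN rule: every fund receives the same weight."""
    return [1.0 / n_funds] * n_funds


def inverse_volatility(vols):
    """Assumed EVA rule: weights proportional to 1 / volatility, normalized to sum to one."""
    inverse = [1.0 / v for v in vols]
    total = sum(inverse)
    return [x / total for x in inverse]


def random_weights(n_funds, rng):
    """Assumed RANDOM rule: uniform draws on [0, 1) normalized to sum to one."""
    draws = [rng.random() for _ in range(n_funds)]
    total = sum(draws)
    return [x / total for x in draws]


# Hypothetical in-sample volatilities of three funds, for illustration only.
vols = [0.10, 0.20, 0.40]
print(equal_notional(3))
print(inverse_volatility(vols))
print(random_weights(3, random.Random(0)))
```

The RP, MV and MDEV approaches require a constrained optimizer and are not covered by this sketch.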