A simulation-based methodology for evaluating hedge fund investments

This paper introduces a large-scale simulation framework for evaluating hedge fund investments subject to the realistic constraints of institutional investors. The method is customizable to the preferences and constraints of individual investors, including investment objectives, performance benchmarks, rebalancing period and the desired number of funds in a portfolio, and it can incorporate a large number of portfolio construction and fund selection approaches. To illustrate the methodology, we apply the framework to a subset of hedge funds in the managed futures space that contains 604 live and 1,323 defunct funds over the period 1993-2014. We then measure the out-of-sample performance of three hypothetical risk-parity portfolios and two hypothetical minimum risk portfolios and their marginal contributions to a typical 60-40 portfolio of stocks and bonds. We find that an investment in managed futures improves an investor's performance regardless of portfolio construction methodology and that equal risk approaches are superior to minimum risk portfolios across all performance metrics considered in the study. Our paper is relevant for institutional investors in that it provides a robust and flexible framework for evaluating hedge fund investments given the specific preferences and constraints of individual investors.

This paper uses a simulation framework similar to that of Molyboga, Baek and Bilson (2015), hereafter MBB, but applies it to the evaluation of portfolio construction approaches subject to real-life constraints and uses additional performance metrics that evaluate the marginal contribution of a hedge fund portfolio to an investor's original portfolio. This study intentionally uses several commonly used measures of performance to illustrate that the framework is not limited to a single measure. The framework is customizable to the preferences and constraints of individual investors regarding rebalancing periods and the desired number of funds in a portfolio, and it can incorporate a large number of portfolio construction and fund selection approaches. The methodology produces implementable results because it explicitly accounts for the hedge fund reporting delay documented in MBB and applies an in-sample/out-of-sample framework that incorporates common investment constraints when creating and rebalancing portfolios. The framework imposes the standard requirements of institutional investors regarding track record length and the amount of assets under management (AUM). It also limits the number and turnover of funds in the portfolio by assuming that the institutional investor selects a discrete number of funds that stay in the portfolio until they no longer satisfy the selection criteria 5. The methodology utilizes a simulation framework to account for a large number of feasible portfolio constituents in each period.
We evaluate out-of-sample performance with several commonly used measures of standalone performance and marginal portfolio contribution 6. Standalone performance measures include annualized return, Sharpe and Calmar ratios, maximum drawdown 7 and the t-statistic of alpha with respect to the Fung-Hsieh (2001) five-factor model. We measure marginal portfolio contribution by evaluating the improvement in Sharpe and Calmar ratios 8 achieved by replacing a modest 10% of the original investor's portfolio with a 10% allocation to a simulated hedge fund portfolio. In this paper, we consider a standard 60-40 portfolio of stocks and bonds as the original portfolio, but the framework is flexible to the choice of investor benchmark.
Standard statistical techniques are inappropriate for the evaluation of out-of-sample performance since simulation results are not independent, driven rather by the overlap in portfolio constituents across simulations. We apply the bootstrapping methodology of Efron (1979) and Efron and Gong (1983) to estimate the sampling properties of the test results and draw statistical inferences about the relative performance of portfolio methodologies. Opdyke (2007) introduces an analytic formula for the asymptotic distribution of Sharpe ratios under very general conditions that include non-normal distributions, time-varying volatility, and serial correlations. This approach is particularly powerful when applied to a single return series such as in asset allocation studies that use single indices for each asset class. For example, studies by Kat (2004), Lintner (1996), Abrams, Bhaduri and Flores (2009) and Chen, O'Neill and Zhu (2005) that demonstrate positive contribution of managed futures to traditional portfolios can directly benefit from utilizing the methodology introduced in Opdyke (2007). By contrast, the simulation methodology of this paper produces many time series, and we select a bootstrapping approach because it can be applied to any performance measure while accounting for lack of independence in simulation results.
5 Fund selection criteria can incorporate performance-based ranking as in Molyboga, Baek and Bilson (2015).
6 The framework is flexible and can incorporate customized performance measures selected by the investor. While the Fung-Hsieh (2001) five-factor model is relevant for managed futures, the Fung-Hsieh eight-factor model can be more appropriate for other types of hedge funds. MBB evaluate performance using second order stochastic dominance, which is particularly relevant because investors are often unaware of their own utility functions as reported in Elton and Gruber (1987). Levy and Sarnat (1970) and Fischmar and Peters (2006) suggest using stochastic dominance as an alternative to mean-variance analysis.
7 See Chekhlov, Uryasev and Zabarankin (2005) for a formal definition of the maximum drawdown. It is typically defined as the largest peak-to-valley loss and represents a risk measure that is commonly used by practitioners. Calmar ratio is defined as the ratio of annualized excess return to the maximum drawdown.
8 Though in this paper marginal portfolio contribution is measured using Sharpe and Calmar ratios, in general it should be evaluated relative to the specific investment objectives of the investor. For example, a university endowment may target returns that exceed the university's spending rate over a market cycle. The framework can incorporate investor-specific performance metrics of marginal portfolio contribution.
We apply the framework with 10,000 simulations to a dataset of 604 live and 1,323 defunct Commodity Trading Advisors (CTAs) over the period 1993-2014. Commodity Trading Advisors, a subset of hedge funds that has grown exponentially over the past 35 years 9, are known for their historically strong performance during times of market crisis, notably the Financial Crisis of 2008, and, therefore, serve as a particularly interesting subset of hedge funds from a portfolio diversification perspective. We evaluate several popular risk-based approaches that include two minimum risk and three risk-parity methods. While the approaches we consider are commonly used by both practitioners and academics, they are only a few of the portfolio construction approaches that can be evaluated within the framework. The methodology can be extended to a large number of quantitative portfolio construction approaches.
Our results are striking because an investment in CTAs improves performance regardless of the choice of the portfolio construction approach. For the out-of-sample period between January 1999 and December 2014, a 10% allocation to managed futures improves the Sharpe ratio of the original 60-40 portfolio of stocks and bonds from 0.376 to 0.399-0.416 on average, depending on the portfolio construction methodology employed. Similarly, the Calmar ratio improves from 0.092 to 0.100-0.108 on average. Blended portfolios have higher Sharpe ratios in at least 89% of simulations and higher Calmar ratios in at least 89.5% of simulations. Our findings are consistent with Kat (2004), Lintner (1996), Abrams, Bhaduri and Flores (2009) and Chen, O'Neill and Zhu (2005) that report a positive contribution of managed futures to traditional portfolios.
Minimum risk portfolios perform the worst for all performance metrics. For example, their average Sharpe ratios are between 0.299 and 0.304, significantly lower from both an economic and statistical perspective than the 0.319 average Sharpe ratio of the random portfolios. By contrast, equal risk methodologies deliver superior average Sharpe ratios of 0.342 to 0.362. Our results are consistent with DeMiguel, Garlappi and Uppal (2009), who find that an equal notional allocation (EN), which we consider an equal-risk approach, is superior to a minimum variance allocation, which we consider a minimum risk approach.
We have performed a sub-sample analysis to evaluate the marginal contribution of a 10% allocation to managed futures during periods of relatively poor and relatively good performance. A 60-40 portfolio produced a Sharpe ratio of -0.09 and a Calmar ratio of -0.025 during the relatively poor period between 1999 and 2008. During the exceptional period between 2009 and 2014, the benchmark portfolio delivered a Sharpe ratio of 1.12 and a Calmar ratio of 0.77 10. During the 10-year period between 1999 and 2008, all portfolio construction approaches would have added value as measured in terms of average Sharpe and Calmar ratios with the equal-risk portfolios producing the best results. During the 6-year period between 2009 and 2014, the blended portfolios have approximately the same average Sharpe ratios and slightly better Calmar ratios than the benchmark portfolio with the minimum risk approaches delivering the best results. The sensitivity of relative results to an evaluation window is common. For example, Anderson, Bianchi and Goldberg (2012) compare four investment strategies and find that the specific start and end dates of a backtest can have a material impact on the results. This study introduces a methodology that can be used for the evaluation of portfolio construction approaches with real-life constraints and is strengthened by the sub-sample analysis.
10 The results of the sub-sample analysis are available upon request.
Our findings and methodology are relevant for institutional investors who might consider investing, or who are already currently invested, in hedge funds and managed futures because the framework can be customized to the specific preferences and constraints of investors to maximize the benefits of hedge fund portfolios.
The remainder of the paper is organized as follows: Section I describes the data and accounts for biases; Section II discusses the risk-based approaches and introduces the large-scale simulation framework; Section III presents empirical out-of-sample results; and Section IV concludes.

I. Data
There are several commonly used CTA databases: BarclayHedge; CISDM (formerly the MAR database); Lipper (formerly TASS); and Eurekahedge. Joenvaara, Kosowski and Tolonen (2012) perform a comprehensive study of publicly available databases of hedge fund returns and report that BarclayHedge provides the highest quality data out of the databases considered. Moreover, the BarclayHedge database is the largest publicly available database of Commodity Trading Advisors with 1,013 active and 3,660 defunct funds over the period from December 1993 to December 2014. Therefore, we use BarclayHedge for this study as it is the most comprehensive and highest quality publicly available database of CTA returns.
We perform a number of filtering steps to ensure data quality and limit the scope of the study to the funds that would be appropriate for institutional investors who are interested in making direct investments. We explicitly account for the survivorship, backfill, incubation and liquidation biases that are common within CTA and hedge fund databases 11. We include the graveyard database that contains defunct funds to account for the survivorship bias. The backfill and incubation biases arise due to the voluntary nature of self-reporting 12. We use a combination of two approaches to mitigate these biases. The first methodology, suggested by Fama and French (2010), limits the tests to those funds that managed at least US $10 million in AUM normalized to December 2014 values. Once a fund reaches the AUM minimum, it is included in all subsequent tests to avoid creating selection bias. Unfortunately, many CTAs, including very successful and established ones, originally reported only net returns for an extended period of time prior to their initial inclusion of AUM data. Using the Fama and French (2010) methodology exclusively would completely eliminate large portions of valuable data for such funds. To include this data, we apply the technique suggested by Kosowski, Naik and Teo (2007), which eliminates only the first 24 months of data for such funds. We use the liquidation bias estimate of 1% as suggested in Ackermann, McEnally and Ravenscraft (1999). After accounting for the biases, our dataset includes 604 live and 1,323 defunct funds for the period between December 1995 and December 2014.
11 For details, see Appendix A: Data cleaning.
12 Typically funds go through an incubation period during which they build a track record using proprietary capital. Fund managers choose to start reporting to a CTA database to raise capital from outside investors only if the track record is attractive and they are allowed to "backfill" the returns generated prior to their inclusion in the database. Since funds with poor performance are unlikely to report returns to the database, incubation/backfill bias results.
We use the Fung-Hsieh five factor model of primitive trend following systems, introduced in Fung and Hsieh (2001) Reserve System serves as a proxy for the risk-free rate. Table I reports summary statistics and tests of normality, heteroscedasticity and serial correlations in CTA returns by strategy and current status.
<Put Table I here>
Anson (2011) suggests that the 60-40 portfolio of stocks and bonds represents a typical starting point for a US institutional investor. In this paper, this blend is constructed using the S&P 500 Total Return index and the JPM Global Government Bond Index. Table II reports
<Put Figure 1 here>
Although the 60-40 portfolio of stocks and bonds has been used extensively in the literature as a benchmark portfolio, the framework is flexible and can incorporate any investor-specific portfolio as a benchmark.
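As a rough illustration of how a benchmark blend and the performance measures used in this paper might be computed, the sketch below builds a 60-40 series from monthly excess returns and replaces 10% of it with a CTA portfolio. The function names, the monthly rebalancing of the blend, the geometric annualization and the sample numbers are all assumptions made for illustration; the paper does not specify these implementation details.

```python
import math

def sharpe_ratio(excess_returns, periods_per_year=12):
    """Annualized Sharpe ratio from periodic excess returns (sample std)."""
    n = len(excess_returns)
    mean = sum(excess_returns) / n
    variance = sum((r - mean) ** 2 for r in excess_returns) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-valley loss of the compounded path, as a positive fraction."""
    wealth = peak = 1.0
    worst = 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst

def calmar_ratio(excess_returns, periods_per_year=12):
    """Annualized excess return divided by maximum drawdown.
    Assumption: both are computed from the same excess-return series.
    Raises ZeroDivisionError if the series never draws down."""
    growth = 1.0
    for r in excess_returns:
        growth *= 1.0 + r
    annualized = growth ** (periods_per_year / len(excess_returns)) - 1.0
    return annualized / max_drawdown(excess_returns)

def blend(first, second, weight_on_second):
    """Period-by-period mix of two return series (rebalanced every period)."""
    return [(1.0 - weight_on_second) * a + weight_on_second * b
            for a, b in zip(first, second)]

# Made-up monthly excess returns, for illustration only.
stocks = [0.020, -0.030, 0.015, 0.010, -0.020, 0.025]
bonds = [0.004, 0.006, -0.002, 0.003, 0.005, -0.001]
cta = [-0.010, 0.030, 0.005, -0.005, 0.020, 0.000]

benchmark = blend(stocks, bonds, 0.40)   # 60% stocks, 40% bonds
blended = blend(benchmark, cta, 0.10)    # replace 10% with the CTA portfolio
sharpe_gain = sharpe_ratio(blended) - sharpe_ratio(benchmark)
calmar_gain = calmar_ratio(blended) - calmar_ratio(benchmark)
```

An investor-specific benchmark can be substituted by passing a different return series in place of `benchmark`.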

II. Methodology
In this section, we define the risk-based approaches considered in this study. Then we introduce a large-scale simulation framework with real-life constraints used to generate out-of-sample portfolio returns. Finally, we describe the performance metrics used to compare out-of-sample results.

A. Review of risk-based approaches
In this paper, we evaluate two minimum risk and three equal-risk (or risk-parity) approaches 13 .
While the approaches we consider are commonly used by practitioners and academics, they are used merely as examples of portfolio construction approaches that can be evaluated within the framework. The methodology can be extended to a large number of quantitative portfolio construction approaches. Minimum risk portfolios include the minimum variance (MV) approach with non-negative constraints documented in Jagannathan and Ma (2003) and a minimum semi-standard deviation (MDEV) approach that is similar to the minimum variance approach but only considers negative returns. Equal-risk, or risk-parity, approaches include an equal notional (EN) approach, which is a naïve diversification 1/N method praised in DeMiguel, Garlappi and Uppal (2009), an equal volatility-adjusted (EVA) approach, and the classic risk-parity (RP) approach extensively discussed in Maillard, Roncalli and Teiletche (2010), Clarke, Silva and Thorley (2013) and Qian (2013). We apply a random portfolio selection approach (Random) that serves as a benchmark in evaluating the risk parity approaches. The approaches are evaluated using a large-scale simulation framework with real-life constraints.
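The closed-form allocation rules among these approaches can be sketched in a few lines. The example below covers EN, EVA and Random as they are described in Appendix B; the function names are hypothetical, and the inverse-volatility normalization used for EVA is an assumption inferred from the verbal description. MV, MDEV and RP require a numerical optimizer and are not shown.

```python
import math
import random

def sample_std(values):
    """Sample standard deviation of a list of returns."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def equal_notional(n_funds):
    """EN: every fund receives weight 1/N."""
    return [1.0 / n_funds] * n_funds

def equal_volatility_adjusted(in_sample_excess_returns):
    """EVA: weights proportional to the inverse of each fund's in-sample
    volatility (assumed normalization), scaled to sum to one.
    `in_sample_excess_returns` is a list with one return list per fund."""
    inverse_vols = [1.0 / sample_std(series) for series in in_sample_excess_returns]
    total = sum(inverse_vols)
    return [v / total for v in inverse_vols]

def random_weights(n_funds, rng):
    """Random: draw a uniform number in [0, 1) per fund, then normalize."""
    draws = [rng.random() for _ in range(n_funds)]
    total = sum(draws)
    return [d / total for d in draws]

# Illustration with made-up returns: the second fund is twice as volatile,
# so EVA gives it half the weight of the first.
fund_a = [0.01, -0.01, 0.01, -0.01]
fund_b = [0.02, -0.02, 0.02, -0.02]
eva = equal_volatility_adjusted([fund_a, fund_b])
rnd = random_weights(5, random.Random(1))
```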

B. Large scale simulation framework
In this paper, we utilize a modification of the large-scale simulation framework with real life constraints introduced in MBB. MBB apply the framework to evaluate persistence in hedge fund managers' performance and compare equally-weighted portfolios of funds that rank in the top quintile based on the t-statistic of alpha with respect to a CTA benchmark (restrictive fund selection) against those of all available funds (random fund selection). By contrast, this paper does not impose any ranking but rather focuses on the impact of choice of portfolio construction methodology on performance. The out-of-sample period is between January 1999 and December 2014, the longest out-of-sample backtesting period in CTA empirical research.
The framework uses 10,000 simulations and a lag of one month to account for the delay in the performance reporting of CTAs 14 . Below we describe a single run of the simulation framework and then show how simulation results are evaluated.

i) A single run of the simulation framework
The in-sample/out-of-sample framework mimics the actions of an institutional investor who makes allocation decisions at the end of each month.
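One way to picture a single month-end decision is sketched below. It applies the constraints named in the introduction (track record length, AUM, a one-month reporting lag, a fixed number of funds that remain in the portfolio while they stay eligible) and fills vacancies at random, since this study imposes no ranking. The data layout, function names and every threshold passed in are hypothetical; the paper's actual parameter values are not reproduced here.

```python
import random

def eligible_funds(funds, decision_month, min_track_months, min_aum, lag=1):
    """Return ids of funds that pass the constraints using only data that
    has been reported by the decision date.

    `funds` maps fund id -> {"start": first month index with a return,
                             "end": last month index with a return,
                             "aum": {month index: assets under management}}.
    """
    last_known = decision_month - lag  # most recent month already reported
    ids = []
    for fund_id, info in funds.items():
        still_reporting = info["end"] >= last_known
        track_length = last_known - info["start"] + 1
        aum = info["aum"].get(last_known, 0.0)
        if still_reporting and track_length >= min_track_months and aum >= min_aum:
            ids.append(fund_id)
    return sorted(ids)

def rebalance(current, funds, decision_month, n_funds,
              min_track_months, min_aum, rng, lag=1):
    """Keep current holdings that remain eligible and fill any vacancies
    with randomly chosen eligible funds."""
    pool = eligible_funds(funds, decision_month, min_track_months, min_aum, lag)
    kept = [f for f in current if f in pool]
    candidates = [f for f in pool if f not in kept]
    needed = max(0, n_funds - len(kept))
    added = rng.sample(candidates, min(needed, len(candidates)))
    return kept + added
```

Repeating `rebalance` month by month with a different random seed per simulation would yield one out-of-sample holdings path per run, to which any of the allocation rules can then be applied.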

III. Empirical out-of-sample results.
17 The framework is flexible in comparing any two approaches to each other but requires performing additional bootstrapping simulations based on an investor's particular areas of interest.
In this section, we present information about the dataset used in the simulation and out-of-sample results for the period between January 1999 and December 2014 generated by the large-scale simulation framework.

A. Analysis of out-of-sample performance of CTA portfolios as standalone investments
We analyze distributions of out-of-sample returns over the complete data period using means and medians of several performance metrics. Since simulations are not independent, we use a bootstrapping methodology to draw statistical inferences about the relative performance of portfolio construction approaches.

i) Distributions of out-of-sample performance
While Table IV presents mean and median values of several performance metrics, a complete evaluation of the portfolio construction methodologies should also consider distributions of out-of-sample performance. Figure 2 shows the distributions of Sharpe ratios generated by the large-scale simulation framework for each portfolio methodology.

<Figure 2>
Each distribution is visualized using a standard box and whisker plot with the box containing the middle two quartiles, the thick line inside the box representing the median of the distribution and the whiskers displayed at the top and bottom 5 percent of the distribution. The breadth of each distribution demonstrates the key benefit of using a large-scale simulation framework.
Failing to account for the role of chance and evaluating portfolio construction techniques using a single stream, which represents a single draw of the distribution, can mislead investors about the relative performance of portfolio management techniques. Since the distributions are so wide, it might seem impossible to compare them to each other. Fortunately, this is not a new problem in quantitative finance and decision theory, where expected utility and stochastic dominance methodologies are applied to compare distributions. The framework is flexible and can employ utility functions and stochastic dominance to evaluate results; however, this paper only considers means and medians for the sake of brevity 19. The minimum risk approaches, MV and MDEV, have the lowest median Sharpe ratios and exhibit relatively large left tails. The equal risk approaches seem to perform better on average than the random portfolio methodology, but it is difficult to determine whether that relative performance is statistically significant, particularly since the standard statistical techniques are inappropriate due to dependence across simulation results. Therefore, we apply a bootstrapping procedure to estimate sampling distributions of the performance measures.

<Figure 3>
The minimum risk approaches, MV and MDEV, underperform on average whereas the equal risk approaches, EN, EVA and RP, seem to outperform the RANDOM portfolio.
20 The p-value is estimated by calculating the percentage of bootstrapped simulations of RANDOM portfolios that outperform the other portfolio methodologies for a given performance metric. For example, the p-value of 16% for EN in the Return category suggests that 16% of bootstrapped simulations have a mean return that is higher than that of EN. Therefore, we fail to reject the hypothesis of RANDOM portfolios having a mean return that is lower than that of EN. That intuitively makes sense because random portfolios should have the same return as equal portfolios on average. We compare RANDOM portfolios to bootstrapped RANDOM portfolios for robustness. The p-values indicate that we cannot reject the hypothesis that the RANDOM portfolio is better or worse than the bootstrap RANDOM portfolios at any reasonable confidence level.
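The p-value calculation described above can be sketched as follows. The example resamples the per-simulation metric values of the RANDOM portfolios with replacement, in the spirit of Efron (1979), and reports the share of bootstrap means that exceed the mean of another methodology. The function names, the number of bootstrap draws and the choice of the individual simulation as the resampling unit are assumptions; the paper does not spell out these details.

```python
import random

def bootstrap_means(values, n_boot, rng):
    """Means of `n_boot` resamples (with replacement) of `values`."""
    n = len(values)
    means = []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return means

def bootstrap_p_value(random_metric, approach_mean, n_boot=1000, rng=None):
    """Share of bootstrapped RANDOM means that exceed `approach_mean`.

    `random_metric` holds one value of a performance metric (for example
    the Sharpe ratio) per RANDOM simulation; `approach_mean` is the mean
    of the same metric for the methodology under comparison.
    """
    rng = rng or random.Random(0)
    means = bootstrap_means(random_metric, n_boot, rng)
    return sum(m > approach_mean for m in means) / n_boot
```

A small p-value indicates that RANDOM portfolios rarely match the methodology's mean, whereas a value near one indicates that they almost always exceed it.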
We utilize the Fung-Hsieh factor model introduced in Fung and Hsieh (2001) to account for the systematic risk exposures of hypothetical portfolios that might drive the above results. Table V

Analysis of marginal performance contribution is important, particularly when an investor already has exposure to a large number of systematic sources of return in his or her well-diversified portfolio. In that situation, strategies that harvest the same sources of return can look very attractive as standalone investments but do not improve the risk-adjusted return of the investor's portfolio. The framework employed here is flexible and can utilize an investor's existing portfolio as a benchmark against which the marginal contribution of hedge fund portfolios can be measured.
We have performed a sub-sample analysis to evaluate the marginal contribution of a 10% allocation to managed futures during periods of relatively poor and relatively good performance. A 60-40 portfolio produced a Sharpe ratio of -0.09 and a Calmar ratio of -0.025 during the relatively poor period between 1999 and 2008. During the exceptional period between 2009 and 2014, the benchmark portfolio delivered a Sharpe ratio of 1.12 and a Calmar ratio of 0.77 21. During the 10-year period between 1999 and 2008, all portfolio construction approaches would have added value as measured in terms of average Sharpe and Calmar ratios with the equal-risk portfolios producing the best results. During the 6-year period between 2009 and 2014, the blended portfolios have approximately the same average Sharpe ratios and slightly better Calmar ratios than the benchmark portfolio with the minimum risk approaches delivering the best results. The sensitivity of relative results to an evaluation window is common. For example, Anderson, Bianchi and Goldberg (2012) compare four investment strategies and find that the specific start and end dates of a backtest can have a material impact on the results. This study introduces a methodology that can be used for the evaluation of portfolio construction approaches with real-life constraints and is strengthened by the sub-sample analysis.
21 Results of the sub-sample analysis are available upon request.

ii) The impact of the size of the allocation to CTA portfolios on the performance of blended portfolios
By evaluating the impact of allocation weights on performance, the framework can be used to optimally allocate to hedge fund portfolios given an investor's specific preferences and constraints. This study considers the performance of blended portfolios that have allocations between 5% and 60% to CTA investments. Table VII reports the performance of blended portfolios stated in terms of Sharpe ratio. Panel A reports the percentage of simulations that improves the Sharpe ratio over the original 60-40 portfolio of stocks and bonds. Panel B reports mean Sharpe ratios and Panel C reports median Sharpe ratios of the blended portfolios.
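The statistic reported in Panel A can be illustrated with a short sketch that, for each allocation weight, counts the share of simulated CTA portfolios whose blend with the benchmark has a higher Sharpe ratio than the benchmark alone. The function names, the weight grid and the sample numbers are assumptions for illustration, and the inputs are taken to be monthly excess returns.

```python
import math

def sharpe_ratio(excess_returns, periods_per_year=12):
    """Annualized Sharpe ratio from periodic excess returns (sample std)."""
    n = len(excess_returns)
    mean = sum(excess_returns) / n
    variance = sum((r - mean) ** 2 for r in excess_returns) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)

def share_improving(benchmark, simulated_portfolios, weight):
    """Fraction of simulations in which replacing `weight` of the benchmark
    with the simulated CTA portfolio raises the Sharpe ratio."""
    base = sharpe_ratio(benchmark)
    wins = 0
    for cta in simulated_portfolios:
        blended = [(1.0 - weight) * b + weight * c
                   for b, c in zip(benchmark, cta)]
        if sharpe_ratio(blended) > base:
            wins += 1
    return wins / len(simulated_portfolios)

def allocation_sweep(benchmark, simulated_portfolios,
                     weights=(0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60)):
    """Share of improving simulations for each allocation weight.
    The default grid is an assumed discretization of the 5%-60% range."""
    return {w: share_improving(benchmark, simulated_portfolios, w)
            for w in weights}

# Made-up excess returns: one diversifying and one drag-inducing simulation.
benchmark = [0.02, -0.01, 0.03, -0.02]
diversifier = [0.00, 0.03, 0.00, 0.03]
drag = [0.02, -0.06, 0.03, -0.07]
result = allocation_sweep(benchmark, [diversifier, drag], weights=(0.05, 0.10))
```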

<Table VII>
Average Sharpe ratios increase until the allocation to CTA portfolios reaches 40-50% and decline thereafter. However, the improvement that comes with a higher allocation to CTA portfolios also comes with higher risk. While a minimum variance portfolio improves the Sharpe ratio of the investor portfolio in 89.6% of scenarios with a 5% allocation to CTA portfolios, that number declines to 74% at a 60% allocation level. Similarly, the percentage of positive contribution scenarios declines from 98.7% to 81.6% for the equal notional approach as the allocation to CTA investments grows from 5% to 60%. Figure 5 shows the distribution of the out-of-sample Sharpe ratios of the blended portfolios.

<Figure 5>
It is important to note that the framework implicitly assumes that the performance of the investor's original portfolio can be expressed by a single time series or a single outcome, completely ignoring the role of luck due to active management decisions in the investor's portfolio 22 . A joint simulation of the investor's portfolio management techniques applied to the original portfolio constituents and the hedge fund portfolios has the potential to better account for luck in both types of investments but requires additional assumptions that are outside the scope of this paper. Table VIII reports the performance of the blended portfolios stated in terms of Calmar ratio.
Panel A reports the percentage of simulations that improve the Calmar ratio over the original 60-40 portfolio of stocks and bonds. Panel B reports the mean Calmar ratios and Panel C reports the median Calmar ratios of the blended portfolios.

<Table VIII>
The average Calmar ratio grows monotonically with additional allocation to CTA investments without reaching an intermediate peak as in the case of Sharpe ratios. However, the improvement comes with higher risk, as indicated by the declining percentages of scenarios with superior Calmar ratios. Figure 6 shows the distribution of the out-of-sample Calmar ratios of the blended portfolios.

<Figure 6>
The optimal allocation choice depends on the specific preferences of individual investors and their aversion to risk. Investors who value average performance will tend to pay more attention to the means and medians of the performance distributions of the blended portfolios.
By contrast, investors who are very risk averse will put more weight on the characteristics of the left tails.

IV. Concluding remarks.
This paper introduces a quantitative large-scale simulation framework for the robust and reliable evaluation of hedge fund investments with real life constraints by institutional investors. This methodology is implementable and incorporates common investment constraints when creating and rebalancing portfolios. The framework is customizable to the preferences and constraints of individual investors, investment objectives, rebalancing periods and the desired number of funds in a portfolio and can include a large number of portfolio construction approaches. Thus, the methodology can benefit portfolio managers, investment officers, board members and consultants who make hedge fund investment decisions.
As an illustration of the framework, we applied it to a subset of hedge funds in managed futures revealing a strikingly significant portfolio contribution of CTA investments to a typical 60-40 portfolio of stocks and bonds over the period from 1999 to 2014, though this contribution is much less significant during the exceptional period between 2009 and 2014, when the benchmark portfolio delivered a Sharpe ratio of 1.12. This finding is robust across a large set of parameters and all portfolio construction methodologies considered in the study.
The empirical results suggest that equal-risk portfolios of CTAs outperform minimum risk approaches out-of-sample whether as standalone investments or as diversifiers to the investor's benchmark portfolio.
While the empirical findings can immediately benefit institutional investors who seek to enhance performance through better diversification and portfolio construction, this analysis is merely one application of the flexible large-scale simulation methodology that can be utilized more broadly to examine a large number of portfolio management techniques subject to real life constraints.

Appendix A. Data Cleaning.
After excluding all funds from the BarclayHedge database that are multi-advisors or benchmarks, we select only those funds that report returns net of all fees for the period between December 1993 and December 2014. Our study considers 4,673 funds with 1,013 active and 3,660 defunct funds. We performed a few additional data filtering procedures to improve data quality and make the results practical for institutional investors. First, we eliminated null returns at the end of the track records of defunct funds. Then we excluded managers with fewer than 24 months of data, which limited the data set to 3,223 funds. Additionally, we eliminated all funds with maximum assets under management of less than US $10 million, which further limited the data set to 1,937 funds. Finally, we excluded funds with one or more monthly returns in excess of 100%, which resulted in the final pool of 1,927 funds of which 604 were live and 1,323 were defunct.
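The filtering sequence above might be expressed as in the following sketch. The record layout and function names are hypothetical, and treating a "null" return as either a missing value or an exact zero is an assumption. The normalization of AUM to December 2014 values is not shown.

```python
def strip_trailing_nulls(returns):
    """Drop missing (None) or zero returns at the end of a track record."""
    end = len(returns)
    while end > 0 and (returns[end - 1] is None or returns[end - 1] == 0.0):
        end -= 1
    return returns[:end]

def clean(funds, min_months=24, min_peak_aum=10e6, max_return=1.0):
    """Apply the Appendix A filters in order and return the surviving funds.

    `funds` maps fund id -> {"returns": monthly returns as decimals,
                             "aum": list of reported AUM values in US dollars,
                             "defunct": True for graveyard funds}.
    """
    kept = {}
    for fund_id, info in funds.items():
        returns = list(info["returns"])
        if info["defunct"]:
            returns = strip_trailing_nulls(returns)
        if len(returns) < min_months:
            continue  # track record too short
        if max(info["aum"], default=0.0) < min_peak_aum:
            continue  # never reached the AUM minimum
        if any(r is not None and r > max_return for r in returns):
            continue  # at least one monthly return above 100%
        kept[fund_id] = {**info, "returns": returns}
    return kept
```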

Appendix B. Risk-based allocation approaches
In this study we consider three equal-risk and two minimum risk approaches. They include equal notional (EN), equal volatility-adjusted (EVA), classic risk-parity (RP), minimum variance (MV) and minimum downside deviation (MDEV) methodologies.
1) Equal notional (EN) allocation is a simple equal weight (or naïve diversification) approach:
$$w_i = \frac{1}{N},$$
where $N$ is the number of funds in the portfolio and $w_i$ is the weight of fund $i$.
2) Equal volatility-adjusted (EVA) allocation is similar to the equal notional approach but exposure to each fund is adjusted for the fund's volatility, which is estimated using the standard deviation of its in-sample excess returns:
$$w_i = \frac{1/\sigma_i}{\sum_{j=1}^{N} 1/\sigma_j},$$
where $\sigma_i$ is the standard deviation of the in-sample excess returns of fund $i$.
3) Classic risk-parity (RP) is the solution to the following optimization problem:

First, a random number $u_i$ between 0 and 1 is generated for each fund $i$. Then random portfolio weights are normalized by setting
$$w_i = \frac{u_i}{\sum_{j=1}^{N} u_j}.$$

This table reports the results of a marginal contribution analysis. The original investor portfolio is represented by a 60-40 portfolio of stocks and bonds. It has delivered a Sharpe ratio of 0.376 and a Calmar ratio of 0.092 over the period 1999-2014. The first column presents the Sharpe ratio of a blended portfolio that replaces 10% of the allocation to the original portfolio with 10% of the CTA portfolios constructed in the simulation framework. The second column reports the Calmar ratio of the blended portfolios. The third and fourth columns report the percentage of time the blended portfolios have higher Sharpe and Calmar ratios than those of the original portfolio. Panel A reports mean values, Panel B displays median values.