## Abstract

In this article, we investigate the impact of truncating training data when fitting regression trees. We argue that training times can be curtailed by reducing the training sample without any loss in out-of-sample accuracy, as long as the prediction model is trained on the *tails* of the dependent variable, that is, when ‘average’ observations have been discarded from the training sample. Filtering instances affects which features are selected to yield the splits and can help reduce overfitting by favoring predictors with monotonic impacts on the dependent variable. We test this technique in an out-of-sample portfolio-selection exercise, which confirms its benefits. The implications of our results are decisive for time-consuming tasks such as hyperparameter tuning and validation.
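To fix ideas, the tail-filtering procedure described in the abstract can be sketched as follows. This is our own illustration with scikit-learn on synthetic data, not the authors' code; the 20%/20% tail cutoffs and all variable names are assumptions for the example.

```python
# Sketch of training a regression tree on the tails of the dependent
# variable only (illustrative; cutoffs and data are synthetic).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))                 # 4 candidate features
y = X[:, 0] + 0.1 * rng.normal(size=5000)      # feature 0 drives y monotonically

# Extreme filter: discard 'average' observations, keep the two 20% tails,
# i.e., roughly 40% of the original training sample.
lo, hi = np.quantile(y, [0.2, 0.8])
mask = (y <= lo) | (y >= hi)

tree_full = DecisionTreeRegressor(max_depth=4).fit(X, y)
tree_tail = DecisionTreeRegressor(max_depth=4).fit(X[mask], y[mask])
```

Fitting on the filtered sample cuts the training set by roughly 60% while, in a setting like this one, the tree still splits first on the informative, monotonic feature.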


## Notes

One reason why monotonic patterns matter is that they are robust; an investor can therefore confidently rely on them to build portfolios. In practice, however, monotonicity tests are seldom carried out, even though they exist: Romano and Wolf (2013), for instance, detail one such procedure. For simplicity, so-called factors are often determined by looking at differences related to extreme values of features. The whole distribution of features is rarely exploited, for at least two reasons. First, it is less straightforward and harder to assess. Second, it is likely to reduce the significance of factors: because of the publication bias towards positive results, academics seek statistical significance and thus ignore tests that are likely to curtail it.
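As a toy illustration of such a check, the snippet below runs a naive decile-sort comparison on synthetic data of our own. It is a crude stand-in, not the formal Romano–Wolf (2013) procedure, which properly accounts for multiple testing.

```python
# Naive monotonicity check via decile sorts (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
feature = rng.uniform(size=10_000)
returns = 0.05 * feature + 0.02 * rng.normal(size=10_000)  # monotonic link

# Sort observations into ten feature deciles and average returns per bucket.
deciles = np.quantile(feature, np.linspace(0, 1, 11))
bucket_means = [returns[(feature >= lo) & (feature < hi)].mean()
                for lo, hi in zip(deciles[:-1], deciles[1:])]

# Crude check: are the bucket means increasing across deciles?
is_monotonic = all(a <= b for a, b in zip(bucket_means, bucket_means[1:]))
```

A practitioner's shortcut is to look only at the difference between the top and bottom buckets; the check above uses the whole distribution, which, as noted in the text, is a stricter requirement.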

The material can be accessed here: www.gcoqueret.com/tot.html.

In this particular example, allocating to highly volatile stocks resembles a lottery bet and is, as such, very risky.

The multivariate analysis involving *all* features (and how they compete) will come later in the paper, from Sect. 5.2 onwards.

It is generally admitted that monotonic transformations of predictive variables do not substantially alter the structure of the tree or, more generally, the predictions of other ML engines. We resort to this technique for analytical tractability, but we highlight in all transparency that it can have non-negligible effects, as is shown in Galili and Meilijson (2016). Furthermore, we recall that normalisations and scaling procedures are common practice in machine learning: it is known that neural networks perform better when all variables share the same scale, either because they lie inside a bounded interval (usually \([-1,1]\) or [0, 1]) or because they are standardized to unit variance. Finally, it can also be argued that the prior transformation of data points could be directly embedded in the definition of the generator *g*. In any case, such scaling procedures are commonplace in recent asset pricing contributions: we refer to Kelly et al. (2019) and Koijen and Yogo (2019), to name but a few.

This is justified by Property 1 of Corollary 1.

In the general case, \(g(x)=a_2x^2+a_1x+a_0\) with \(a_2\ne 0\), the function *h* in (7) is a polynomial of degree four and thus has four roots: 0, 1 and

$$\begin{aligned} x^\pm =\frac{a_2-3a_1\pm \sqrt{17a_2^2+18a_1a_2+9a_1^2}}{8a_2}, \end{aligned}$$

(8)

where the term inside the square root is positive over the whole \(\mathbb {R}^2\) plane. Notably, \(a_0\) plays no role in the splitting location, which makes sense. For the sake of simplicity, we choose to reduce the study to a one-parameter family.

The full details of the derivation are available upon request.

More precisely, the calculation is that of the loess (locally estimated scatterplot smoothing) function.

This interval leads to retaining between 30 and 50% of the original sample. We use these figures in practice because keeping between one third and one half of the original data is a reasonable compromise.

For a review on asset pricing anomaly detection, we refer to Goyal (2012).

Trees usually have depths between 3 and 6.

It is common practice in the money management industry to fix the size of the portfolio policies.

The data and the code can be accessed here: www.gcoqueret.com/tot.html

In linear models, the problem of regressing monthly returns on autocorrelated predictors is well documented since the seminal work of Stambaugh (1999). To circumvent this problem, we resort to a dependent variable that behaves like the predictors. We hope to unveil patterns that will be long-lasting, i.e., that will hold out-of-sample.

An important alternative is the *random* filter. It is, however, much more expensive empirically, because the performance depends strongly on the random seed: robust results require at least 100 iterations of each backtest. In partly unreported results, we document that the random filter is inferior to the extreme filter, especially when the filter is intense.

We saw, in the second case of the previous proof, that if the first term of (30) is zero, then *V* is maximized.
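The contrast between the two filters can be sketched as follows. This is a toy setup of ours, not the paper's backtest: a random subsample makes the fitted tree (and its predictions) depend on the seed, whereas the tail filter is deterministic for a given sample.

```python
# Random filter vs. extreme filter: seed dependence (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(4000, 3))
y = X[:, 0] ** 3 + 0.5 * rng.normal(size=4000)
X_test = rng.normal(size=(1000, 3))

def fit_predict(idx):
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    return tree.predict(X_test)

# Extreme filter: a single, deterministic training set (the 40% tails of y).
lo, hi = np.quantile(y, [0.2, 0.8])
pred_tail = fit_predict(np.flatnonzero((y <= lo) | (y >= hi)))

# Random filter: same sample size, but predictions vary with the seed,
# so robust conclusions require averaging over many draws.
preds = [fit_predict(np.random.default_rng(s).choice(4000, 1600, replace=False))
         for s in range(10)]
seed_dispersion = np.std(np.stack(preds), axis=0).mean()
```

The positive `seed_dispersion` is precisely why the random filter requires many backtest iterations (at least 100 in the paper's recommendation) before its results can be trusted.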

## References

Ali, Ö. G., & Yaman, K. (2013). Selecting rows and columns for training support vector regression models with large retail datasets. *European Journal of Operational Research*, *226*(3), 471–480.

Ammann, M., Coqueret, G., & Schade, J.-P. (2016). Characteristics-based portfolio choice with leverage constraints. *Journal of Banking & Finance*, *70*, 23–37.

Asness, C., Frazzini, A., Israel, R., Moskowitz, T. J., & Pedersen, L. H. (2018). Size matters, if you control your junk. *Journal of Financial Economics*, *129*(3), 479–509.

Asness, C. S., Frazzini, A., Israel, R., & Moskowitz, T. J. (2014a). Fact, fiction and momentum investing. *Journal of Portfolio Management*, *40*(5), 75–92.

Asness, C. S., Frazzini, A., & Pedersen, L. H. (2014b). Quality minus junk. *Review of Accounting Studies*, *24*(1), 1–79.

Baker, M., Bradley, B., & Taliaferro, R. (2014). The low-risk anomaly: A decomposition into micro and macro effects. *Financial Analysts Journal*, *70*(2), 43–58.

Ballings, M., Van den Poel, D., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. *Expert Systems with Applications*, *42*(20), 7046–7056.

Barroso, P., & Santa-Clara, P. (2015). Momentum has its moments. *Journal of Financial Economics*, *116*(1), 111–120.

Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. *Machine Learning*, *106*(7), 1039–1082.

Brandt, M. W., Santa-Clara, P., & Valkanov, R. (2009). Parametric portfolio policies: Exploiting characteristics in the cross-section of equity returns. *Review of Financial Studies*, *22*(9), 3411–3447.

Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. (1984). *Classification and regression trees*. London: Chapman and Hall.

Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. *Computers & Electrical Engineering*, *40*(1), 16–28.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining* (pp. 785–794). ACM.

Chou, P. A. (1991). Optimal partitioning for classification and regression trees. *IEEE Transactions on Pattern Analysis & Machine Intelligence*, *4*, 340–354.

Daniel, K., & Moskowitz, T. J. (2016). Momentum crashes. *Journal of Financial Economics*, *122*(2), 221–247.

DeMiguel, V., Garlappi, L., & Uppal, R. (2007). Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy? *Review of Financial Studies*, *22*(5), 1915–1953.

Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. *IEEE Transactions on Pattern Analysis & Machine Intelligence*, *19*(5), 476–491.

Fama, E. F., & French, K. R. (1992). The cross-section of expected stock returns. *Journal of Finance*, *47*(2), 427–465.

Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. *Journal of Financial Economics*, *116*(1), 1–22.

Fu, X., Du, J., Guo, Y., Liu, M., Dong, T., & Duan, X. (2018). A machine learning framework for stock selection. Preprint arXiv:1806.01743.

Galili, T., & Meilijson, I. (2016). Splitting matters: How monotone transformation of predictor variables may improve the predictions of decision tree models. Preprint arXiv:1611.04561.

Goyal, A. (2012). Empirical cross-sectional asset pricing: A survey. *Financial Markets and Portfolio Management*, *26*(1), 3–38.

Green, J., Hand, J. R., & Zhang, X. F. (2013). The supraview of return predictive signals. *Review of Accounting Studies*, *18*(3), 692–730.

Gu, S., Kelly, B. T., & Xiu, D. (2019). Empirical asset pricing via machine learning. *Review of Financial Studies* (Forthcoming).

Guida, T., & Coqueret, G. (2018). Ensemble learning applied to quant equity: Gradient boosting in a multifactor framework. In *Big data and machine learning in quantitative investment* (pp. 129–148). Wiley.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. *Journal of Machine Learning Research*, *3*(Mar), 1157–1182.

Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. *Journal of Finance*, *72*(4), 1399–1440.

Harvey, C. R., Liu, Y., & Zhu, H. (2016). ... and the cross-section of expected returns. *Review of Financial Studies*, *29*(1), 5–68.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The elements of statistical learning*. Berlin: Springer.

Huck, N. (2019). Large data sets and machine learning: Applications to statistical arbitrage. *European Journal of Operational Research*, *278*(1), 330–342.

Kelly, B. T., Pruitt, S., & Su, Y. (2019). Characteristics are covariances: A unified model of risk and return. *Journal of Financial Economics*, *134*(3), 501–524.

Koijen, R. S., & Yogo, M. (2019). A demand system approach to asset pricing. *Journal of Political Economy*, *127*(4), 1475–1515.

Köksalan, M., & Şakar, C. T. (2016). An interactive approach to stochastic programming-based portfolio optimization. *Annals of Operations Research*, *245*(1–2), 47–66.

Krauss, C., Do, X. A., & Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. *European Journal of Operational Research*, *259*(2), 689–702.

Kuhn, M., & Johnson, K. (2019). *Feature engineering and selection: A practical approach for predictive models*. London: Chapman & Hall/CRC.

Linnainmaa, J. T., & Roberts, M. R. (2018). The history of the cross-section of stock returns. *Review of Financial Studies*, *31*(7), 2606–2649.

Liu, H., & Motoda, H. (2012). *Feature selection for knowledge discovery and data mining* (Vol. 454). Berlin: Springer.

Loh, W.-Y., et al. (2009). Improving the precision of classification trees. *The Annals of Applied Statistics*, *3*(4), 1710–1737.

Lopez, O., Milhaud, X., Thérond, P.-E., et al. (2016). Tree-based censored regression with applications in insurance. *Electronic Journal of Statistics*, *10*(2), 2685–2716.

Moritz, B., & Zimmermann, T. (2016). Tree-based conditional portfolio sorts: The relation between past and future stock returns. *SSRN Working Paper* 2740751.

Norouzi, M., Collins, M., Johnson, M. A., Fleet, D. J., & Kohli, P. (2015). Efficient non-greedy optimization of decision trees. In *Advances in neural information processing systems* (pp. 1729–1737).

Novy-Marx, R. (2012). Is momentum really momentum? *Journal of Financial Economics*, *103*(3), 429–453.

Pätäri, E., & Leivo, T. (2017). A closer look at value premium: Literature review and synthesis. *Journal of Economic Surveys*, *31*(1), 79–168.

Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. *Expert Systems with Applications*, *42*(1), 259–268.

Romano, J. P., & Wolf, M. (2013). Testing for monotonicity in expected asset returns. *Journal of Empirical Finance*, *23*, 93–116.

Sasane, A. (2016). *Optimization in function spaces*. Mineola: Courier Dover Publications.

Settles, B. (2012). Active learning. *Synthesis Lectures on Artificial Intelligence and Machine Learning*, *6*(1), 1–114.

Stahl, F., & Bramer, M. (2012). Jmax-pruning: A facility for the information theoretic pruning of modular classification rules. *Knowledge-Based Systems*, *29*, 12–19.

Stambaugh, R. F. (1999). Predictive regressions. *Journal of Financial Economics*, *54*(3), 375–421.

Van Dijk, M. A. (2011). Is size dead? A review of the size effect in equity returns. *Journal of Banking & Finance*, *35*(12), 3263–3274.

Yan, Z., Chen, Z., Consigli, G., Liu, J., & Jin, M. (2019). A copula-based scenario tree generation algorithm for multiperiod portfolio selection problems. *Annals of Operations Research* (pp. 1–33).

## Acknowledgements

The authors thank an anonymous referee whose valuable comments have improved the clarity of the paper.


## Appendices


### Proofs

### 1.1 Proof of Proposition 1

The expressions of the conditional means are

Likewise, the total dispersion defined in (4) is equal to

where

Obviously, we work with the asymptotic expressions for analytical tractability. In the sequel, we drop the *N* scaling factor in (16) for notational simplicity. We re-arrange \(\pi ^-\) and \(\pi ^+\) via the conditional density:

where in the third line we have used (2). By the Leibniz integral rule for differentiation, we have

with

Hence, after multiple simplifications the first order derivative satisfies

In (21) and (22), the integrals on the right are equal to zero, which leads to

and

In the equivalence above, we have relied on \(f_X(c)>0\). Indeed, the support of \(f_X\) is an interval on which it is assumed that \(f_X>0\), and the split must occur at some point inside this support (i.e., within the input data), hence \(f_X(c)> 0\). By definition, a split cannot be located to the left or to the right of *all* of the points \(X_i\).

We discuss the second order condition below. The second order derivative is obtained by differentiating (23):

where we write \(f'\) for the derivative of *f*. When the first order condition is met, we have the three cases from Eq. (24):

*Case* \(m^-(c)+m^+(c)=2g(c)\) and \(m^-(c)-m^+(c)\ne 0\): we replace *g*(*c*) in (25) and (28):

$$\begin{aligned} \frac{\partial ^2}{\partial c^2}V(c)&= -2f_X(c)g'(c)\left( m^-(c)-m^+(c)\right) \nonumber \\&\quad + f_X(c) [m^-(c)-m^+(c)]\left( \frac{\partial }{\partial c}m^-(c)+\frac{\partial }{\partial c}m^+(c)\right) , \end{aligned}$$

(29)

which implies that \((m^-(c)-m^+(c))\left( \frac{\partial }{\partial c}m^-(c)+\frac{\partial }{\partial c}m^+(c)-2g'(c) \right) \ge 0\) is required to achieve minimisation. Plugging \(g(c)=(m^-(c)+m^+(c))/2\) in (19) and (20) further simplifies this condition to

$$\begin{aligned}&(m^-(c)-m^+(c))\left[ f_X(c)\frac{m^+(c)-m^-(c)}{2}\left( \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1}\right. \right. \\&\quad \left. \left. +\left( \int _{c}^\infty f_X(y)dy\right) ^{-1} \right) -2g'(c)\right] \ge 0\\&\quad \Leftrightarrow \quad -f_X(c)\frac{(m^+(c)-m^-(c))^2}{2}\\&\quad -2(m^-(c)-m^+(c))\left( \int _{-\infty }^cf_X(y)dy\right) \left( \int _{c}^\infty f_X(y)dy\right) g'(c) \ge 0 \end{aligned}$$

In all generality, it seems impossible to go further than this inequality, but we discuss a special case in the subsequent proof.

*Case* \(m^-(c)-m^+(c)= 0\) and \(2g(c)\ne (m^-(c)+m^+(c))\): lines (25) and (27) vanish, so

$$\begin{aligned} \frac{\partial ^2}{\partial c^2}V(c)&=2f_X(c)\left( m^-(c)\frac{\partial }{\partial c}m^-(c)-m^+(c)\frac{\partial }{\partial c}m^+(c) \right) \\&\quad -2f_X(c)g(c)\left( \frac{\partial }{\partial c}m^-(c)-\frac{\partial }{\partial c}m^+(c) \right) \\&= 2f_X(c)(m^-(c)-g(c)) \frac{\partial }{\partial c}m^-(c)-2f_X(c)(m^+(c)-g(c)) \frac{\partial }{\partial c}m^+(c) \\&= -2[f_X(c)(m^-(c)-g(c))]^2 \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1}\\&\quad -2[f_X(c)(m^+(c)-g(c))]^2 \left( \int _{c}^\infty f_X(y)dy\right) ^{-1} \end{aligned}$$

where we have used expressions (19) and (20). This last line is always negative, thereby implying that the split is *maximising* the dispersion function *V*. It can thus be ruled out.

*Case* \(m^-(c)-m^+(c)= 0\) and \(2g(c)= (m^-(c)+m^+(c))\): by (29), this leads to \(\frac{\partial ^2}{\partial c^2}V(c)=0\), which is inconclusive. This case is clearly degenerate, as it implies \(g(c)=m^-(c)=-m^+(c)\) by adding or subtracting the two conditions. According to the second condition, all three terms must be equal to zero. We rule this case out.

### 1.2 Proof of Proposition 2

First, when \(f_X(x)=1_{\{ x\in [0,1]\}}\), we have \(m^-(c)-m^+(c)=\frac{\int _0^cg(y)dy-c\int _0^1g(y)dy}{c(1-c)}\) and \(m^-(c)+m^+(c)-2g(c)=\frac{h(c)}{c(1-c)}\), where *h* is defined in (7). Hence, by (23),

In addition, *g* is bounded on [0, 1], thus after some simplifications,

This implies that *V* starts by decreasing to the right of zero, and ends by increasing to the left of one. Because \(h(0)=h(1)=0\), it holds that \(V(0)=V(1)=0\). Hence, if there exists at least one root to \(\frac{\partial }{\partial c}V(c)=0\) in (0, 1), then one of them minimizes *V* over this interval.

Next, we list some properties of *h* defined in (7), because the solution to \(\frac{\partial }{\partial c}V(c)=0\) with \(\frac{\partial ^2}{\partial c^2}V(c)\ge 0\) reduces to that of \(h(c)=0\). Plainly,

so that \(h'(0)=\int _0^1g(y)dy-g(0)\) and \(h'(1)=g(1)-\int _0^1g(y)dy\), as well as \(h''(0)=-3g'(0)\) and \(h''(1)=3g'(1)\). Under Condition 1, \(h'(0)\) and \(h'(1)\) have the same sign. Hence, near both 0 and 1, *h* has the same monotone behaviour (increasing or decreasing). Since \(h(0)=h(1)=0\), this requires that \(h'\) switch signs at least twice. Hence a solution to \(h(c)=0\) in (0, 1) does exist, and it is located between the smallest and largest roots of \(h'(x)=0\) contained in (0, 1).
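For reference, combining the two identities stated at the start of this proof yields an explicit form for *h*. This is a reconstruction on our part (Eq. (7) itself is not reproduced here), but it is consistent with all of the properties used above:

$$\begin{aligned} h(c) = (1-2c)\int _0^c g(y)\,dy + c\int _0^1 g(y)\,dy - 2c(1-c)\,g(c). \end{aligned}$$

One checks directly that \(h(0)=h(1)=0\), \(h'(0)=\int _0^1g(y)dy-g(0)\) and \(h'(1)=g(1)-\int _0^1g(y)dy\), in line with the values listed in the previous paragraph.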

### 1.3 Proof of Proposition 3

We start with the first statement. First, from Proposition 1, the optimal split is not affected by the noise structure. Second, the dispersion term *V*(*c*) in (4) can be re-written as

where the terms \(m^\pm (c)\) depend on *X* only. The numerator in (5) is

so that any gain has form

which does not depend on \(\sigma _E^2\).

We then switch to the second statement. We recall \(m^-=\mathbb {E}[g(X) \mathbf{1}_{\{X< c\}}]/\mathbb {P}[X< c]\) and \(m^+=\mathbb {E}[g(X) \mathbf{1}_{\{X> c\}}]/\mathbb {P}[X> c]\). We drop the dependence in *c* for notational convenience. The second term in (5) is equal to

We now write, without loss of generality, \(c=\mathbb {P}[X< c]\) and \(1-c=\mathbb {P}[X> c]\), as if *X* were uniformly distributed. This is only meant to simplify the notation. We have

where, in the last lines, several simplifications occur. The final expression of the gain follows (combining the above with (33)). The fourth line is a simple application of the definition of the conditional expectation (e.g., \(\mathbb {E}[g(X)\mathbf{1}_{\{X<c \}}]=\mathbb {E}[g(X)|X<c ]P[X<c]\)).

### 1.4 Proof of Proposition 4

We have

so that a double differentiation in *x* and *y* leads to

We then recall the inverse c.d.f. of *Y*, \(F_Y^{-1}(q)\), which satisfies \(P[Y\le F_Y^{-1}(q)]=q\). Conditionally on \(Y>F_Y^{-1}(1-q)\) or \(Y<F_Y^{-1}(q)\), the density of *X* is thus defined as

where the third line was obtained by simple substitution. If \(\mu _X\) is the median of *X*, we omit the scaling denominator and compute

where in the second expression, we have used the symmetry \(f(z,\mu _X+x)=f(z,\mu _X-x)\). To prove the first point of the proposition, it suffices to show that the two terms are equal. By the symmetry of *f*, it suffices to show that the integration ranges of each integral are symmetric around zero, i.e., that the middle points of each range are equidistant from zero. These middle points are located at \(F_Y^{-1}(1-q)-a\mu _X-b\) and \(F_Y^{-1}(q)-a\mu _X-b\). If we assume \(\mu _Y=a\mu _X+b\), the sum of these two values is equal to

because \(F_Y^{-1}(1-q)\) and \(F_Y^{-1}(q)\) are located symmetrically around \(\mu _Y\). To complete the proof of the first point of the proposition, it suffices to show that indeed the medians satisfy \(\mu _Y=a\mu _X+b\).

where in the third equality we have used the symmetry in the first variable and in the last one, the symmetry in the second variable.

For the second point of the proposition, we resort to a simplified reasoning to lighten the notation. The main function of interest is \(f(y-a(\mu _X+x)-b,\mu _X+x)\), with integration range \(y\notin \left( F_Y^{-1}(q),F_Y^{-1}(1-q)\right) \). If we substitute \(z=-y+2(a\mu _X+b)=-y+2\mu _Y\), the function becomes, by symmetry in the first variable, \(f(z-a(\mu _X-x)-b,\mu _X+x)\), and the integration range becomes \(z\notin \left( 2\mu _Y-F_Y^{-1}(q),2\mu _Y-F_Y^{-1}(1-q)\right) \). Since \(F_Y^{-1}(q)\) and \(F_Y^{-1}(1-q)\) are equidistant from \(\mu _Y\), a simplification occurs and the interval is simply inverted: \(z\notin \left( F_Y^{-1}(1-q),F_Y^{-1}(q)\right) \). This inversion is then compensated by the opposite sign between *dy* and *dz*. We are now ready to provide the full details. For notational convenience, we introduce the interval \(S_q=\left( F_Y^{-1}(1-q),F_Y^{-1}(q)\right) \).

where in the third equality, we have discarded the \(2\mu _Y\) terms which cancel out and have used two symmetry properties: that of the denominator which was proven above and that of the second variable in *f*.

### Details on the quadratic form

We provide further insights on the case when \(g(x)=(x-b)^2\). Notably, we plot the functions that help shed light on the main issues. Essentially, our results are confirmed when looking at the objective function *V* defined in (16). Up to a factor that does not depend on *c* (the quadratic term in \(z^2\) in \(\pi ^-\) and \(\pi ^+\) defined in (17) and (18)), it can be evaluated as

and we plot the second part for four values of *b* in Fig. 16 below.

In all cases, the objective function immediately starts to decrease at zero. The main difference is that for \(b\in (1/3,2/3)\), the decrease is followed by a small bump, which we observe for \(b=0.4\) and \(b=0.45\). The cause of the bump is not visually obvious but can be summarized as follows. In the region from \(c=0.20\) to \(c=0.25\), the decrease in \(\pi ^+\) is slow and is more than compensated by the increase in \(\pi ^-\), hence the sum of the two increases slightly. However, past \(c=0.5\), the decrease in \(\pi ^+\) is sharp while the increase in \(\pi ^-\) is slower. Hence, the true minimum is located between 0.5 and 1 (0.786 in this case).

Lastly, we highlight that the bump implies a local maximum of *V*. It is a zero of the first-order condition and was treated in the proof of Proposition 1, in the case when \(m^-(c)-m^+(c)= 0\).
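The 0.786 figure can be checked numerically. The sketch below is our own: we minimize, over a grid, the standard within-group sum of squared deviations for a uniform *X* on [0, 1] (our stand-in for the objective *V*), with \(g(x)=(x-b)^2\) and \(b=0.4\), and compare the minimizer with the closed-form root \(x^+\) from Eq. (8) (taking \(a_2=1\), \(a_1=-2b\)).

```python
# Numeric check of the split location for g(x) = (x - b)^2, b = 0.4.
import numpy as np

b = 0.4
g = lambda x: (x - b) ** 2

def V(c, n=20_000):
    # Within-group dispersion for a split at c, X uniform on [0, 1].
    y = np.linspace(0, 1, n)
    left, right = y[y < c], y[y >= c]
    return (np.sum((g(left) - g(left).mean()) ** 2)
            + np.sum((g(right) - g(right).mean()) ** 2)) / n

grid = np.linspace(0.01, 0.99, 981)
c_star = grid[np.argmin([V(c) for c in grid])]

# Closed-form candidate from Eq. (8) with a2 = 1, a1 = -2b.
a1, a2 = -2 * b, 1.0
x_plus = (a2 - 3 * a1 + np.sqrt(17 * a2**2 + 18 * a1 * a2 + 9 * a1**2)) / (8 * a2)
```

The grid minimizer lands on the larger root \(x^+\approx 0.786\), confirming that the global minimum sits between 0.5 and 1 rather than at the other interior root.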

### Impact of filter on exponential generator

See Fig. 17.

### Features

See Table 6.


## About this article

### Cite this article

Coqueret, G., Guida, T. Training trees on tails with applications to portfolio choice.
*Ann Oper Res* **288**, 181–221 (2020). https://doi.org/10.1007/s10479-020-03539-2
