
Training trees on tails with applications to portfolio choice

  • Original Research
  • Annals of Operations Research


In this article, we investigate the impact of truncating training data when fitting regression trees. We argue that training times can be curtailed by reducing the training sample without any loss in out-of-sample accuracy, as long as the prediction model is trained on the tails of the dependent variable, that is, when ‘average’ observations are discarded from the training sample. Filtering instances affects which features are selected to yield the splits and can help reduce overfitting by favoring predictors with monotonic impacts on the dependent variable. We test this technique in an out-of-sample portfolio selection exercise, which confirms its benefits. The implications of our results are decisive for time-consuming tasks such as hyperparameter tuning and validation.
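To make the filtering scheme concrete, here is a minimal simulation (our own illustrative sketch, not the authors' code: the one-split stump_split helper and the generator g(x) = x^3 are hypothetical choices). It fits the best single split on a full sample and on its tail-filtered counterpart, where only extreme values of the dependent variable are kept:

```python
import numpy as np

def stump_split(x, y):
    """Location of the single split minimizing the within-leaf sum of squares."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_c, best_sse = xs[0], np.inf
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_c = sse, 0.5 * (xs[i - 1] + xs[i])
    return best_c

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 1, n)
y = x ** 3 + 0.03 * rng.normal(size=n)  # monotonic generator plus noise

# Extreme filter: discard 'average' observations, keeping only the tails of
# the dependent variable (q = 0.15 on each side, i.e. ~30% of the sample).
lo, hi = np.quantile(y, [0.15, 0.85])
keep = (y <= lo) | (y >= hi)

c_full = stump_split(x, y)              # split trained on all observations
c_tail = stump_split(x[keep], y[keep])  # split trained on the tails only
print(keep.mean(), c_full, c_tail)
```

With a monotonic generator, the filter keeps the intended share of observations, and the split trained on the tails remains informative while using less than a third of the data.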





  1. See, e.g., Ballings et al. (2015), Patel et al. (2015), Moritz and Zimmermann (2016), Krauss et al. (2017), Gu et al. (2019) and Huck (2019). Some papers use tree-based scenarios, like Köksalan and Şakar (2016) and Yan et al. (2019), but those are very different tools.

  2. One reason why monotonic patterns matter is that they are robust, hence an investor can confidently rely on them to build portfolios. However, in practice, monotonicity tests are seldom carried out even though they exist; Romano and Wolf (2013) detail one such procedure, for instance. For simplicity, so-called factors are often determined by looking at differences related to extreme values of features. The whole distribution of features is rarely exploited, for at least two reasons. First, it is less straightforward and harder to assess. Second, it is likely to reduce the significance of factors: because of the publication bias towards positive results, academics seek statistical significance and thus avoid tests that are likely to curtail it.

  3. The material can be accessed here:

  4. In this particular example, allocating to highly volatile stocks resembles a lottery bet and is as such very risky indeed.

  5. The multivariate analysis involving all features (and how they compete) will come later in the paper, from Sect. 5.2 onwards.

  6. It is generally admitted that monotone transformations of predictive variables do not substantially alter the structure of the tree or, more generally, of the predictions of other ML engines. We resort to this technique for analytical tractability, but we highlight in all transparency that it can have non-negligible effects, as is shown in Galili and Meilijson (2016). Furthermore, we recall that normalisations and scaling procedures are common practice in machine learning: it is known that neural networks perform better when all variables have the same scale, either because they are located inside a bounded interval (usually \([-1,1]\) or [0, 1]) or because they are standardized to reach unit variance. Finally, it can also be argued that the prior transformation of data points could be directly embedded in the definition of the generator g. In any case, such scaling procedures are now commonplace in recent asset pricing contributions: we refer to Kelly et al. (2019) and Koijen and Yogo (2019), to name but a few.

  7. This is justified by Property 1 of Corollary 1. In the general case, \(g(x)=a_2x^2+a_1x+a_0\) with \(a_2\ne 0\), the function h in (7) is a polynomial of degree four and thus has four roots: 0, 1 and

    $$\begin{aligned} x^\pm =\frac{a_2-3a_1\pm \sqrt{17a_2^2+18a_1a_2+9a_1^2}}{8a_2}, \end{aligned}$$

    where the term inside the square root is positive everywhere on \(\mathbb {R}^2\). Notably, \(a_0\) plays no role in the splitting location, which makes sense. For the sake of simplicity, we choose to reduce the study to a one-parameter family.

  8. The full details of the derivation are available upon request.

  9. More precisely, the calculation is that of the loess (locally estimated scatterplot smoothing) function.

  10. This interval leads to retaining between 30 and 50% of the original sample. These are the values we use in practice: keeping between roughly one third and one half of the original data is a reasonable compromise.

  11. For a review on asset pricing anomaly detection, we refer to Goyal (2012).

  12. The notion of a ‘significant’ difference is subject to an ongoing debate among researchers. We refer to Harvey et al. (2016) and Harvey (2017) for further discussion of the subject.

  13. Trees usually have depths between 3 and 6.

  14. It is common practice in the money management industry to fix the size of the portfolio policies.

  15. The data and the code can be accessed here:

  16. In linear models, the problem of regressing monthly returns on autocorrelated predictors has been well documented since the seminal work of Stambaugh (1999). To circumvent this problem, we resort to a dependent variable that behaves like the predictors. We hope to unveil patterns that are long-lasting, i.e., that hold out-of-sample.

  17. An important alternative is the random filter. Nonetheless, it is much more expensive computationally because its performance is strongly dependent on the random seed: robust results require at least 100 iterations of each backtest. In partially unreported results, we document that the random filter is inferior to the extreme filter, especially when the filter is intense.

  18. We saw in the second case in the previous proof that if the first term of (30) is zero, then V is maximized.
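As an aside to footnote 6, the scaling step can be sketched as follows (an illustrative rank-based mapping to (0, 1); the paper does not prescribe this exact transformation):

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.lognormal(size=500)        # heavily skewed raw feature

# Rank-based (monotone) rescaling to the open interval (0, 1).
ranks = raw.argsort().argsort()      # 0 .. n-1, assumes no ties
scaled = (ranks + 1) / (len(raw) + 1)

# The transformation is monotone: the ordering of observations is unchanged,
# so a tree split separates the same observations before and after scaling.
print(scaled.min(), scaled.max())
```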
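The seed dependence of the random filter mentioned in footnote 17 can be illustrated with a small simulation (our own sketch, not the paper's backtest; stump_split is a hypothetical one-split helper):

```python
import numpy as np

def stump_split(x, y):
    """Best single split minimizing the within-leaf sum of squares."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_c, best_sse = xs[0], np.inf
    for i in range(1, len(xs)):
        l, r = ys[:i], ys[i:]
        sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_c = sse, 0.5 * (xs[i - 1] + xs[i])
    return best_c

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0, 1, n)
y = x ** 3 + 0.03 * rng.normal(size=n)

# Extreme filter: keeps the same 30% of tail observations on every run.
lo, hi = np.quantile(y, [0.15, 0.85])
tail = (y <= lo) | (y >= hi)
c_extreme = stump_split(x[tail], y[tail])

# Random filter: a 30% subsample whose estimated split moves with the seed.
splits = np.array([
    stump_split(x[idx], y[idx])
    for idx in (np.random.default_rng(s).choice(n, 300, replace=False)
                for s in range(50))
])
print(c_extreme, splits.mean(), splits.std())
```

The extreme filter is deterministic given the data, whereas the random filter's split estimate disperses across seeds, which is why averaging over many iterations is needed before drawing conclusions.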


  • Ali, Ö. G., & Yaman, K. (2013). Selecting rows and columns for training support vector regression models with large retail datasets. European Journal of Operational Research, 226(3), 471–480.


  • Ammann, M., Coqueret, G., & Schade, J.-P. (2016). Characteristics-based portfolio choice with leverage constraints. Journal of Banking & Finance, 70, 23–37.


  • Asness, C., Frazzini, A., Israel, R., Moskowitz, T. J., & Pedersen, L. H. (2018). Size matters, if you control your junk. Journal of Financial Economics, 129(3), 479–509.


  • Asness, C. S., Frazzini, A., Israel, R., & Moskowitz, T. J. (2014a). Fact, fiction and momentum investing. Journal of Portfolio Management, 40(5), 75–92.


  • Asness, C. S., Frazzini, A., & Pedersen, L. H. (2014b). Quality minus junk. Review of Accounting Studies, 24(1), 1–79.


  • Baker, M., Bradley, B., & Taliaferro, R. (2014). The low-risk anomaly: A decomposition into micro and macro effects. Financial Analysts Journal, 70(2), 43–58.


  • Ballings, M., Van den Poel, D., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42(20), 7046–7056.


  • Barroso, P., & Santa-Clara, P. (2015). Momentum has its moments. Journal of Financial Economics, 116(1), 111–120.


  • Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039–1082.


  • Brandt, M. W., Santa-Clara, P., & Valkanov, R. (2009). Parametric portfolio policies: Exploiting characteristics in the cross-section of equity returns. Review of Financial Studies, 22(9), 3411–3447.


  • Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. (1984). Classification and regression trees. London: Chapman and Hall.


  • Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16–28.


  • Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 785–794). ACM.

  • Chou, P. A. (1991). Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis & Machine Intelligence, 4, 340–354.


  • Daniel, K., & Moskowitz, T. J. (2016). Momentum crashes. Journal of Financial Economics, 122(2), 221–247.


  • DeMiguel, V., Garlappi, L., & Uppal, R. (2007). Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy? Review of Financial Studies, 22(5), 1915–1953.


  • Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis & Machine Intelligence, 19(5), 476–491.


  • Fama, E. F., & French, K. R. (1992). The cross-section of expected stock returns. Journal of Finance, 47(2), 427–465.


  • Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1), 1–22.


  • Fu, X., Du, J., Guo, Y., Liu, M., Dong, T., & Duan, X. (2018). A machine learning framework for stock selection. Preprint arXiv:1806.01743.

  • Galili, T., & Meilijson, I. (2016). Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models. Preprint arXiv:1611.04561.

  • Goyal, A. (2012). Empirical cross-sectional asset pricing: a survey. Financial Markets and Portfolio Management, 26(1), 3–38.


  • Green, J., Hand, J. R., & Zhang, X. F. (2013). The supraview of return predictive signals. Review of Accounting Studies, 18(3), 692–730.


  • Gu, S., Kelly, B. T., & Xiu, D. (2019). Empirical asset pricing via machine learning. Review of Financial Studies (forthcoming).

  • Guida, T., & Coqueret, G. (2018). Ensemble learning applied to quant equity: Gradient boosting in a multifactor framework. In Big data and machine learning in quantitative investment (pp. 129–148). Wiley.

  • Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157–1182.


  • Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. Journal of Finance, 72(4), 1399–1440.


  • Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68.


  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Berlin: Springer.


  • Huck, N. (2019). Large data sets and machine learning: Applications to statistical arbitrage. European Journal of Operational Research, 278(1), 330–342.


  • Kelly, B. T., Pruitt, S., & Su, Y. (2019). Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics, 134(3), 501–524.


  • Koijen, R. S., & Yogo, M. (2019). A demand system approach to asset pricing. Journal of Political Economy, 127(4), 1475–1515.


  • Köksalan, M., & Şakar, C. T. (2016). An interactive approach to stochastic programming-based portfolio optimization. Annals of Operations Research, 245(1–2), 47–66.


  • Krauss, C., Do, X. A., & Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2), 689–702.


  • Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. London: Chapman & Hall/CRC.


  • Linnainmaa, J. T., & Roberts, M. R. (2018). The history of the cross-section of stock returns. Review of Financial Studies, 31(7), 2606–2649.


  • Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol. 454). Berlin: Springer.


  • Loh, W.-Y., et al. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 3(4), 1710–1737.


  • Lopez, O., Milhaud, X., Thérond, P.-E., et al. (2016). Tree-based censored regression with applications in insurance. Electronic Journal of Statistics, 10(2), 2685–2716.


  • Moritz, B., & Zimmermann, T. (2016). Tree-based conditional portfolio sorts: The relation between past and future stock returns. SSRN Working Paper 2740751.

  • Norouzi, M., Collins, M., Johnson, M. A., Fleet, D. J., & Kohli, P. (2015). Efficient non-greedy optimization of decision trees. In Advances in neural information processing systems (pp. 1729–1737).

  • Novy-Marx, R. (2012). Is momentum really momentum? Journal of Financial Economics, 103(3), 429–453.


  • Pätäri, E., & Leivo, T. (2017). A closer look at value premium: Literature review and synthesis. Journal of Economic Surveys, 31(1), 79–168.


  • Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1), 259–268.


  • Romano, J. P., & Wolf, M. (2013). Testing for monotonicity in expected asset returns. Journal of Empirical Finance, 23, 93–116.


  • Sasane, A. (2016). Optimization in function spaces. Mineola: Courier Dover Publications.


  • Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1), 1–114.


  • Stahl, F., & Bramer, M. (2012). Jmax-pruning: A facility for the information theoretic pruning of modular classification rules. Knowledge-Based Systems, 29, 12–19.


  • Stambaugh, R. F. (1999). Predictive regressions. Journal of Financial Economics, 54(3), 375–421.


  • Van Dijk, M. A. (2011). Is size dead? A review of the size effect in equity returns. Journal of Banking & Finance, 35(12), 3263–3274.


  • Yan, Z., Chen, Z., Consigli, G., Liu, J., & Jin, M. (2019). A copula-based scenario tree generation algorithm for multiperiod portfolio selection problems. Annals of Operations Research (pp. 1–33).



The authors thank an anonymous referee for valuable comments that have improved the clarity of the paper.

Author information



Corresponding author

Correspondence to Guillaume Coqueret.





Appendix

1.1 Proof of Proposition 1

The expressions of the conditional means are

$$\begin{aligned} m^-(c)&= \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1} \int _{-\infty }^cg(y)f_X(y)dy \\ m^+(c)&= \left( \int _{c}^\infty f_X(y)dy\right) ^{-1} \int _{c}^\infty g(y)f_X(y)dy \end{aligned}$$

Likewise, the total dispersion defined in (4) is equal to

$$\begin{aligned} V(c) = \pi ^-(c) + \pi ^+(c), \end{aligned}$$
where


$$\begin{aligned} \pi ^-(c)&=\int _{-\infty }^c\int _{ \mathbb {R}}\left( g(x)+z-m^-(c) \right) ^2f(z,x)dxdz \end{aligned}$$
$$\begin{aligned} \pi ^+(c)&=\int _{c}^\infty \int _{ \mathbb {R}}\left( g(x)+z-m^+(c) \right) ^2f(z,x)dxdz \end{aligned}$$

Naturally, we work with the asymptotic expressions for analytical tractability. In the sequel, we drop the N scaling factor in (16) for notational simplicity. We re-arrange \(\pi ^-\) and \(\pi ^+\) via the conditional density:

$$\begin{aligned} \pi ^-(c)&=\int _{-\infty }^c\int _{ \mathbb {R}}\left( g(x)+z-m^-(c) \right) ^2f(z|x)f_X(x)dxdz \\&=\int _{-\infty }^c\int _{ \mathbb {R}}\left[ (g(x)-m^-(c))^2+z^2-2z(g(x)-m^-(c))\right] f(z|x)f_X(x)dxdz \\&=\int _{-\infty }^c\int _{ \mathbb {R}}\left[ (g(x)-m^-(c))^2+z^2\right] f(z,x)dxdz \\ \pi ^+(c)&=\int _{c}^\infty \int _{ \mathbb {R}}\left[ (g(x)-m^+(c))^2+z^2\right] f(z,x)dxdz \end{aligned}$$

where in the third line we have used (2). By the Leibniz integral rule for differentiation, we have

$$\begin{aligned} \frac{\partial }{\partial c}\pi ^-(c)&= \int _\mathbb {R}\left[ (g(c)-m^-(c))^2+z^2\right] f(z,c)dz \\&\quad - 2 \int _{-\infty }^c\int _{ \mathbb {R}}\left( \frac{\partial }{\partial c}m^-(c)\right) (g(x) - m^-(c))f(z,x)dxdz \\ \frac{\partial }{\partial c}\pi ^+(c)&= - \int _\mathbb {R}\left[ (g(c)-m^+(c))^2+z^2\right] f(z,c)dz \\&\quad - 2 \int _{c}^\infty \int _{ \mathbb {R}}\left( \frac{\partial }{\partial c}m^+(c)\right) (g(x) - m^+(c))f(z,x)dxdz \end{aligned}$$
with


$$\begin{aligned} \frac{\partial }{\partial c}m^-(c)&=\left( \int _{-\infty }^cf_X(y)dy\right) ^{-1} g(c) f_X(c) - f_X(c) \int _{-\infty }^cg(y)f_X(y)dy \left( \int _{-\infty }^cf_X(y)dy\right) ^{-2} \nonumber \\&= \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1} f_X(c) \left( g(c)- m^-(c)\right) , \end{aligned}$$
$$\begin{aligned} \frac{\partial }{\partial c}m^+(c)&=-\left( \int _{c}^\infty f_X(y)dy\right) ^{-1} f_X(c) \left( g(c)- m^+(c)\right) . \end{aligned}$$

Hence, after multiple simplifications, the first-order derivative satisfies

$$\begin{aligned} \frac{\partial }{\partial c}V(c)=&\frac{\partial }{\partial c}\pi ^-(c) +\frac{\partial }{\partial c}\pi ^+(c) \nonumber \\&=\int _\mathbb {R}\left[ m^-(c)^2-m^+(c)^2-2g(c)(m^-(c)-m^+(c)) \right] f(z,c) dz \nonumber \\&\quad -2 \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1} f_X(c) \left( g(c)- m^-(c)\right) \times \int _{-\infty }^c(g(x)-m^-(c))f_X(x)dx \end{aligned}$$
$$\begin{aligned}&\qquad +2\left( \int _{c}^\infty f_X(y)dy\right) ^{-1} f_X(c) \left( g(c)- m^+(c)\right) \times \int _{c}^\infty (g(x)-m^+(c))f_X(x)dx \end{aligned}$$

In (21) and (22), the integrals on the right are equal to zero, which leads to

$$\begin{aligned} \frac{\partial }{\partial c}V(c)&= f_X(c)\left[ m^-(c)^2 - m^+(c)^2 -2g(c) (m^-(c)-m^+(c))\right] \nonumber \\&= f_X(c)\left[ m^-(c) + m^+(c) -2g(c)\right] \left( m^-(c)-m^+(c)\right) , \end{aligned}$$
so that


$$\begin{aligned} \frac{\partial }{\partial c}V(c)=0 \quad \Leftrightarrow \quad&\left\{ \begin{array}{l}m^-(c)+m^+(c)=2g(c) \\ \quad \text {and / or} \\ m^-(c)-m^+(c)=0. \end{array} \right. \end{aligned}$$

In the equivalence above, we have relied on \(f_X(c)>0\). Indeed, the support of \(f_X\) is an interval on which, by assumption, \(f_X>0\). The split must occur at some point inside the support of \(f_X\) (i.e., within the input data), hence \(f_X(c)> 0\): by definition, a split cannot be located to the left or to the right of all of the points \(X_i\).

We now discuss the second-order condition. The second-order derivative is obtained by differentiating (23):

$$\begin{aligned} \frac{\partial ^2}{\partial c^2}V(c)&=f_X'(c)\left[ m^-(c) - m^+(c)\right] \left[ m^-(c)+m^+(c)-2g(c) \right] \end{aligned}$$
$$\begin{aligned}&\quad +2f_X(c)\left( m^-(c)\frac{\partial }{\partial c}m^-(c)-m^+(c)\frac{\partial }{\partial c}m^+(c) \right) \end{aligned}$$
$$\begin{aligned}&\quad -2f_X(c)(g'(c)\left( m^-(c)-m^+(c)\right) ) \end{aligned}$$
$$\begin{aligned}&\quad -2f_X(c)g(c)\left( \frac{\partial }{\partial c}m^-(c)-\frac{\partial }{\partial c}m^+(c) \right) , \end{aligned}$$

where we write \(f'\) for the derivative of f. When the first order condition is met, we have the three cases from Eq. (24):

  • Case \(m^-(c)+m^+(c)=2g(c)\) and \(m^-(c)-m^+(c)\ne 0\): we replace g(c) in (25) and (28)

    $$\begin{aligned} \frac{\partial ^2}{\partial c^2}V(c)&= -2f_X(c)g'(c)\left( m^-(c)-m^+(c)\right) \nonumber \\&\quad + f_X(c) [m^-(c)-m^+(c)]\left( \frac{\partial }{\partial c}m^-(c)+\frac{\partial }{\partial c}m^+(c)\right) , \end{aligned}$$

    which implies that \((m^-(c)-m^+(c))\left( \frac{\partial }{\partial c}m^-(c)+\frac{\partial }{\partial c}m^+(c)-2g'(c) \right) \ge 0\) is required to achieve minimisation. Plugging \(g(c)=(m^-(c)+m^+(c))/2\) in (19) and (20) further simplifies this condition to

    $$\begin{aligned}&(m^-(c)-m^+(c))\left[ f_X(c)\frac{m^+(c)-m^-(c)}{2}\left( \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1}\right. \right. \\&\quad \left. \left. +\left( \int _{c}^\infty f_X(y)dy\right) ^{-1} \right) -2g'(c)\right] \ge 0\\&\quad \Leftrightarrow \quad -f_X(c)\frac{(m^+(c)-m^-(c))^2}{2}\\&\quad -2(m^-(c)-m^+(c))\left( \int _{-\infty }^cf_X(y)dy\right) \left( \int _{c}^\infty f_X(y)dy\right) g'(c) \ge 0 \end{aligned}$$

    In full generality, it seems impossible to go beyond this inequality, but we discuss a special case in the subsequent proof.

  • Case \(m^-(c)-m^+(c)= 0\) and \(2g(c)\ne (m^-(c)+m^+(c))\): lines (25) and (27) vanish, so

    $$\begin{aligned} \frac{\partial ^2}{\partial c^2}V(c)&=2f_X(c)\left( m^-(c)\frac{\partial }{\partial c}m^-(c)-m^+(c)\frac{\partial }{\partial c}m^+(c) \right) \\&\quad -2f_X(c)g(c)\left( \frac{\partial }{\partial c}m^-(c)-\frac{\partial }{\partial c}m^+(c) \right) \\&= 2f_X(c)(m^-(c)-g(c)) \frac{\partial }{\partial c}m^-(c)-2f_X(c)(m^+(c)-g(c)) \frac{\partial }{\partial c}m^+(c) \\&= -2[f_X(c)(m^-(c)-g(c))]^2 \left( \int _{-\infty }^cf_X(y)dy\right) ^{-1}\\&\quad -2[f_X(c)(m^+(c)-g(c))]^2 \left( \int _{c}^\infty f_X(y)dy\right) ^{-1} \end{aligned}$$

    where we have used expressions (19) and (20). This last line is always negative, thereby implying that the split is maximising the dispersion function V. It can thus be ruled out.

  • Case \(m^-(c)-m^+(c)= 0\) and \(2g(c)= (m^-(c)+m^+(c))\): by (29), this leads to \(\frac{\partial ^2}{\partial c^2}V(c)=0\), which is inconclusive. This case is clearly degenerate, as combining the two conditions implies \(g(c)=m^-(c)=m^+(c)\), i.e., all three quantities must be equal. We rule this case out.

1.2 Proof of Proposition 2

First, when \(f_X(x)=1_{\{ x\in [0,1]\}}\), we have \(m^-(c)-m^+(c)=\frac{\int _0^cg(y)dy-c\int _0^1g(y)dy}{c(1-c)}\) and \(m^-(c)+m^+(c)-2g(c)=\frac{h(c)}{c(1-c)}\), where h is defined in (7). Hence, by (23),

$$\begin{aligned} \frac{\partial }{\partial c}V(c)=\frac{\int _0^cg(y)dy-c\int _0^1g(y)dy}{c(1-c)} \times \frac{h(c)}{c(1-c)}. \end{aligned}$$

In addition, g is bounded on [0, 1], thus after some simplifications,

$$\begin{aligned}&\underset{c \rightarrow 0^+}{\text {lim}}\frac{\partial }{\partial c}V(c)=-\left( g(0)-\int _0^1g(y)dy\right) ^2\le 0 \quad \text {and}\\&\underset{c \rightarrow 1^-}{\text {lim}}\frac{\partial }{\partial c}V(c)=\left( g(1)-\int _0^1g(y)dy\right) ^2\ge 0. \end{aligned}$$

This implies that V starts by decreasing to the right of zero and ends by increasing to the left of one (here, we have used \(h(0)=h(1)=0\)). Hence, there exists at least one root of \(\frac{\partial }{\partial c}V(c)=0\) in (0, 1), and one of these roots minimizes V over this interval.

Next, we list some properties of h defined in (7), because the solution to \(\frac{\partial }{\partial c}V(c)=0\) with \(\frac{\partial ^2}{\partial c^2}V(c)\ge 0\) reduces to that of \(h(c)=0\) (see footnote 18). Plainly,

$$\begin{aligned} h'(c)&=-2\int _0^cg(y)dy+\int _0^1g(y)dy-(1-2c)g(c)-2c(1-c)g'(c), \end{aligned}$$
$$\begin{aligned} h''(c)&= -3(1-2c)g'(c)-2c(1-c)g''(c), \end{aligned}$$

so that \(h'(0)=\int _0^1g(y)dy-g(0)\) and \(h'(1)=g(1)-\int _0^1g(y)dy\), as well as \(h''(0)=-3g'(0)\) and \(h''(1)=3g'(1)\). Under Condition 1, \(h'(0)\) and \(h'(1)\) have the same sign. Hence, near both 0 and 1, h displays the same monotonic behaviour (increasing or decreasing). Since \(h(0)=h(1)=0\), this requires that \(h'\) switch signs at least twice. Hence the solution to (7) does exist and is located between the smallest and largest roots of \(h'(x)=0\) contained in (0, 1).
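As a numerical sanity check of this argument (our own illustration), one can take \(g(x)=e^x\) with X uniform on [0, 1], use the relation \(m^-(c)+m^+(c)-2g(c)=h(c)/(c(1-c))\), together with the expression of V(c) derived in the proof of Proposition 3, and verify that the grid minimizer of V coincides with the interior root of h:

```python
import numpy as np

# Uniform X on [0, 1], generator g(x) = exp(x).
G = lambda c: np.exp(c) - 1.0        # integral of g over [0, c]
I1 = np.e - 1.0                      # integral of g over [0, 1]

def h(c):
    # h(c) = (1-c) int_0^c g + c int_c^1 g - 2c(1-c) g(c)
    return (1 - c) * G(c) + c * (I1 - G(c)) - 2 * c * (1 - c) * np.exp(c)

def V(c):
    # V(c) = E[g(X)^2] - c m^-(c)^2 - (1-c) m^+(c)^2  (noise term dropped)
    Eg2 = (np.e ** 2 - 1) / 2
    m_minus = G(c) / c
    m_plus = (I1 - G(c)) / (1 - c)
    return Eg2 - c * m_minus ** 2 - (1 - c) * m_plus ** 2

cs = np.linspace(1e-4, 1 - 1e-4, 100_001)
c_star = cs[np.argmin(V(cs))]                       # grid minimizer of V
hv = h(cs)
cross = np.where(np.sign(hv[:-1]) != np.sign(hv[1:]))[0][0]
c_root = cs[cross]                                  # interior root of h
print(c_star, c_root)
```

For this increasing generator, h is positive near 0 and negative near 1, so it has a single interior root, which is where V attains its minimum.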

1.3 Proof of Proposition 3

We start with the first statement. First of all, from Proposition 1, we have that the optimal split is not affected by the noise structure. Second, the dispersion term V(c) in (4) can be re-written as

$$\begin{aligned} V(c)= \mathbb {E}\left[ (g(X)-m^-(c))^2\mathbf{1}_{\{X<c \}} \right] +\mathbb {E}\left[ (g(X)-m^+(c))^2\mathbf{1}_{\{X>c \}} \right] +\sigma _E^2, \end{aligned}$$

where the terms \(m^\pm (c)\) depend on X only. The numerator in (5) is

$$\begin{aligned} \mathbb {V}[Y]=\mathbb {E}[(g(X)+E-\mathbb {E}[g(X)])^2]=\mathbb {E}[g(X)^2]-\mathbb {E}[g(X)]^2+\sigma _E^2, \end{aligned}$$

so that the gain has the form

$$\begin{aligned} G=\mathbb {E}[g(X)^2]-\mathbb {E}[g(X)]^2- \mathbb {E}\left[ (g(X)-m^-(c))^2\mathbf{1}_{\{X<c \}} \right] -\mathbb {E}\left[ (g(X)-m^+(c))^2\mathbf{1}_{\{X>c \}} \right] , \end{aligned}$$

which does not depend on \(\sigma _E^2\).

We then switch to the second statement. We recall that \(m^-=\mathbb {E}[g(X) \mathbf{1}_{\{X< c\}}]/\mathbb {P}[X< c]\) and \(m^+=\mathbb {E}[g(X) \mathbf{1}_{\{X> c\}}]/\mathbb {P}[X> c]\). We drop the dependence on c for notational convenience. The second term in (5) is equal to

$$\begin{aligned} V(c)&= \mathbb {E}\left[ \left( g(X)-m^- \right) ^2 \mathbf{1}_{\{X< c\}}\right] +\mathbb {E}\left[ \left( g(X)-m^+ \right) ^2 \mathbf{1}_{\{X> c\}}\right] +\sigma _E^2 \\&= \mathbb {E}[g(X)^2] + (m^-)^2\mathbb {P}[X< c]+ (m^+)^2\mathbb {P}[X> c] -2m^-\mathbb {E}[g(X) \mathbf{1}_{\{X< c\}}]\\&\quad -2m^+\mathbb {E}[g(X) \mathbf{1}_{\{X> c\}}]+\sigma _E^2 \\&= \mathbb {E}[g(X)^2] - (m^-)^2\mathbb {P}[X< c]- (m^+)^2\mathbb {P}[X> c] +\sigma _E^2 \end{aligned}$$

We now write, without loss of generality, \(c=\mathbb {P}[X< c]\) and \(1-c=\mathbb {P}[X> c]\), as if X were uniformly distributed. This is only meant to simplify the notation. We have

$$\begin{aligned} V(c)&= \mathbb {E}[g(X)^2] -\mathbb {E}[g(X)]^2 +\mathbb {E}[g(X)]^2 -\frac{\mathbb {E}[g(X)\mathbf{1}_{\{X<c \}}]^2}{c}-\frac{\mathbb {E}[g(X)\mathbf{1}_{\{X>c \}}]^2}{1-c}+\sigma _E^2 \\&= \mathbb {E}[g(X)^2] -\mathbb {E}[g(X)]^2 \\&\quad +\frac{c(1-c)\mathbb {E}[g(X)(\mathbf{1}_{\{X<c \}}+\mathbf{1}_{\{X>c \}})]^2-(1-c)\mathbb {E}[g(X)\mathbf{1}_{\{X<c \}}]^2-c\mathbb {E}[g(X)\mathbf{1}_{\{X>c \}}]^2}{c(1-c)} +\sigma _E^2 \\&=\mathbb {E}[g(X)^2] -\mathbb {E}[g(X)]^2 - \frac{\left( (1-c)\mathbb {E}[g(X)\mathbf{1}_{\{X<c \}}]-c\mathbb {E}[g(X)\mathbf{1}_{\{X>c \}}] \right) ^2}{c(1-c)}+\sigma _E^2 \\&= \mathbb {V}[Y]-c(1-c)\left( \mathbb {E}[g(X)|X<c ]-\mathbb {E}[g(X)|X>c ] \right) ^2 \end{aligned}$$

where, in the last lines, several simplifications occur. The final expression of the gain follows by combining the above with (33). The fourth line is a simple application of the definition of the conditional expectation (e.g., \(\mathbb {E}[g(X)\mathbf{1}_{\{X<c \}}]=\mathbb {E}[g(X)|X<c ]\mathbb {P}[X<c]\)).
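The decomposition \(V(c)=\mathbb {V}[Y]-c(1-c)\left( m^-(c)-m^+(c)\right) ^2\) can be verified numerically (our own illustration; the midpoint grid stands in for a uniform X, the noise term is dropped on both sides, and the sine generator is an arbitrary choice):

```python
import numpy as np

g = lambda x: np.sin(3 * x)          # any bounded generator works here
N = 100_000
xs = (np.arange(N) + 0.5) / N        # midpoint grid approximating X ~ U(0, 1)
gx = g(xs)

for c in (0.2, 0.5, 0.73):
    left = xs < c
    p = left.mean()                                  # P[X < c]
    m_minus, m_plus = gx[left].mean(), gx[~left].mean()
    # within-leaf dispersion (direct definition)
    V_direct = np.mean((gx - np.where(left, m_minus, m_plus)) ** 2)
    # the closed form derived above: Var - c(1-c)(m^- - m^+)^2
    V_formula = gx.var() - p * (1 - p) * (m_minus - m_plus) ** 2
    assert abs(V_direct - V_formula) < 1e-10
print("identity verified")
```

The identity is the classical within/between variance decomposition, so it holds exactly for any split point and any generator.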

1.4 Proof of Proposition 4

We have

$$\begin{aligned} P[X \le x, Y \le y]=\int _{v \in [0,x]}\int _{z\in [-\infty ,y-ax-b]}f(z,v)dvdz, \end{aligned}$$

so that a double differentiation in x and y leads to

$$\begin{aligned} P[X\in dx, Y \in dy]=f(y-ax-b,x) dx dy. \end{aligned}$$

We then recall the inverse c.d.f. function of Y, \(F_Y^{-1}(q)\), which satisfies \(P[y\le F_Y^{-1}(q)]=q\). Conditionally on \(Y>F_Y^{-1}(1-q)\) or \(Y<F_Y^{-1}(q)\), the density of X is thus defined as

$$\begin{aligned} P_q(x)&=P\left[ X\in dx|\{(Y>F_Y^{-1}(1-q))\cup (Y<F_Y^{-1}(q)) \}\right] \end{aligned}$$
$$\begin{aligned}&=\frac{\int _{(y>F_Y^{-1}(1-q))\cup (y<F_Y^{-1}(q))}f(y-ax-b,x)dy}{\int _\mathbb {R}\int _{(y>F_Y^{-1}(1-q))\cup (y<F_Y^{-1}(q))}f(y-ax-b,x)dxdy}dx \nonumber \\&=\frac{\int _{(y>F_Y^{-1}(1-q)-ax-b)\cup (y<F_Y^{-1}(q)-ax-b)}f(y,x)dy}{\int _\mathbb {R}\int _{(y>F_Y^{-1}(1-q)-ax-b)\cup (y<F_Y^{-1}(q)-ax-b)}f(y,x)dxdy}dx, \end{aligned}$$

where the third line was obtained by simple substitution. If \(\mu _X\) is the median of X, we omit the scaling denominator and compute

$$\begin{aligned}&P_q(\mu _X+x)-P_q(\mu _X-x)\\&\quad \propto \int _{(y>F_Y^{-1}(1-q)-a(\mu _X+x)-b)\cup (y<F_Y^{-1}(q)-a(\mu _X+x)-b)}f(y,\mu _X+x)dy\\&\qquad -\int _{(y>F_Y^{-1}(1-q)-a(\mu _X-x)-b)\cup (y<F_Y^{-1}(q)-a(\mu _X-x)-b)}f(y,\mu _X-x)dy \\&\quad = \int _{(y>F_Y^{-1}(1-q)-a(\mu _X+x)-b)\cup (y<F_Y^{-1}(q)-a(\mu _X+x)-b)}f(y,\mu _X+x)dy\\&\qquad -\int _{(y>F_Y^{-1}(1-q)-a(\mu _X-x)-b)\cup (y<F_Y^{-1}(q)-a(\mu _X-x)-b)}f(y,\mu _X+x)dy \\&\quad = \int _{F_Y^{-1}(1-q)-a(\mu _X+x)-b}^{F_Y^{-1}(1-q)-a(\mu _X-x)-b}f(y,\mu _X+x)dy \\&\qquad - \int _{F_Y^{-1}(q)-a(\mu _X+x)-b}^{F_Y^{-1}(q)-a(\mu _X-x)-b}f(y,\mu _X+x)dy. \end{aligned}$$

where in the second expression, we have used the symmetry \(f(z,\mu _X+x)=f(z,\mu _X-x)\). To prove the first point of the proposition, it suffices to show that the two terms are equal. By the symmetry of f, it suffices to show that the integration ranges of each integral are symmetric around zero, i.e., that the middle points of each range are equidistant from zero. These middle points are located at \(F_Y^{-1}(1-q)-a\mu _X-b\) and \(F_Y^{-1}(q)-a\mu _X-b\). If we assume \(\mu _Y=a\mu _X+b\), the sum of these two values is equal to

$$\begin{aligned} F_Y^{-1}(1-q)+F_Y^{-1}(q)-2(a\mu _X+b)=F_Y^{-1}(1-q)+F_Y^{-1}(q)-2\mu _Y=0, \end{aligned}$$

because \(F_Y^{-1}(1-q)\) and \(F_Y^{-1}(q)\) are located symmetrically around \(\mu _Y\). To complete the proof of the first point of the proposition, it suffices to show that indeed the medians satisfy \(\mu _Y=a\mu _X+b\).

$$\begin{aligned}&P[Y\le a\mu _X+b]-P[Y\ge a\mu _X+b]\\&\quad = \int _{-\infty }^{a\mu _X+b}\int _\mathbb {R}f(y-ax-b,x) dx dy -\int _{a\mu _X+b}^\infty \int _\mathbb {R}f(y-ax-b,x) dx dy\\&\quad =\int _\mathbb {R}\left( \int _{-\infty }^{a(\mu _X-x)}f(y,x) dy\right) dx -\int _\mathbb {R}\left( \int _{a(\mu _X-x)}^\infty f(y,x) dy\right) dx \\&\quad =\int _\mathbb {R}\left( \int _{-\infty }^{a(\mu _X-x)}f(y,x) dy\right) dx -\int _\mathbb {R}\left( \int ^{a(x-\mu _X)}_{-\infty } f(y,x) dy\right) dx \\&\quad =\int _\mathbb {R}\left( \int _{-\infty }^{\mu _X-y/a}f(y,x) dx\right) dy-\int _\mathbb {R}\left( \int _{\mu _X+y/a}^\infty f(y,x) dx\right) dy \\&\quad =0, \end{aligned}$$

where in the third equality we have used the symmetry in the first variable and in the last one, the symmetry in the second variable.

For the second point of the proposition, we must resort to a simplified reasoning to lighten the notations. The main function we are interested in is \(f(y-a(\mu _X+x)-b,\mu _X+x)\), and its integration range is \(y\notin \left( F_Y^{-1}(q),F_Y^{-1}(1-q)\right) \). If we substitute \(z=-y+2(a\mu _X+b)=-y+2\mu _Y\), the function, by symmetry in the first variable, becomes \(f(z-a(\mu _X-x)-b,\mu _X+x)\), and the integration range becomes \(z\notin \left( 2\mu _Y-F_Y^{-1}(q),2\mu _Y-F_Y^{-1}(1-q)\right) \). Since \(F_Y^{-1}(q)\) and \(F_Y^{-1}(1-q)\) are equidistant from \(\mu _Y\), a simplification occurs and the interval is simply inverted: \(z\notin \left( F_Y^{-1}(1-q),F_Y^{-1}(q)\right) \). This inversion is then compensated by the opposite sign between dy and dz. We are now ready to provide the full details. For notational convenience, we introduce the interval \(S_q=\left( F_Y^{-1}(1-q),F_Y^{-1}(q)\right) \).

$$\begin{aligned}&\mathcal {E}_q(\mu _X+x)-\mathcal {E}_q(\mu _X)=\frac{\int _{y\notin S_q}y f(y-a(\mu _X+x)-b,\mu _X+x) dy}{\int _{y\notin S_q} f(y-a(\mu _X+x)-b,\mu _X+x) dy}\\&\quad -\frac{\int _{y\notin S_q}y f(y-a\mu _X-b,\mu _X) dy}{\int _{y\notin S_q} f(y-a\mu _X-b,\mu _X) dy}\\&\quad =\frac{\int _{y\notin S_q}(2\mu _Y-z) f(z-a(\mu _X-x)-b,\mu _X+x) dz}{\int _{y\notin S_q} f(y-a(\mu _X+x)-b,\mu _X+x) dy}\\&\quad -\frac{\int _{y\notin S_q}(2\mu _Y-z) f(z-a\mu _X-b,\mu _X) dz}{\int _{y\notin S_q} f(y-a\mu _X-b,\mu _X) dy} \\&\quad =-\frac{\int _{y\notin S_q}z f(z-a(\mu _X-x)-b,\mu _X-x) dz}{\int _{y\notin S_q} f(y-a(\mu _X-x)-b,\mu _X-x) dy}+\frac{\int _{y\notin S_q}z f(z-a\mu _X-b,\mu _X) dz}{\int _{y\notin S_q} f(y-a\mu _X-b,\mu _X) dy} \\&\quad =-\mathcal {E}_q(\mu _X-x)+\mathcal {E}_q(\mu _X), \end{aligned}$$

where in the third equality, we have discarded the \(2\mu _Y\) terms which cancel out and have used two symmetry properties: that of the denominator which was proven above and that of the second variable in f.

Details on the quadratic form

We provide further insights on the case where \(g(x)=(x-b)^2\). Notably, we plot the functions that help shed light on the main issues. Essentially, our results are confirmed when looking at the objective function V defined in (16). Up to a term that does not depend on c (the quadratic term in \(z^2\) in \(\pi ^-\) and \(\pi ^+\) defined in (17) and (18)), it can be evaluated as

$$\begin{aligned} V_{\text {quad}}(c)=\sigma ^2_E+ \frac{4 - 15b + 15b^2 - 5c + 30bc - 45b^2c - 5c^2 + 45b^2c^2 + 5c^3 - 30bc^3 + 5c^4}{45} , \end{aligned}$$

and we plot the second part for four values of b in Fig. 16 below.

Fig. 16

Illustrations: On the left graph, we plot \(V_{\text {quad}}(c)\) defined in Eq. (36) with \(\sigma ^2_E=0\). On the right graph, we show the sample function \(g(x)=(x-0.4)^2\), along with the optimal splits

In all cases, the objective function immediately starts to decrease at zero. The main difference is that for \(b\in (1/3,2/3)\), the decrease is followed by a small bump; we observe it for \(b=0.4\) and \(b=0.45\). The cause of the bump is not visually obvious but can be summarized as follows. In the region from \(c=0.20\) to \(c=0.25\), the decrease in \(\pi ^+\) is slow and is more than compensated by the increase in \(\pi ^-\), hence the sum of the two increases slightly. However, past \(c=0.5\), the decrease in \(\pi ^+\) is sharp while the increase in \(\pi ^-\) is slower. Hence, the true minimum is located between 0.5 and 1 (0.786 in this case).

Lastly, we highlight that the bump implies a local maximum, which is located where the function h has a zero. This was treated in the proof of Proposition 1, in the case where \(m^-(c)-m^+(c)= 0\).
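These observations are easy to confirm numerically from Eq. (36) (our own grid search, with \(b=0.4\) and \(\sigma _E^2=0\)):

```python
import numpy as np

def V_quad(c, b):
    # Equation (36) with sigma_E^2 = 0
    return (4 - 15 * b + 15 * b ** 2 - 5 * c + 30 * b * c - 45 * b ** 2 * c
            - 5 * c ** 2 + 45 * b ** 2 * c ** 2 + 5 * c ** 3
            - 30 * b * c ** 3 + 5 * c ** 4) / 45

b = 0.4
cs = np.linspace(0.0, 1.0, 100_001)
vals = V_quad(cs, b)
c_star = cs[np.argmin(vals)]
print(c_star, vals[0], vals[-1])
```

The endpoints carry the same value (the unsplit dispersion), and the global minimum sits at c close to 0.786, past the small bump discussed above.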

Impact of filter on exponential generator

See Fig. 17.

Fig. 17

Clustering post filter: the case of exponential generators. We plot the data points and show the filter effect (\(q=0.15\)) with colors. The initial generator is the full black line while the one post-filter is in dotted lines. The rectangles show the homogeneous clusters: the dotted one (post-filter) goes further to the right and the split is closer to the median


See Table 6.

Table 6 List of the 108 features


Cite this article

Coqueret, G., Guida, T. Training trees on tails with applications to portfolio choice. Ann Oper Res 288, 181–221 (2020).
