
Comments on: A random forest guided tour


Abstract

This paper is a comment on the survey paper by Biau and Scornet (TEST, 2016. doi:10.1007/s11749-016-0481-7) about random forests. We focus on the problem of quantifying the impact of each ingredient of random forests on their performance. We show that such a quantification is possible for a simple pure forest, leading to conclusions that could apply more generally. Then, we consider “hold-out” random forests, which are a good middle ground between “toy” pure forests and Breiman’s original random forests.




Author information


Corresponding author

Correspondence to Sylvain Arlot.

Additional information

This comment refers to the invited paper available at: doi:10.1007/s11749-016-0481-7.

The research of the authors was partly supported by the French Agence Nationale de la Recherche (ANR 2011 BS01 010 01 projet Calibration). S. Arlot was also partly supported by Institut des Hautes Études Scientifiques (IHES, Le Bois-Marie, 35, route de Chartres, 91440 Bures-Sur-Yvette, France).

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 152 KB)

Appendices

Appendix 1: Approximation and estimation errors

We state a general decomposition of the risk of a forest having the X-property (that is, when partitions are built independently of \((Y_i)_{1 \le i \le n}\)), which we need to prove the results of Sect. 1 but which can be useful more generally. We assume that \({\mathbb {E}}[Y_i^2]<+\infty \) for all i.

For any random forest \(m_{M,n}\) having the X-property, following Sections 2 and 3.2 of Biau and Scornet’s survey, we can write

$$\begin{aligned} \begin{array}{c} \displaystyle m_{M,n} ({\mathbf {x}};{\varTheta }_{1 \ldots M}, {\mathscr {D}}_n) =\sum _{i=1}^n W_{ni}({\mathbf {x}}) Y_i\\ \displaystyle \quad \text {where} \quad W_{ni}({\mathbf {x}}) = W_{ni}({\mathbf {x}} ;{\varTheta }_{1 \ldots M} , X_{1 \ldots n}) = \frac{1}{M} \sum _{j=1}^M \frac{C_i({\varTheta }_j) {\mathbf {1}}_{X_i \in A_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n})}}{N_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n})}, \end{array} \end{aligned}$$
(1)

Here, \(C_i({\varTheta }_j)\) is the number of times \((X_i,Y_i)\) appears in the j-th resample, \(A_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n})\) is the cell containing \({\mathbf {x}}\) in the j-th tree, and

$$\begin{aligned} N_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n}) = \sum _{i=1}^n C_i({\varTheta }_j) {\mathbf {1}}_{X_i \in A_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n})}. \end{aligned}$$
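For illustration only (this sketch is not part of the original paper), the weights of Eq. (1) at a point \({\mathbf {x}}\) could be computed as follows in R, given hypothetical inputs: a matrix cell of cell labels of the \(X_i\) in each tree, the labels cell_x of the cell containing \({\mathbf {x}}\) in each tree, and a matrix C of resample counts \(C_i({\varTheta }_j)\).

```r
## Minimal sketch of the weights in Eq. (1) at one point x (illustrative inputs).
## 'cell': n x M matrix, cell label of X_i in tree j; 'cell_x': label of the cell of x
## in each tree; 'C': n x M matrix of resample counts C_i(Theta_j).
forest_weights <- function(cell, cell_x, C) {
  n <- nrow(cell); M <- ncol(cell)
  W <- numeric(n)
  for (j in seq_len(M)) {
    in_cell <- C[, j] * (cell[, j] == cell_x[j])  # C_i(Theta_j) * 1{X_i in A_n(x; Theta_j)}
    N <- sum(in_cell)                             # N_n(x; Theta_j; X_{1..n})
    if (N > 0) W <- W + in_cell / N               # empty cells are skipped (usual convention)
  }
  W / M                                           # the weights W_{ni}(x), i = 1, ..., n
}
```

The forest prediction of Eq. (1) at \({\mathbf {x}}\) is then the scalar product of the returned weights with \((Y_i)_{1 \le i \le n}\).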

Now, let us define

$$\begin{aligned} m^{\star }_{M,n} ({\mathbf {x}};{\varTheta }_{1 \ldots M}, X_{1 \ldots n})= & {} {\mathbb {E}}\Bigl [ m_{M,n} ({\mathbf {x}};{\varTheta }_{1 \ldots M}, {\mathscr {D}}_n) \, \big | \,X_{1 \ldots n}, {\varTheta }_{1 \ldots M} \Bigr ]\\= & {} \sum _{i=1}^n W_{ni}({\mathbf {x}};{\varTheta }_{1 \ldots M} , X_{1 \ldots n}) m(X_i)\\ \text {and}\quad {\overline{m}}^{\star }_{M,n} ({\mathbf {x}};{\varTheta }_{1 \ldots M})= & {} {\mathbb {E}}\Bigl [ m^{\star }_{M,n} ({\mathbf {x}};{\varTheta }_{1 \ldots M}, X_{1 \ldots n}) \, \big | \,{\varTheta }_{1 \ldots M} \Bigr ]. \end{aligned}$$

By definition of the conditional expectation (the cross terms vanish by the tower property), we can decompose the risk of \(m_{M,n}\) at \({\mathbf {x}}\) into three terms:

$$\begin{aligned} {\mathbb {E}}\Bigl [ \bigl ( m_{M,n}({\mathbf {x}}) - m ({\mathbf {x}}) \bigr )^2 \Bigr ]= & {} \underbrace{ {\mathbb {E}}\Bigl [ \bigl ( {\overline{m}}^{\star }_{M,n} ({\mathbf {x}}) - m({\mathbf {x}}) \bigr )^2 \Bigr ] }_{A =\mathrm{approximation}~\mathrm{error}}\nonumber \\&+ \underbrace{ {\mathbb {E}}\Bigl [ \bigl ( m^{\star }_{M,n} ({\mathbf {x}}) - {\overline{m}}^{\star }_{M,n}({\mathbf {x}}) \bigr )^2 \Bigr ] }_{{\varDelta }} \nonumber \\&+ \underbrace{ {\mathbb {E}}\Bigl [ \bigl ( m_{M,n}({\mathbf {x}}) - m^{\star }_{M,n} ({\mathbf {x}}) \bigr )^2 \Bigr ] }_{E = \mathrm{estimation}~\mathrm{error}}. \end{aligned}$$
(2)

In the fixed-design regression setting (where the \(X_i\) are deterministic), A is called approximation error, \({\varDelta }=0\), and E is called estimation error. Things are a bit more complicated in the random-design setting—when \((X_i,Y_i)_{1 \le i \le n}\) are independent and identically distributed—since \({\varDelta } \ne 0\) in general. Up to minor differences related to how \(m_{n}\) is defined on empty cells, A is still the approximation error, and the estimation error is \({\varDelta } + E\).

Let us finally assume that \((X_i,Y_i)_{1 \le i \le n}\) are independent and define

$$\begin{aligned} \sigma ^2(X_i) = {\mathbb {E}}{}\left[ \bigl ( m(X_i) - Y_i \bigr )^2 \, \big | \,X_i \right] {}. \end{aligned}$$

Then, since the weights \(W_{ni}({\mathbf {x}})\) only depend on \({\mathscr {D}}_n\) through \(X_{1 \ldots n}\), while the \(Y_i - m(X_i)\) are centered and independent conditionally on \(X_{1 \ldots n}\), we have the following formula for the estimation error:

$$\begin{aligned} E = {\mathbb {E}}{}\left[ {} \left( \sum _{i=1}^n W_{ni}({\mathbf {x}}) \bigl ( m(X_i) - Y_i \bigr ) \right) ^2 {} \right] {} = {\mathbb {E}}{}\left[ \sum _{i=1}^n W_{ni}({\mathbf {x}})^2 \sigma ^2(X_i) \right] {}. \end{aligned}$$

For instance, in the homoscedastic case, \(\sigma ^2(X_i) \equiv \sigma ^2\) and

$$\begin{aligned} E = {\mathbb {E}}{}\left[ {} \left( \sum _{i=1}^n W_{ni}({\mathbf {x}}) \bigl ( m(X_i) - Y_i \bigr ) \right) ^2 {} \right] {} = \sigma ^2 {\mathbb {E}}{}\left[ \sum _{i=1}^n W_{ni}({\mathbf {x}})^2 \right] {}. \end{aligned}$$
(3)
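For illustration (this remark is not needed for the sequel), consider a single tree (\(M=1\)) built without resampling, so that \(C_i \equiv 1\) and \(W_{ni}({\mathbf {x}}) = {\mathbf {1}}_{X_i \in A_n({\mathbf {x}})} / N_n({\mathbf {x}})\) whenever the cell \(A_n({\mathbf {x}})\) contains at least one observation. Then

$$\begin{aligned} \sum _{i=1}^n W_{ni}({\mathbf {x}})^2 = \frac{N_n({\mathbf {x}})}{N_n({\mathbf {x}})^2} = \frac{1}{N_n({\mathbf {x}})} , \end{aligned}$$

so by Eq. (3) the estimation error of a single tree is (up to the empty-cell event) \(\sigma ^2 \, {\mathbb {E}}[ 1 / N_n({\mathbf {x}}) ]\): it is driven by the inverse of the number of observations in the cell of \({\mathbf {x}}\).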

Appendix 2: Analysis of the toy forest: proofs

We prove the results stated in Sect. 1 for the one-dimensional toy forest.

Since the toy forest is purely random, all results of Appendix 1 apply, with \({\varTheta } = (T,I)\) and \(C_i({\varTheta }) = {\mathbf {1}}_{i \in I}\). It remains to compute the three terms of Eq. (2).

Since we assume m is of class \({\mathscr {C}}^3\), we can use the results of Arlot and Genuer (2014, Section 4) for the approximation error A (up to minor differences in the definition of \({\overline{m}}^{\star }_{M,n} ({\mathbf {x}})\), due to the event where \(A_n({\mathbf {x}};{\varTheta })\) is empty, which has small probability since \(a \gg k\)). We assume that \(m'({\mathbf {x}}) \ne 0\) and \(m''({\mathbf {x}}) \ne 0\) for simplicity, so that the quantities appearing in Table 1 indeed give the order of magnitude of A.

The middle term \({\varDelta }\) in decomposition (2) is negligible compared to E for a single tree, which can be proved using results from Arlot (2008), as soon as \(m'({\mathbf {x}}) / k \ll \sigma \) and \(a \gg k\). We assume that it can also be neglected for an infinite forest.

For the estimation error, we can use Eq. (3) and the following arguments. First, for every \(i \in \{1, \ldots , n\}\), \(X_i\) belongs to \(A_n({\mathbf {x}};{\varTheta })\) with probability 1/k. Combining this with the subsampling process, we get that

$$\begin{aligned} N_n({\mathbf {x}};{\varTheta }; X_{1 \ldots n}) \sim {\mathscr {B}}{}\left( n , \frac{a}{n k} \right) {} \end{aligned}$$

is close to its expectation a/k with probability close to one if \(a/k \gg \log (n)\). Assuming that this holds simultaneously for a large fraction of the subsamples, we get the approximation

$$\begin{aligned} W_{ni}^{\mathrm {toy}}({\mathbf {x}})= & {} \frac{1}{M} \sum _{j=1}^M \frac{{\mathbf {1}}_{i \in I_j} {\mathbf {1}}_{X_i \in A_n({\mathbf {x}}; {\varTheta }_j)}}{N_n({\mathbf {x}}; {\varTheta }_j ; X_{1 \ldots n})}\nonumber \\\approx & {} \frac{k}{a} \frac{1}{M} \sum _{j=1}^M {\mathbf {1}}_{i \in I_j} {\mathbf {1}}_{X_i \in A_n({\mathbf {x}}; {\varTheta }_j)} =: {\widetilde{W}}_{ni}^{\mathrm {toy}}({\mathbf {x}}). \end{aligned}$$
(4)

Now, we note that conditionally on \(X_{1 \ldots n}\), the variables \({\mathbf {1}}_{i \in I_j} {\mathbf {1}}_{X_i \in A_n({\mathbf {x}}; {\varTheta }_j)}\), \(j=1, \ldots , M\), are independent Bernoulli variables with common parameter

$$\begin{aligned} \frac{a}{n} \times \bigl ( 1-k|X_i-x| \bigr )_+. \end{aligned}$$

Therefore, since \({\mathbb {E}}\bigl [ \bigl ( \frac{1}{M} \sum _{j=1}^M B_j \bigr )^2 \bigr ] = \bigl ( 1 - \frac{1}{M} \bigr ) p^2 + \frac{p}{M}\) for independent Bernoulli variables \(B_1, \ldots , B_M\) with parameter p,

$$\begin{aligned} {\mathbb {E}}{}\left[ {\widetilde{W}}_{ni}^{\mathrm {toy} }({\mathbf {x}})^2 \, \big | \, X_{1 \ldots n} \right] {}= & {} \frac{k^2 }{ n a} {}\left[ {}\left( 1 - \frac{1}{M} \right) {} \frac{a}{n} \Bigl ( \bigl ( 1-k|X_i-x| \bigr )_+ \Bigr )^2 \right. \\&+\left. \frac{1}{M} \bigl ( 1-k|X_i-x| \bigr )_+ \right] {}\\ \text {hence} \quad {\mathbb {E}}{}\left[ {\widetilde{W}}_{ni}^{\mathrm {toy} }({\mathbf {x}})^2 \right] {}= & {} \frac{k}{n a} {}\left[ {}\left( 1 - \frac{1}{M} \right) {} \frac{2 a}{3 n} + \frac{1}{M} \right] {}. \end{aligned}$$

By Eq. (3), this ends the proof of the results in the bottom line of Table 1.

Similar arguments justify the top line of Table 1, where \(T_j=0\) almost surely.

Note that we have not given a fully rigorous proof of the results shown in Table 1, because of the approximation (4) and of the term \({\varDelta }\) that we have neglected. We are convinced that the parts of the proof that we have skipped would only require adding some technical assumptions, which would not help us reach our goal of better understanding random forests in general.
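As a small numerical sanity check (not part of the argument above), the following R simulation sketch estimates the probability that a given observation is both subsampled and falls in the cell of \({\mathbf {x}}\), to be compared with the Bernoulli parameter \(\frac{a}{n} ( 1-k|X_i-x| )_+\) used above. It assumes, as in Arlot and Genuer (2014), that the toy-forest partition is the regular grid of k intervals of [0, 1] shifted by \(T \sim {\mathscr {U}}([0,1/k))\), and that subsamples I of size a are drawn without replacement.

```r
## Minimal simulation sketch, assuming a uniformly shifted regular grid of k cells
## and subsamples of size a drawn without replacement.
set.seed(1)
n <- 200; a <- 50; k <- 10; M <- 2e4
x <- 0.5; Xi <- 0.53                               # the point x and the position of X_i
freq <- mean(replicate(M, {
  Tshift <- runif(1, 0, 1 / k)                     # uniform random shift of the grid
  same_cell <- floor((Xi - Tshift) * k) == floor((x - Tshift) * k)
  sampled <- 1 %in% sample.int(n, a)               # is observation i = 1 in the subsample I?
  same_cell && sampled
}))
freq                                               # empirical frequency over M trees
(a / n) * max(0, 1 - k * abs(Xi - x))              # theoretical parameter, here 0.175
```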

Appendix 3: Details about the experiments

This section describes the experiments whose results are shown in Sect. 2.

Data generation process We take \({\mathscr {X}}= [0,1]^{p}\), with \(p \in \{ 5, 10 \}\). Table 2 only shows the results for \(p=5\); results for \(p=10\) are shown in the supplementary material.

The data \((X_i,Y_i)_{1 \le i \le n_1 + n_2}\) are independent with the same distribution: \(X_i \sim {\mathscr {U}}([0,1]^{p})\), \(Y_i = m(X_i) + \varepsilon _i\) with \(\varepsilon _i \sim {\mathscr {N}}(0,\sigma ^2)\) independent of \(X_i\), \(\sigma ^2 = 1/16\), and the regression function m is defined by

$$\begin{aligned} m : {\mathbf {x}} \in [0,1]^{p} \mapsto \mathbf {1/10} \times [10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5]. \end{aligned}$$

The function m is proportional to the Friedman1 function, which was introduced by Friedman (1991). Note that when \(p>5\), m only depends on the first 5 coordinates of \({\mathbf {x}}\).

Then, the two subsamples are defined by \({\mathscr {D}}_{n_1}^1= (X_i,Y_i)_{1 \le i \le n_1}\) and \({\mathscr {D}}_{n_2}^2= (X_i,Y_i)_{n_1 + 1 \le i \le n_1+n_2}\).

We always take \(n_1 = 1280\) and \(n_2 = 25{,}600\).
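For illustration only, a minimal R sketch of this data generation process could read as follows (the object names m, D1 and D2 are illustrative, not those of the code actually used for the experiments).

```r
set.seed(42)                                         # arbitrary seed, for reproducibility
p <- 5; n1 <- 1280; n2 <- 25600; sigma <- sqrt(1/16)
m <- function(x) (10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2 +
                  10 * x[4] + 5 * x[5]) / 10         # scaled Friedman1 function
X <- matrix(runif((n1 + n2) * p), ncol = p)          # X_i ~ U([0,1]^p)
colnames(X) <- paste0("x", 1:p)
Y <- apply(X, 1, m) + rnorm(n1 + n2, sd = sigma)     # Y_i = m(X_i) + eps_i
D1 <- list(X = X[1:n1, ], Y = Y[1:n1])                               # D^1_{n1}
D2 <- list(X = X[(n1 + 1):(n1 + n2), ], Y = Y[(n1 + 1):(n1 + n2)])   # D^2_{n2}
```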

Trees and forests For each \(k \in \{2^5, 2^6, 2^7, 2^8\}\) and each experimental condition (bootstrap or not, \({\mathtt {mtry}}=p\) or \(\lfloor p/3 \rfloor \)), we build hold-out random trees and forests as defined in Sect. 2. These are built with the randomForest R package (Liaw and Wiener 2002; R Core Team 2015), with appropriate parameters (the number of leaves k is controlled by \({\mathtt {maxnodes}}\), while \({\mathtt {nodesize}}=1\)).

Resampling within \({\mathscr {D}}_{n_1}^1\) (when there is some resampling) is done with a bootstrap sample of size \(n_1\) (that is, with replacement and \(a_{n_1} = n_1\)).

“Large” forests are made of \(M=k\) trees (a number of trees suggested by Arlot and Genuer 2014).
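For illustration only, a minimal R sketch of the corresponding randomForest call, reusing D1 from the previous sketch, could read as follows (one experimental condition shown; the hold-out weights computed from \({\mathscr {D}}_{n_2}^2\) appear in a later sketch).

```r
library(randomForest)
k <- 2^7                                        # number of leaves per tree
forest <- randomForest(x = D1$X, y = D1$Y,
                       ntree    = k,            # "large" forest: M = k trees
                       replace  = TRUE,         # bootstrap within D^1_{n1}; replace = FALSE
                       sampsize = nrow(D1$X),   #   with sampsize = n1 disables resampling
                       mtry     = floor(ncol(D1$X) / 3),  # or mtry = p
                       nodesize = 1,
                       maxnodes = k)            # k is controlled by maxnodes
```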

Estimates of approximation and estimation error Estimating the approximation and estimation errors (as defined by Eq. (2)) requires estimating expectations over \({\varTheta }\) (which includes the randomness of \({\mathscr {D}}_{n_1}^1\) as well as the randomness of the choice of bootstrap subsamples of \({\mathscr {D}}_{n_1}^1\) and of the repeated choices of a subset \({\mathscr {M}}_{\mathrm {try}}\)). This is done with a Monte-Carlo approximation, with 500 replicates for trees and 10 replicates for forests. The latter number might seem small, but we observe that large forests are quite stable, so their expectations can be evaluated precisely from a small number of replicates.

We estimate the approximation error (integrated over \({\mathbf {x}}\)) as follows. For each partition that we build, we compute the corresponding “ideal” tree, which maps each piece of the partition to the average of m over it (this average can be computed almost exactly from the definition of m). Then, to each forest we associate the “ideal” forest \({\overline{m}}^{\star }_{M,n}\) which is the average of the ideal trees. We can thus compute \(( {\overline{m}}^{\star }_{M,n} ({\mathbf {x}}) - m({\mathbf {x}}) )^2\) for any \({\mathbf {x}} \in {\mathscr {X}}\), and estimate its expectation with respect to \({\varTheta }\). Averaging these estimates over 1000 uniform random points \({\mathbf {x}} \in {\mathscr {X}}\) provides our estimate of the approximation error.
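For illustration only, a minimal R sketch of this construction, reusing forest, m and X from the previous sketches, could read as follows; it relies on the terminal-node indicators returned by predict(..., nodes = TRUE) and approximates the average of m over each cell by Monte-Carlo over a large uniform sample Z, instead of the near-exact computation mentioned above.

```r
Z <- matrix(runif(2e5 * p), ncol = p); colnames(Z) <- colnames(X)   # points used to average m over cells
mZ <- apply(Z, 1, m)
x_eval <- matrix(runif(1000 * p), ncol = p); colnames(x_eval) <- colnames(X)
nodes_Z <- attr(predict(forest, Z, nodes = TRUE), "nodes")          # terminal node of each row of Z, per tree
nodes_x <- attr(predict(forest, x_eval, nodes = TRUE), "nodes")
ideal_trees <- sapply(seq_len(ncol(nodes_Z)), function(j) {
  cell_means <- tapply(mZ, nodes_Z[, j], mean)                      # "ideal" tree: average of m over each cell
  cell_means[as.character(nodes_x[, j])]                            # its value at the evaluation points
})
ideal_forest <- rowMeans(ideal_trees)                               # "ideal" forest = average of ideal trees
approx_error <- mean((ideal_forest - apply(x_eval, 1, m))^2)        # one Monte-Carlo replicate of A
```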

We estimate the estimation error (integrated over \({\mathbf {x}}\)) from Eq. (3); since \(\sigma ^2\) is known, we focus on the remaining term. Given some hold-out random forest, for any \({\mathbf {x}} \in {\mathscr {X}}\) and any \((X_i,Y_i) \in {\mathscr {D}}_{n_2}^2\), we can compute

$$\begin{aligned} W_{ni}({\mathbf {x}}) = \frac{1}{M} \sum _{j=1}^M \frac{{\mathbf {1}}_{X_i \in A_{n_1}({\mathbf {x}} ; {\varTheta }_j , {\mathscr {D}}_{n_1}^1) }}{N_{n_2}( {\mathbf {x}} ; {\varTheta }_j, {\mathscr {D}}_{n_1}^1, {\mathscr {D}}_{n_2}^2) }. \end{aligned}$$

Then, averaging \(\sum _i W_{ni}({\mathbf {x}})^2\) over several replicate trees/forests and over \(1\,000\) uniform random points \({\mathbf {x}} \in {\mathscr {X}}\), we get an estimate of the estimation error (divided by \(\sigma ^2\)).
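For illustration only, a minimal R sketch of this computation, reusing forest, D2 and x_eval from the previous sketches, could read as follows (it is slow but follows the formula above directly).

```r
nodes_D2 <- attr(predict(forest, D2$X, nodes = TRUE), "nodes")   # cells of the X_i of D^2_{n2}, per tree
nodes_x  <- attr(predict(forest, x_eval, nodes = TRUE), "nodes")
sum_W2 <- sapply(seq_len(nrow(x_eval)), function(ix) {
  W <- numeric(nrow(D2$X))
  for (j in seq_len(ncol(nodes_D2))) {
    in_cell <- nodes_D2[, j] == nodes_x[ix, j]    # 1{X_i in A_{n1}(x; Theta_j, D^1_{n1})}
    N <- sum(in_cell)                             # N_{n2}(x; Theta_j, D^1_{n1}, D^2_{n2})
    if (N > 0) W <- W + in_cell / N
  }
  sum((W / ncol(nodes_D2))^2)                     # sum_i W_{ni}(x)^2
})
estim_error <- sigma^2 * mean(sum_W2)             # estimate of E from Eq. (3), one replicate
```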

Summarizing the results in Table 2 Given the estimates of the (integrated) approximation and estimation errors that we obtain for every \(k \in \{2^5, 2^6, 2^7, 2^8\}\), we plot each kind of error as a function of k (in \(\mathrm {log}_2\)-\(\mathrm {log}_2\) scale for the approximation error), and we fit a simple linear model (with an intercept). The estimated parameters of the model directly give the results shown in Table 2 (in which the value of the intercept for the estimation error is omitted for simplicity). The corresponding graphs are shown in supplementary material.
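For illustration only, a minimal R sketch of this last step could read as follows, assuming hypothetical vectors approx_errors and estim_errors that collect the estimates above for each value of k.

```r
## 'approx_errors' and 'estim_errors': hypothetical vectors of estimates, one entry per k
k_values <- 2^(5:8)
fit_A <- lm(log2(approx_errors) ~ log2(k_values))  # log2-log2 fit for the approximation error
fit_E <- lm(estim_errors ~ k_values)               # linear fit (with intercept) for the estimation error
coef(fit_A); coef(fit_E)                           # slopes/intercepts reported in Table 2
```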


About this article


Cite this article

Arlot, S., Genuer, R. Comments on: A random forest guided tour. TEST 25, 228–238 (2016). https://doi.org/10.1007/s11749-016-0484-4


