1 Introduction

Ensemble algorithms offer state-of-the-art performance in many applications and often outperform single classifiers by a large margin. With the ongoing integration of embedded systems and machine learning models into our everyday life, e.g., in the form of the Internet of Things, the hardware platforms which execute ensembles must also be taken into account when training ensembles.

From a hardware perspective, a small ensemble with minimal execution time and a small memory footprint is desired. Moreover, learning theory indicates that ensembles of small models should generalize better, making them ideal candidates for small, resource-constrained devices (Koltchinskii 2002; Cortes et al. 2014). Practical problems, on the other hand, often require ensembles of complex base learners to achieve good results, and some ensemble techniques such as Random Forest (RF) even prefer individual trees to be as large as possible, leading to overall large ensembles (Breiman 2000; Biau 2012; Denil et al. 2014; Biau and Scornet 2016). As depicted in Table 1, microcontroller units (MCUs) for IoT devices typically only offer a few KB to a few MB of memory, while RF models can easily grow beyond that limit (see Tables 4 and 5). Hence, to deploy RF onto these small devices, we require an algorithm that trains good models for a variety of different memory constraints.

Ensemble pruning is a standard technique for transforming a large, memory-hungry ensemble into a smaller ensemble that can be deployed onto a small device by removing classifiers from it (Tsoumakas et al. 2009; Zhang et al. 2006). Remarkably, this removal can sometimes lead to a better predictive performance (Margineantu and Dietterich 1997; Martínez-Muñoz and Suárez 2006; Li et al. 2012), yielding smaller and better ensembles at the same time. Similarly, leaf-refinement (LR) is a technique that jointly refines the probability estimates in a given tree ensemble (Ren et al. 2015; Buschjäger and Morik 2021) to improve its performance. Hence, with leaf-refinement, it is possible to refine smaller forests, i.e., a Random Forest with only a few trees, so that it achieves a performance comparable to larger forests, i.e., an RF with many trees (c.f. Ren et al. 2015 and Table 4). We ask whether we can combine both approaches into a single algorithm that jointly removes unnecessary classifiers from a tree ensemble while further improving its performance by refining the probability estimates in the leaf nodes. To do so, we incorporate \(L_1\) regularization into the leaf-refinement objective and adopt proximal gradient descent to solve this objective. Our contributions are as follows:

  • Unified objective for leaf-refinement and pruning: We present a novel optimization objective that leverages \(L_1\) regularization to select only a few trees from the ensemble while jointly refining the probability estimates in all trees.

  • Pruning via proximal gradient descent (L1 + LR): We present a new algorithm that uses proximal gradient descent to refine the probability estimates of the tree ensemble while pruning it by minimizing our novel objective.

  • Experiments: We study the performance of our algorithm on 20 datasets and compare it against 8 state-of-the-art methods. We conduct over 13,200 experiments with a variety of different hyperparameter configurations and show that our L1 + LR method has the statistically significant best performance in terms of accuracy and \(F_1\) score. Moreover, we show that leaf-refinement, including our L1 + LR method, has the best performance when less than 768 KB of memory is available. Last, we present a real-world use case in the context of IoT warehousing and highlight how our method can be applied to deploy tree ensembles to tiny, ultra-low-power IoT devices.

Table 1 Typical microcontroller units (MCUs) found in edge and IoT devices

The paper is organized as follows. Section 2 presents our notation and related work, including detailed explanations on ensemble pruning and leaf-refinement. In Sect. 3 we present our combination of ensemble pruning and leaf-refinement into a single objective and present a novel algorithm that solves this objective. Section 4 contains the experimental analysis, including a real-world use case in IoT Warehousing. Section 5 concludes the paper.

2 Background and notation

We consider a supervised learning setting in which training and test points are drawn i.i.d. according to some distribution \({\mathcal {D}}\) over the input space \({\mathcal {X}} \subseteq {\mathbb {R}}^d\) of d-dimensional feature vectors and labels \({\mathcal {Y}} \subseteq {\mathbb {R}}^C\). For classification problems with \(C \ge 2\) classes we encode each label as a one-hot vector \(y = (0,\dots ,0,1,0,\dots ,0)\) which contains a ‘1’ at coordinate c for label \(c \in \{0,\dots ,C-1\}\).

We assume that we are given an already trained additive tree ensemble \(\{h_1,\dots ,h_M\}\) of M axis-aligned decision trees (DT) with the following decision function:

$$\begin{aligned} f(x) = \frac{1}{M}\sum _{i=1}^M h_i(x) \end{aligned}$$
(1)

A DT partitions the input space \({\mathcal {X}}\) into d-dimensional hypercubes called leaves and uses independent predictions for each leaf in the tree. To do so, it uses a series of axis-aligned splits of the form \(\mathbbm {1}\{x_i \le t\}\) and \(\mathbbm {1}\{x_i > t\}\), where i is a pre-computed feature index and t is a pre-computed threshold, to determine the leaf nodes. Each leaf node l contains a probability estimate \({{\widehat{y}}}_l \in {\mathbb {R}}^C\) computed from the class frequencies of the training points that fall into that leaf node. Let \(s_l(x) :{\mathcal {X}} \rightarrow \{0,1\}\) be an indicator function that is ‘1’ if x belongs to leaf l and ‘0’ if not, and let L be the total number of leaf nodes in the tree, then the prediction of a tree is given by

$$\begin{aligned} h(x) = \sum _{l = 1}^{L} {{\widehat{y}}}_{l} s_{l}(x) \end{aligned}$$
(2)

Note that, per tree construction, exactly one leaf node is visited per example so that \(s_{l}(x) = 1\) for exactly one \(l\in \{1,\dots ,L\}\), whereas the remaining indicators evaluate to zero. Hence, h(x) effectively evaluates to \({{\widehat{y}}}_{l}\), where l is the leaf node corresponding to the input x.

In this paper, we assume that we are given an already trained forest of DTs, but we do not assume that any specific training algorithm was used to train it. For example, the forest can be a Random Forest (Breiman 2001), a forest of boosted decision trees (Schapire and Freund 2012), etc. For simplicity, we assume that each tree in the ensemble is equally weighted. If the forest is weighted (e.g. as in AdaBoost) so that each classifier \(h'_i\) has a corresponding weight \(w_i\), then we re-scale the individual classifiers’ predictions to include the weight. To do so, we scale the probability estimates in all leaves of each tree \(h'_i\) by \(M \cdot w_i\), so that

$$\begin{aligned} f(x) = \sum _{i=1}^M w_i {h'}_i(x) = \frac{1}{M} \sum _{i=1}^M M w_i {h'}_i(x)= \frac{1}{M}\sum _{i=1}^M h_i(x) \end{aligned}$$
(3)

In addition to the trained ensemble, we receive a labeled pruning sample \({\mathcal {S}} = \{(x_i,y_i) \mid i=1,\dots ,N\}\). This sample can either be the original training data used to train f or another pruning data set not related to the training or test data. In this paper, we will focus on classification problems, but note that our approach is also directly applicable to regression tasks. Moreover, we will focus on Random Forests (RF), but note that most of our discussion directly translates to other tree ensembles such as Bagging (Breiman 1996), ExtraTrees (Geurts et al. 2006), Random Subspaces (Ho 1998) or Random Patches (Louppe and Geurts 2012).

2.1 Ensemble pruning

The goal of ensemble pruning is to select a subset of K classifiers from \(\{h_1,\dots ,h_M\}\) that forms a small and accurate subensemble. Formally, each classifier \(h_i\) receives a corresponding pruning weight \(w_i \in \{0,1\}\) so that the ensemble’s prediction can be expressed as

$$\begin{aligned} f(x) = \frac{1}{\Vert w \Vert _0}\sum _{i=1}^M w_i h_i(x) \end{aligned}$$
(4)

where \(\Vert w \Vert _0 = \sum _{i=1}^M 1\{w_i > 0\}\) is the \(L_0\) norm that counts the number of nonzero entries in the weight vector \(w = (w_1,\dots ,w_M)\). Many effective ensemble pruning methods have been proposed in the literature. These methods usually differ in the specific loss function used to measure the performance of a subensemble and the way this loss is minimized. Tsoumakas et al. (2009) give a detailed taxonomy of pruning methods that was later expanded in Zhou (2012).

Ranking-based pruning: Early works on ensemble pruning focus on ranking-based approaches that assign a rank to each classifier depending on its individual performance and then select the top K classifiers from that ranking. Formally, ranking-based approaches use the following optimization problem:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{w\in \{0,1\}^M} \frac{1}{N} \sum _{(x,y) \in {\mathcal {S}}} \sum _{i=1}^M w_i \ell (h_i(x),y) ~\text{ st. }~ \Vert w \Vert _0 = K \end{aligned}$$
(5)

where \(\ell :{\mathbb {R}}^C \times {\mathcal {Y}} \rightarrow {\mathbb {R}}\) is a loss function. To solve this objective, the following approach can be used: first, the individual losses \(\ell (h_i(x),y)\) are computed and averaged over the pruning sample for each classifier, and the classifiers are sorted by this average loss. Then, the K models with the smallest losses are selected, and their corresponding weights are set to 1. The remaining weights are set to 0. This makes ranking-based pruning methods appealing since they are very fast, easy to implement, and the optimum is easily obtained. One of the earliest ranking-based pruning methods is due to Margineantu and Dietterich, who employ the Cohen-Kappa statistic to rate the effectiveness of each classifier (Margineantu and Dietterich 1997). Later, Martinez-Munoz and Suarez proposed the cosine similarity to measure how close the subensemble’s prediction is to that of the full ensemble (Martínez-Muñoz and Suárez 2006). More recent approaches also incorporate the ensemble’s diversity into the selection. Lu et al. propose to measure the individual contribution of each classifier to form a diverse and effective subensemble (Lu et al. 2010), and Guo et al. propose to directly maximize the classification margin as well as the diversity of the subensemble (Guo et al. 2018).
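For illustration, the following Python sketch implements this ranking-based selection for objective (5). It assumes scikit-learn-style trees that expose a predict_proba method and uses the misclassification rate as the loss \(\ell \); the function name is illustrative and not part of PyPruning.

```python
import numpy as np

def rank_based_pruning(trees, X, y, K):
    """Sketch of objective (5): keep the K trees with the smallest
    individual loss (here: misclassification rate) on the pruning sample."""
    losses = []
    for h in trees:
        y_pred = np.argmax(h.predict_proba(X), axis=1)
        losses.append(np.mean(y_pred != y))
    selected = np.argsort(losses)[:K]     # K smallest individual losses
    w = np.zeros(len(trees), dtype=int)
    w[selected] = 1
    return w
```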

Mixed Quadratic Integer Programming (MQIP): MQIP-based pruning methods enhance ranking-based methods by also adding a pairwise loss function that measures the relationship between two classifiers \(h_i\) and \(h_j\). Formally, they use the following objective:

$$\begin{aligned} \mathop {\mathrm{arg\,min}}\limits _{w\in \{0,1\}^M}&\quad \frac{1}{N}\sum _{(x,y) \in {\mathcal {S}}} \left( \alpha \sum _{i=1}^M w_i \ell _1(h_i(x),y) + (1 - \alpha )\cdot \sum _{i=1}^M \sum _{j=1}^M w_i w_j \ell _2(h_i(x),h_j(x), y) \right) \text {~st.~} \Vert w \Vert _0 = K \end{aligned}$$
(6)

where \(\alpha \in [0,1]\) models the trade-off between the two losses \(\ell _1 :{\mathbb {R}}^C \times {\mathcal {Y}} \rightarrow {\mathbb {R}}\) and \(\ell _2 :{\mathbb {R}}^C \times {\mathbb {R}}^C \times {\mathcal {Y}} \rightarrow {\mathbb {R}}\). Here, \(\ell _1\) is again a loss function that relates the predictions of each classifier to the true label, and \(\ell _2\) is a loss that relates the predictions of two classifiers \(h_i(x)\) and \(h_j(x)\) to each other and potentially also to the true label y. Note that MQIP encapsulates ranking-based methods and recovers them for \(\alpha = 1\). However, also note that solving MQIP problems can be difficult and often takes much more time compared to, e.g., ranking-based approaches. Originally, this approach was proposed by Zhang et al. (2006), who use the pairwise errors of each classifier and \(\alpha = 0\) (\(\ell _1\) is not used). Cavalcanti et al. (2016) expand this idea and combine 5 different measures into \(\ell _1\) and \(\ell _2\), including the diversity, correlation, kappa-statistic, disagreement, and double fault measure.
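To make the structure of objective (6) concrete, the following sketch evaluates it by brute force, using the misclassification rate as \(\ell _1\) and the pairwise double-fault measure as one possible choice for \(\ell _2\). Enumerating all subsets is only feasible for very small M; a real implementation would hand the quadratic objective to an MQIP solver instead.

```python
import numpy as np
from itertools import combinations

def mqip_bruteforce(trees, X, y, K, alpha=0.5):
    """Illustrative brute-force evaluation of objective (6) for tiny M."""
    M = len(trees)
    preds = [np.argmax(h.predict_proba(X), axis=1) for h in trees]
    l1 = np.array([np.mean(p != y) for p in preds])             # individual losses
    l2 = np.array([[np.mean((preds[i] != y) & (preds[j] != y))  # double-fault measure
                    for j in range(M)] for i in range(M)])
    best_w, best_obj = None, np.inf
    for subset in combinations(range(M), K):
        w = np.zeros(M)
        w[list(subset)] = 1
        obj = alpha * np.dot(w, l1) + (1 - alpha) * (w @ l2 @ w)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w
```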

Clustering-based pruning: Another approach to pruning is to first cluster the different models into groups and then select one representative from each group. To do so, let

$$\begin{aligned} H_i = \left( h_i(x_1)_1,\dots ,h_i(x_1)_C,h_i(x_2)_1,\dots ,h_i(x_2)_C, \dots ,h_i(x_N)_1,\dots ,h_i(x_N)_C\right) \end{aligned}$$

denote the (stacked) vector of all predictions of classifier \(h_i\) on the sample \({\mathcal {S}}\) with \(N\cdot C\) entries. Further, let

$$\begin{aligned} c(i) = \mathop {\mathrm{arg\,min}}\limits _{j=1,\dots ,K} \left\{ d(\mu _{j}, H_i) \right\} \end{aligned}$$

be the index of the closest cluster center \(\{\mu _1,\dots ,\mu _K\}\subseteq {\mathbb {R}}^{NC}\) to \(H_i\) given a distance function \(d :{\mathbb {R}}^{NC} \times {\mathbb {R}}^{NC} \rightarrow {\mathbb {R}}_+\). Then, clustering-based pruning formally solves the following optimization problem:

$$\begin{aligned}&\mathop {\mathrm {arg\,min}}\limits _{\begin{array}{c} w\in \{0,1\}^M \\ \mu _1,\dots ,\mu _K \in {\mathbb {R}}^{N C} \end{array}} \frac{1}{N} \sum _{(x,y) \in {\mathcal {S}}} \ell \left( \frac{1}{\Vert w \Vert _0}\sum _{i=1}^M w_i h_i(x), y\right) + \sum _{i=1}^M d(\mu _{c(i)}, H_i) \\ {}&\quad \qquad ~\text{ st. }~ \forall w_i = 1,w_j = 1, i\not =j: c(i) \not = c(j) ~\text{ and }~ \Vert w \Vert _0 = K \nonumber \end{aligned}$$
(7)

Equation 7 has three parts: The first part \(\frac{1}{N}\sum _{(x,y)\in {\mathcal {S}}}\ell \left( \frac{1}{\Vert w \Vert _0}\sum _{i=1}^M w_i h_i(x), y\right) \) measures the error of the selected subensemble, whereas \(\sum _{i=1}^M d(\mu _{c(i)}, H_i)\) determines the appropriate cluster centers. Finally, the constraints combine both parts to select one representative from each cluster. This optimization problem can be solved with existing clustering algorithms in two steps: First, a clustering is obtained (e.g. by using K-Means (Lazarevic and Obradovic 2001) or Hierarchical Agglomerative Clustering (Giacinto et al. 2000)), and then representatives are selected from each cluster based on the loss \(\ell \). For example, Giacinto et al. (2000) propose to use hierarchical agglomerative clustering with the pairwise error probability as the distance. Once the clusters have been obtained, they select the most distant representatives from each cluster to form a diverse ensemble. Lazarevic and Obradovic (2001) propose to use K-means clustering with the Euclidean distance. In contrast to Giacinto et al., they iteratively remove the least accurate classifier from a cluster until only one classifier is left, which is then included in the subensemble. More recent works on cluster-based pruning also directly include the diversity into the distance measure (Zyblewski and Woźniak 2019, 2020).
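As a rough sketch of this two-step procedure (assuming scikit-learn-style trees and using K-Means on the stacked prediction vectors \(H_i\)), one could proceed as follows; keeping the most accurate tree per cluster is a simplification of the iterative removal described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_pruning(trees, X, y, K):
    """Cluster the stacked prediction vectors H_i, then keep the most
    accurate tree from each cluster (one representative per cluster)."""
    H = np.stack([h.predict_proba(X).reshape(-1) for h in trees])  # (M, N*C)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(H)
    accs = [np.mean(np.argmax(h.predict_proba(X), axis=1) == y) for h in trees]
    w = np.zeros(len(trees), dtype=int)
    for c in range(K):
        members = np.where(labels == c)[0]
        w[members[np.argmax([accs[i] for i in members])]] = 1
    return w
```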

Ordering-based pruning: Ordering-based pruning orders all ensemble members according to their individual performances as well as their overall contribution to the ensemble and then picks the top K classifiers from this list. In this sense, ordering-based approaches are the most general method for ensemble pruning as they allow one to directly minimize the ensemble error:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{w\in \{0,1\}^M} \frac{1}{N}\sum _{(x,y) \in {\mathcal {S}}} \ell \left( \frac{1}{\Vert w \Vert _0}\sum _{i=1}^M w_i h_i(x), y\right) ~\text {st.}~ \Vert w \Vert _0 \le K\end{aligned}$$
(8)

where \(\ell :{\mathbb {R}}^C \times {\mathcal {Y}} \rightarrow {\mathbb {R}}\) is again a loss function. To do so, ordering-based approaches sort individual classifiers according to their performance and greedily select the tree that minimizes the overall ensemble error the most. Algorithm 1 depicts the ordering-based optimization approach. First, the classifier with the best individual loss is selected in line 2. Then, lines 4–6 select the classifier that minimizes \(\ell \) the most given the already selected ensemble \(\sum _{j=1}^M w_j h_j(x)\). In a sense, ordering-based approaches are greedy because they select the model that improves the ensemble the most without considering all possible combinations. Ordering-based pruning was also first presented by Margineantu and Dietterich (1997), who proposed to greedily minimize the overall ensemble error. A series of works by Martínez-Muñoz and Suárez (2004, 2006) and Martínez-Muñoz et al. (2008) add to this work by proposing different error measures. More recently, theoretical insights from Probably Approximately Correct (PAC) learning theory and the bias-variance decomposition were also transformed into greedy pruning approaches (Li et al. 2012; Jiang et al. 2017).

Algorithm 1 Ordering-based pruning
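A minimal Python sketch of Algorithm 1 is given below; it assumes scikit-learn-style trees and uses the error rate of the averaged class probabilities as the loss \(\ell \), which is only one of the error measures discussed above.

```python
import numpy as np

def ordering_based_pruning(trees, X, y, K):
    """Greedy sketch of Algorithm 1: repeatedly add the tree that reduces
    the loss of the current subensemble the most."""
    M = len(trees)
    probas = np.stack([h.predict_proba(X) for h in trees])    # (M, N, C)
    selected = []
    for _ in range(K):
        best_i, best_err = None, np.inf
        for i in range(M):
            if i in selected:
                continue
            f = probas[selected + [i]].mean(axis=0)           # subensemble prediction
            err = np.mean(np.argmax(f, axis=1) != y)
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
    w = np.zeros(M, dtype=int)
    w[selected] = 1
    return w
```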

2.2 Leaf-refinement

Looking beyond ensemble pruning itself there are numerous orthogonal methods to deploy ensembles to small devices. First, ‘classic’ decision tree pruning algorithms (e.g. minimal cost complexity pruning or sample complexity pruning, c.f. Barros et al. 2015) and more recent adaptations, such as cost-complexity forest pruning (Ravi and Serra 2017) already reduce the size of DTs while offering a better accuracy. Second, in the context of model compression (see e.g. Choudhary et al. 2020 for an overview) specific models such as Bonsai (Kumar et al. 2017), Decision Jungles (Shotton et al. 2013) or X-CLEaVER (Lucchese et al. 2018) aim to find smaller tree ensembles already during training, sometimes involving pruning as a sub-procedure.

One particularly interesting method, called leaf-refinement, refines the probability estimates in the leaf nodes of each tree by using a global loss that exploits complementary information between multiple trees (Ren et al. 2015; Buschjäger and Morik 2021). Since we can incorporate the ensemble weights into the leaf nodes as described above, leaf-refinement is a generalization of the re-weighting of ensembles (Akash et al. 2019; Shahhosseini et al. 2022; Shahhosseini and Hu 2020), making it a very general framework for improving tree ensembles. Formally, let \(\theta _i = ({{\widehat{y}}}_{i,1},\dots ,{{\widehat{y}}}_{i,L_i})\) be the probability estimates of all leaf nodes in tree \(h_i\) and let \(h_{i, \theta _i}(x)\) denote the prediction of tree i using the probability estimates \(\theta _i\). Further, let \(\theta = [\theta _1,\dots ,\theta _M]\) be the matrix of all probability estimates of all trees in the ensemble and let \(f_{\theta }(x)\) denote the prediction of the ensemble with estimates \(\theta \). Then, leaf-refinement proposes to minimize a global loss function

$$\begin{aligned} \theta = \mathop {\mathrm{arg\,min}}\limits _{\theta _1,\dots ,\theta _M} \frac{1}{N}\sum _{(x_i,y_i)\in {\mathcal {S}}} \ell \left( \frac{1}{M}\sum _{j=1}^{M} h_{j, \theta _j}(x_i), y_i\right) \end{aligned}$$
(9)

This global loss takes into account all the interactions between individual trees to refine the probability estimates in the leaves, but it does not change the structure of individual trees. Hence, it can be easily minimized by stochastic gradient descent (SGD). SGD is an iterative algorithm that takes a small step into the negative direction of the gradient in each iteration by using an estimate of the true gradient:

$$\begin{aligned} \theta \leftarrow \theta - \alpha g_{{\mathcal {B}}}(\theta ) \end{aligned}$$
(10)

where \(g_{{\mathcal {B}}}(\theta )\) is the gradient of \(\ell \) w.r.t. \(\theta \) computed on a mini-batch \({\mathcal {B}}\) and \(\alpha \in {\mathbb {R}}_+\) is the step-size. The gradients for the individual entries \(\theta _i\) are given by the chain rule:

$$\begin{aligned} g_{{\mathcal {B}}}(\theta _i) = \frac{1}{ \vert {\mathcal {B}} \vert } \left( \sum _{(x,y)\in {\mathcal {B}}} \frac{\partial \ell (f_{\theta }(x), y)}{\partial f_{\theta }(x)} \frac{1}{M} s_{i,l}(x)\right) _{l=1,2,\dots ,L_i} \end{aligned}$$
(11)
Algorithm 2 Leaf-Refinement (LR)

Algorithm 2 summarizes the Leaf-Refinement (LR) algorithm. First, the original probability estimates from the trees’ leaf nodes are used as an initialization for the parameter vectors \(\theta _i\) in line 2. Then, SGD is performed for E epochs using Eqs. (10) and (11) in lines 4–10. Here, one epoch refers to one linear scan over the entire dataset in which batches are processed so that each example occurs in exactly one batch during each epoch.
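For illustration, the following PyTorch sketch refines the leaves of a fitted scikit-learn forest along the lines of Algorithm 2. It assumes one-hot encoded labels and uses the MSE loss with plain SGD steps as in Eq. (10); our experiments later use the Adam optimizer instead. The leaf parameters are stored per node id for simplicity, so internal nodes carry unused rows.

```python
import numpy as np
import torch

def leaf_refinement(forest, X, y_onehot, epochs=50, batch_size=1024, step=1e-2):
    """Sketch of Algorithm 2: jointly refine the leaf probabilities of all
    trees by minimizing the MSE of the averaged ensemble prediction."""
    X = np.asarray(X, dtype=np.float32)
    Y = torch.tensor(y_onehot, dtype=torch.float32)
    # Leaf (node) index of every sample in every tree; the tree structure
    # itself never changes, so this can be precomputed once.
    leaf_idx = torch.tensor(np.stack([t.apply(X) for t in forest.estimators_]),
                            dtype=torch.long)                     # (M, N)
    thetas = []
    for t in forest.estimators_:
        v = t.tree_.value.squeeze(1)                              # per-node class counts
        v = v / v.sum(axis=1, keepdims=True)                      # initialize with class frequencies
        thetas.append(torch.nn.Parameter(torch.tensor(v, dtype=torch.float32)))
    opt = torch.optim.SGD(thetas, lr=step)
    N, M = X.shape[0], len(thetas)
    for _ in range(epochs):
        for b in torch.split(torch.randperm(N), batch_size):
            f = sum(theta[leaf_idx[i, b]] for i, theta in enumerate(thetas)) / M
            loss = ((f - Y[b]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return thetas
```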

The specific choice of the loss function differs in the literature. Ren et al. (2015) propose to use the hinge-loss in combination with an \(L_2\) regularization term, similar to the SVM. Let \(\lambda \in {\mathbb {R}}_+\) be a regularization strength, then they propose to minimize

$$\begin{aligned} \ell _{\lambda }(f_{\theta }(x), y) = \lambda \cdot \max (0, 1-f_{\theta }(x) \cdot y) + \frac{1}{2}\Vert \theta \Vert ^2_2 \end{aligned}$$
(12)

where \(\Vert \cdot \Vert ^2_2\) is the squared \(L_2\) norm, introduced to combat overfitting.

Buschjäger and Morik (2021) adapt the negative correlation learning algorithm (NCL) from the context of neural network training for leaf-refinement to enforce different levels of diversity. NCL is based on the bias-(co-)variance decomposition that is then transformed into a regularized learning objective (c.f. Brown et al. 2005). Again, let \(\lambda \in {\mathbb {R}}_+\) be the regularization strength then they propose to minimize

$$\begin{aligned} \ell _{\lambda }(f_{\theta }(x), y) = \frac{1}{M} \sum _{i=1}^M (h_{i,\theta _i}(x)-y)^2 - \frac{\lambda }{2M} \sum _{i=1}^M {d_i}^T D d_i \end{aligned}$$
(13)

where \(d_i = (h_{i,\theta _i}(x) - f(x))\) and \(D = 2 \cdot I_C\) is twice the \(C\times C\) identity matrix, i.e., a diagonal matrix with 2 on the main diagonal. For \(\lambda = 0\) this trains the M classifiers independently and no further diversity among the ensemble members is enforced, for \(\lambda > 0\) more diversity is enforced during training, and for \(\lambda < 0\) diversity is discouraged.
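Assuming the per-tree predictions are available as a tensor, the NCL objective (13) can be written compactly as follows; since \(D = 2 \cdot I_C\), the term \(d_i^T D d_i\) simplifies to \(2 \Vert d_i \Vert _2^2\).

```python
import torch

def ncl_loss(per_tree_preds, y, lam):
    """Batch version of Eq. (13).
    per_tree_preds: (M, B, C) predictions h_{i,theta_i}(x), y: (B, C) one-hot labels."""
    f = per_tree_preds.mean(dim=0)                    # ensemble prediction f(x)
    err = ((per_tree_preds - y) ** 2).sum(dim=2)      # (M, B) squared errors
    div = ((per_tree_preds - f) ** 2).sum(dim=2)      # (M, B), d_i^T d_i
    # (1/M) sum_i err_i - (lam/(2M)) sum_i 2*div_i, averaged over the batch
    return (err.mean(dim=0) - lam * div.mean(dim=0)).mean()
```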

3 Combining leaf-refinement and ensemble pruning

Both leaf-refinement and ensemble pruning enable better and smaller tree ensembles. However, the two approaches tackle this challenge from different points of view. Ensemble pruning removes entire trees from the ensemble to reduce its memory consumption and, as a by-product, improves its predictive performance. Leaf-refinement, on the other hand, refines the probability estimates in the trees to improve the performance and, as a by-product, enables the use of smaller forests with similar performance.

This leads to two questions: First, which of the two methods is better suited to deploy tree ensembles to small devices? Second, can we combine both methods to further improve the predictive performance of the forest while having a smaller memory consumption at the same time? In this section, we present a method that combines leaf-refinement with ensemble pruning to compute a small and powerful ensemble at the same time.

Arguably, the simplest method to combine both approaches is to first prune the ensemble and then refine it afterward. However, this method does not consider the interactions between the pruning algorithm and leaf refinement. It is conceivable that pruning would select different trees if the probability estimates had been refined before the pruning process. Similarly, it is conceivable that refinement would compute different leaf values if it had been performed on the unpruned ensemble. We advocate that the selection of trees, as well as the refinement of the corresponding leaf values, should be performed simultaneously to find the overall smallest and best ensemble. The key challenge in this scenario is to incorporate the selection of trees into the gradient-based approach of leaf-refinement. In ensemble pruning each tree either receives weight 0 (not selected) or 1 (selected). Unfortunately, it is difficult to optimize over discrete values \(\{0,1\}^M\) with gradient-based approaches because we apply small, non-binary changes to the weights during optimization. One possible approach to solve this dilemma is to relax the constraints and optimize over real-valued weights \(w\in {\mathbb {R}}^M\) in combination with an \(L_1\) regularization penalty that enforces sparsity:

$$\begin{aligned} \theta , w = \mathop {\mathrm{arg\,min}}\limits _{\theta , w\in {\mathbb {R}}^M} \frac{1}{N}\sum _{(x_i,y_i)\in {\mathcal {S}}} \ell \left( \sum _{j=1}^M w_j h_{j, \theta _j}(x_i), y_i\right) + \lambda \Vert w \Vert _1 \end{aligned}$$
(14)

Enforcing sparsity through \(L_1\) regularization has a long history in Data Mining and Machine Learning. Arguably its largest application can be found in feature selection via the LASSO and related methods (see e.g. Tibshirani 1996; Li et al. 2017), but other application areas such as Matrix Factorization (Kumar and Sindhwani 2015), Neural Network Pruning (Li et al. 2016), and Dictionary Learning (Jiang et al. 2015) have also been explored.

Objective (14) is non-smooth due to the \(L_1\) norm and hence cannot be minimized via SGD directly. Stochastic proximal gradient descent (SPGD) is an adaptation of SGD that incorporates a projection operation into the updates so that it can cope with non-smooth objectives (Parikh and Boyd 2014). SPGD is an iterative algorithm, where every iteration consists of two steps: first, a gradient descent update of the objective function is performed without considering its non-smooth part (e.g. ignoring the \(L_1\) regularizer). Then, a projection operator (sometimes called prox) is applied to project the updated parameters onto the correct solution considering the non-smooth part of the objective. Let w be the weight vector at step t and let \(g_{{\mathcal {B}}}(w)_i\) be the gradient of the i-th entry in w without considering the \(L_1\) term. Furthermore, let \({\mathcal {P}}_{\alpha }\) be the prox operator of \(\lambda \Vert w \Vert _1\) with step size \(\alpha \); then SPGD performs the following updates

$$\begin{aligned} w \leftarrow {\mathcal {P}}_{\alpha } \left( w - \alpha g_{{\mathcal {B}}}(w) \right) \end{aligned}$$
(15)

using the gradient via the chain-rule

$$\begin{aligned} g_{{\mathcal {B}}}(w) = \frac{1}{ \vert {\mathcal {B}} \vert } \left( \sum _{(x,y)\in {\mathcal {B}}} \frac{\partial \ell (f_{w, \theta }(x), y)}{\partial f_{w, \theta }(x)} h_{i,\theta _i}(x)\right) _{i=1,\dots ,M} \end{aligned}$$
(16)

and the prox \({\mathcal {P}}_{\alpha } :{\mathbb {R}}^M \rightarrow {\mathbb {R}}^M\) (Parikh and Boyd 2014):

$$\begin{aligned} {\mathcal {P}}_{\alpha } \left( w \right) = \left( {{\,\textrm{sign}\,}}(w_i) \max (|w_i| - \lambda \alpha , 0)\right) _{i=1,\dots ,M} \end{aligned}$$
(17)
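In code, this prox operator is a one-line soft-thresholding step; a minimal PyTorch sketch:

```python
import torch

def prox_l1(w, lam, alpha):
    """Soft-thresholding prox of the L1 term, Eq. (17)."""
    return torch.sign(w) * torch.clamp(w.abs() - lam * alpha, min=0.0)
```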

Since there is no regularizer for the leaf nodes, we can directly minimize the objective w.r.t. \(\theta \) without using the prox. In this case, the gradient for the leaves of \(h_i\) now also contains its weight \(w_i\) (again using the chain rule):

$$\begin{aligned} g_{{\mathcal {B}}}(\theta _i) = \frac{1}{ \vert {\mathcal {B}} \vert } \left( \sum _{(x,y)\in {\mathcal {B}}} \frac{\partial \ell (f_{w,\theta }(x), y)}{\partial f_{w,\theta }(x)} w_i s_{i,l}(x)\right) _{l=1,2,\dots ,L_i} \end{aligned}$$
(18)

Algorithm 3 summarizes this approach. Similar to before, the probability estimates in the leaf nodes are used as an initialization for the parameter vectors \(\theta _i\) in line 2. Then, SPGD is performed for E epochs using Eqs. (16), (18), and (17). To do so, the gradient for each weight \(g_{{\mathcal {B}}}(w)_i\) is computed, and a regular weight update is performed in line 7. Similarly, the gradient for the leaf nodes of each tree \(g_{{\mathcal {B}}}(\theta _i)\) is computed in line 8, and a regular gradient descent update is performed. After the leaf nodes of each tree as well as the tree weights have been updated, the prox operator is applied in line 10. For \(\lambda > 0\) we call this algorithm leaf-refinement with \(L_1\) regularization (L1 + LR). Setting \(\lambda = 0\) and ignoring any weight updates (line 7) recovers the original leaf-refinement (LR) algorithm. Similarly, ignoring any updates for the leaf nodes in line 8 yields a new pruning algorithm that selects trees purely based on the \(L_1\) norm, which we call \(L_1\) pruning.

Algorithm 3 Leaf-refinement with \(L_1\) regularization (L1 + LR)
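The following PyTorch sketch illustrates Algorithm 3 under the same assumptions as the leaf-refinement sketch above (scikit-learn forest, one-hot labels, MSE loss, plain SGD steps with the prox of Eq. (17) applied to the weights after every update); our experiments use Adam instead of plain SGD. Trees whose weight reaches zero can be removed from the ensemble afterwards.

```python
import numpy as np
import torch

def l1_leaf_refinement(forest, X, y_onehot, lam=0.1, epochs=50,
                       batch_size=1024, alpha=1e-2):
    """Sketch of Algorithm 3 (L1 + LR): jointly refine leaf probabilities
    and tree weights, soft-thresholding the weights after each step."""
    X = np.asarray(X, dtype=np.float32)
    Y = torch.tensor(y_onehot, dtype=torch.float32)
    leaf_idx = torch.tensor(np.stack([t.apply(X) for t in forest.estimators_]),
                            dtype=torch.long)                     # (M, N)
    thetas = []
    for t in forest.estimators_:
        v = t.tree_.value.squeeze(1)
        v = v / v.sum(axis=1, keepdims=True)
        thetas.append(torch.nn.Parameter(torch.tensor(v, dtype=torch.float32)))
    M, N = len(thetas), X.shape[0]
    w = torch.nn.Parameter(torch.full((M,), 1.0 / M))             # relaxed tree weights
    opt = torch.optim.SGD(thetas + [w], lr=alpha)
    for _ in range(epochs):
        for b in torch.split(torch.randperm(N), batch_size):
            f = sum(w[i] * theta[leaf_idx[i, b]] for i, theta in enumerate(thetas))
            loss = ((f - Y[b]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():                                 # prox step, Eq. (17)
                w.copy_(torch.sign(w) * torch.clamp(w.abs() - lam * alpha, min=0.0))
    keep = [i for i, wi in enumerate(w.detach().tolist()) if wi != 0]
    return w.detach(), thetas, keep
```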

4 Experiments

In this section, we experimentally evaluate the combination of leaf-refinement and pruning (L1 + LR) and compare its performance with vanilla Random Forests, pruned RFs, and vanilla leaf-refinement in the context of IoT. As argued before, our main concern is the final model size, as it determines the resource consumption, runtime, and energy of the model application during deployment (Buschjäger and Morik 2017; Buschjäger et al. 2018). Hence, we adopt a hardware-agnostic view and ask the following three questions:

  • Question 1 What method has the best predictive performance?

  • Question 2 What method has the best predictive performance under memory constraints?

  • Question 3 How do these methods behave in a real-world use case?

Table 2 The methods and their corresponding hyperparameters

An overview of all the hyperparameters for our experiments is given in Table 2. We use the following experimental protocol: The basic idea of ensemble pruning is to first overtrain the ensemble and then remove unnecessary classifiers from this overtrained pool. Oshiro et al. studied the impact of the number of trees on the performance of a regular RF and showed on a variety of datasets that there is no significant performance improvement when using more than 128 trees (Oshiro et al. 2012). Therefore, we ‘overtrain’ our base Random Forests with \(M = 256\) trees to increase the classifier pool for pruning without increasing the training time significantly. To control the individual errors of trees, we set the maximum number of leaf nodes \(n_l\) to values between \(n_l \in \{16, 32, 64, 128,256,512,1024\}\).

Among the greedy pruning methods, we use complementariness pruning (COMP) (Martínez-Muñoz and Suárez 2004), reduced error pruning (RE) (Margineantu and Dietterich 1997), and DREP (Li et al. 2012). COMP rates each member of the ensemble by its complementariness, i.e., the number of examples on which the estimator disagrees with the ensemble’s prediction but is itself correct. RE uses the error of the subensemble to rate each estimator, whereas DREP uses a PAC-style bound to rate each classifier. For cluster-based pruning we utilize largest mean distance (LMD) pruning (Giacinto et al. 2000) that first builds an agglomerative clustering of the estimators’ accuracies and then selects those estimators that are the farthest away from each cluster into the new ensemble. For rank-based pruning, we employ individual error (IE) pruning (Lu et al. 2010) and individual contribution (IC) pruning (Jiang et al. 2017). IE uses the individual error of each estimator, whereas IC computes the individual contributions to the ensemble’s prediction. Last, we also experimented with MQIP-based pruning (Zhang et al. 2006), but unfortunately, the MQIP optimizer would frequently fail or time-out during experiments. Each pruning method is tasked to select \(K\in \{2,4,8,16,32,64,128\}\) trees from the base forest. For DREP we additionally vary \(\rho \in \{0.25,0.3,0.35,0.4,0.45,0.5\}\).

During the development of our leaf-refinement method, we found that 50 epochs in combination with a batch size of 1024 minimizing the MSE loss works well on a variety of datasets. Hence, for leaf-refinement, we randomly select \(K\in \{2,4,8,16,32,64,128\}\) trees from the random forests (which is similar to training a smaller forest directly) and minimize the MSE loss for 50 epochs with a batch size of 1024 using the Adam optimizer (Kingma and Ba 2015) implemented in PyTorch (Paszke et al. 2019). Recall that our L1 + LR method indirectly controls the number of trees in the forest through the regularization strength \(\lambda \in \{1,0.5,0.1,0.05,0.01\}\). As discussed previously, we study two variations of our algorithm. In the first version, we do not perform any leaf-refinement, but only select trees using the \(L_1\) norm and call this algorithm L1. In the second version, we combine leaf-refinement with the \(L_1\) regularization as outlined in Algorithm 3 and call this algorithm L1 + LR.

For our experiments, we use 20 publicly available classification datasets with 6,435 to 78,095 examples as outlined in Table 3. Here, N denotes the total number of data points, d is the dimensionality, and C is the number of classes ranging from 2 to 11. The class distribution is also given for each dataset and each class. A dash ‘–’ indicates that the corresponding dataset has fewer classes, e.g., adult has only two classes, and hence entries for \(C_2\)–\(C_{17}\) are marked with a dash. In all experiments, we perform a 5-fold cross-validation except when the dataset has a dedicated train/test split, in which case we perform five repetitions of the experiment using different random seeds. We use the training set for both training the initial forest and pruning it. For a fair comparison, we made sure that each method receives the same forest in each cross-validation run. In all experiments, we use minimal preprocessing and encode categorical features as one-hot encoding. The random forests have been trained with scikit-learn (Pedregosa et al. 2011). We implemented all pruning algorithms in a Python package for other researchers called PyPruning, which is available under https://github.com/sbuschjaeger/PyPruning. The code for the experiments in this paper is available under https://github.com/sbuschjaeger/leaf-refinement-experiments. In total, we evaluated 660 hyperparameter configurations per dataset, leading to a total of 13,200 experiments.

Table 3 Datasets used for the experiments

4.1 What method has the best predictive performance?

In the first experiment, we study the predictive performance of pruning and leaf-refinement without considering any memory constraints. To do so, we pick the hyperparameter configuration of each method that has the best predictive performance. To account for imbalanced datasets (e.g. ida2016) we study the predictive performance in terms of accuracy and \(F_1\) score.

Table 4 shows the accuracy of each method on each dataset with the corresponding model size. For datasets without a dedicated train/test split we report the average accuracy and its standard deviation over the cross-validation folds. For datasets with a dedicated train/test split, we repeat the experiments with 5 different random seeds and report the average accuracy and its standard deviation over these repetitions. The highest accuracy is marked in bold. It can be clearly seen that the combination of leaf-refinement and \(L_1\) regularization (L1 + LR) offers the best accuracy on 13 datasets (adult, avila, bank, chess, connect, eeg, elec, fashion, har, ida2016, mnist, mozilla, statlog) and is tied for first place on 3 datasets (anuran, magic, postures). LR is the best method on gas-drift and nursery, whereas RF ranks first on jm1. As was to be expected, Random Forest seems to underperform on most datasets, and improvements are possible due to leaf-refinement or pruning. However, it is also noteworthy that large improvements seem only to be possible with refinement and not with pruning. For example, RF only achieves \(76.23 \%\) accuracy on the connect dataset and L1 + LR achieves up to \(84.16 \%\), whereas the best pruning method (here L1) achieves \(71.86 \%\) accuracy.

Table 5 shows the \(F_1\) score for each method on each dataset. Again, for datasets without a dedicated train/test split we report the average \(F_1\) score and its standard deviation over the cross-validation folds. For datasets with a dedicated train/test split, we repeat the experiments with 5 different random seeds and report the average \(F_1\) score and its standard deviation over these repetitions. The best method is marked in bold. Similar to before, L1 + LR ranks first on 14 datasets (adult, avila, bank, chess, connect, eeg, elec, fashion, har, ida2016, jm1, magic, mozilla, statlog) and is tied for first place on four datasets (anuran, mnist, nursery, postures) with LR. LR is the best method on two datasets (gas-drift, japanese-vowels). Interestingly, L1 + LR now also ranks first on the bank and jm1 datasets using the \(F_1\) score, which was not the case for the accuracy. We explain this behavior with the more imbalanced class distribution of these datasets.

As expected, the model size varies greatly between datasets in both tables, but there is also a sizable difference between the individual methods. RF has arguably the largest models, followed by the various pruning methods, whereas LR as well as L1 + LR seem to have the smallest models, although it is difficult to give a general recommendation here. We will examine the model size in more detail in the next section.

Table 4 The accuracy and model size of each method on each dataset
Table 5 The \(F_1\) score and model size of each method on each dataset

To give a statistically meaningful comparison we present the results in Tables 4 and 5 as a CD diagram (Demšar 2006). In a CD diagram, each method is ranked according to its performance, and a Friedman-Test is used to determine if there is a statistical difference between the average ranks of the methods. If this is the case, then a pairwise Wilcoxon-Test between all methods checks whether there is a statistical difference between two classifiers. CD diagrams visualize this evaluation by plotting the average rank of each method on the x-axis and connecting all classifiers whose performances are statistically similar via a horizontal bar. Figure 1 shows the corresponding CD diagram for the accuracy (left side) and \(F_1\) score (right side), where \(p=0.95\) was used for all statistical tests. In both cases, we see that L1 + LR ranks first, followed by LR. L1 + LR and LR are the statistically significant best methods. With some distance, L1, COMP, IC, RE, IE, RF, DREP, and LMD follow. Random Forest, LMD, and DREP are generally ranked last, whereas IE, RE, COMP, and L1 form one (for the accuracy) and two (in the case of the \(F_1\) score) cliques ranked in the middle. We conclude that pruning and leaf-refinement improve the accuracy over the base Random Forest in almost all cases, confirming the results in the literature. However, leaf-refinement seems to perform better than pruning, and larger improvements in terms of accuracy and \(F_1\) are possible when leaf values are refined. Last, the joint selection and refinement of trees via the L1 + LR algorithm seems to perform best overall, ranking first in both cases, thereby supporting our initial hypothesis that both pruning and refinement should be integrated into each other for the best performance.

Fig. 1 CD-diagram for the accuracy (left side) and \(F_1\) score (right side) of the different methods over multiple datasets. For all statistical tests, \(p=0.95\) was used. More to the right (lower rank) is better. The methods in connected cliques are statistically similar

4.2 What method has the best predictive performance under memory constraints?

In the second experiment, we study the predictive performance of pruning and leaf-refinement under memory constraints. Recall that small IoT devices are often severely limited in terms of memory (c.f. Table 1), and we can only deploy models that fit into the available memory. For our analysis, we adopt a hardware-agnostic view which assumes that we are given a fixed memory budget for our model, which should, naturally, maintain a state-of-the-art performance. To do so, we pick the hyperparameter configuration of each method that has the best predictive performance while having a model size smaller than \(\{256, 512, 768, 1024, 2048\}\) KB. The size of the model is computed as follows: A baseline implementation of DTs stores each node in an array and iterates over it. Each node inside the array requires a pointer to the left/right child (8 bytes in total assuming int is used), a boolean flag indicating whether it is a leaf node (1 byte), and the feature index as well as the threshold to compare the feature against (8 bytes assuming int and float are used). Finally, entries for the class probabilities are required for the leaf nodes (4 bytes per class, assuming that float is used). Thus, in total, a single node requires \(17+4\cdot C\) bytes, which we sum over all nodes in the entire ensemble (Buschjäger et al. 2018).
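For a scikit-learn forest, this size estimate can be computed directly from the fitted trees; the following helper is a sketch of the computation, where forest and C stand for a fitted ensemble and its number of classes.

```python
def forest_size_bytes(forest, n_classes):
    """Estimate the memory footprint of a fitted scikit-learn forest using
    the (17 + 4*C) bytes-per-node model described above."""
    n_nodes = sum(t.tree_.node_count for t in forest.estimators_)
    return n_nodes * (17 + 4 * n_classes)

# Example: keep only configurations below a 256 KB budget.
# fits_budget = forest_size_bytes(forest, n_classes=C) <= 256 * 1024
```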

Table 6 The accuracy of each method on each dataset with a model size below 256 KB
Table 7 The accuracy of each method on each dataset with a model size below 768 KB

We could not find meaningful differences between the \(F_1\) score and the accuracy, and hence we will focus on the accuracy for now and revisit the \(F_1\) score later on. Moreover, we will focus on \(\{256, 768, 2048\}\) KB constraints. Additional tables with additional memory constraints, as well as the \(F_1\) score, are given in the appendix. Table 6 shows the accuracy for model sizes below 256 KB. Contrary to the accuracies without any memory constraints, this table is now more fragmented. L1 + LR is the best method on 4 datasets (adult, har, nursery, statlog), whereas vanilla LR ranks first on 10 datasets (anuran, chess, connect, eeg, elec, fashion, gas-drift, japanese-vowels, mnist, postures). RE pruning is the best option on two datasets (avila, mozilla), IC is the best option on the jm1 dataset, COMP is the best algorithm on the magic dataset, and IC and COMP are the best options on the ida2016 dataset. Somewhat surprisingly, pruning via L1 did not lead to valid models on any dataset, whereas L1 + LR produces valid models on 12 datasets. We suspect that L1 and L1 + LR require different values for \(\lambda \) to select a similar number of trees. We investigate this phenomenon in more detail in the next section. Going from 256 KB constraints to 768 KB constraints in Table 7, L1 + LR seems to improve. It now ranks first on 7 datasets (adult, anuran, bank, eeg, har, magic, statlog), followed by LR, which ranks first on 10 datasets (chess, connect, elec, fashion, gas-drift, ida2016, japanese-vowels, mnist, nursery, postures), and IE, which ranks first on one dataset (avila). L1 + LR and IE share the first place on the mozilla dataset, and IC ranks first on one dataset (jm1). This trend continues for larger memory sizes as depicted in Table 8. Here, L1 + LR now ranks first on 12 datasets with a performance close to that of the unconstrained models in Table 4. LR ranks first on 7 datasets, and IC ranks first on one dataset.

Table 8 The accuracy of each method on each dataset with a model size below 2048 KB

We conclude that for small model sizes below 256 KB, pruning and refinement offer better predictive performance than a vanilla random forest, but it is difficult to give a clear recommendation of what method works best in this scenario. We hypothesize that due to the small model size, each method can only pick a few comparably small trees all with similar performance, and hence we find similar performances across the methods. Furthermore, LR seems to perform slightly better than L1 + LR. Once more memory is available, each method can pick more and larger trees, thereby leaving more room for picking ‘good’ and ‘bad’ trees. Hence, we see more differences between the individual methods and a clear trend toward refinement. Finally, for larger models with 2048 KB constraints, there is a clear trend towards L1 + LR for the best performance.

The difference between vanilla LR and L1 + LR for smaller model sizes can be explained by the choice of hyperparameters in this experiment. LR considers \(K \in \{2,4,8,16,32,64,128\}\) trees for refinement, whereas L1 + LR indirectly chooses the number of trees via \(\lambda \in \{0.1,0.2,\dots ,0.9,0.925,0.955,0.975,1.0\}\). We suspect that a more fine-grained selection of values for \(\lambda \) would have led to a more fine-grained distribution of different models with potentially better performance. Figure 2 shows the average number of estimators across all datasets and all configurations selected for different \(\lambda \) values in L1 + LR. The error band shows the standard deviation. As expected, increasing \(\lambda \) leads to a reduction in the number of trees. Between \(\lambda = 0.1\) and \(\lambda = 0.9\), there is a large, almost linear drop in the number of estimators from more than 200 to just under 100. An even steeper drop occurs for \(\lambda > 0.9\), but the number of estimators remains above 50 on average, even for \(\lambda = 1.0\). Hence, it is conceivable that choosing additional values \(\lambda \in [0.9,1.0]\), and maybe even values \(\lambda > 1\), would lead to the selection of even fewer trees (below 50) and would therefore lead to better performance for model sizes below 256 KB.

Fig. 2 Average number of estimators across all datasets and configuration of L1 + LR for different \(\lambda \) values. The error band shows the standard deviation

Fig. 3 Average accuracy (left column) and the average \(F_1\) score (right column) across a different number of estimators for the chess dataset (top row, average is computed over the cross-validation folds) and eeg dataset (bottom row, average is computed over the cross-validation folds)

To further study this phenomenon, we investigate the performance of pruning and leaf-refinement over the number of trees directly. Figure 3 shows the average test accuracy (left column) and average \(F_1\) score (right column) on the chess dataset (top row) and eeg dataset (bottom row); in both cases, the average is computed over the cross-validation folds. Please note that we found a similar behavior on the other datasets and hence decided to focus on these two datasets as they show the most distinctive behavior. On the chess dataset, one can see that L1 + LR never selects fewer than roughly 100 trees, indicating that a more careful choice of \(\lambda \) would have been necessary to select fewer trees. Here, LR with fewer estimators gives a better trade-off between the accuracy (and \(F_1\) score) and the number of trees. Last, we see how pruning can improve the performance over a regular RF: All pruning methods improve over the vanilla RF between 16 and 64 trees and then slowly converge to the original RF’s performance. In all cases, methods with leaf-refinement outperform pruning. Looking at the eeg dataset (bottom row), we see a slightly different picture. Here, L1 + LR selects ensembles with 16 to 256 trees, indicating that \(\lambda \) was better suited to the problem. Contrary to before, we find that regular leaf-refinement does not perform well for smaller ensembles on this dataset and is outperformed by L1 + LR for ensembles with 200 trees or fewer.

Fig. 4 Two-dimensional CD-diagram for the accuracy and \(\{256, 512, 768, 1024, 2048\}\) KB memory constraints. For all statistical tests, \(p=0.95\) was used. More to the right (lower rank) is better. The methods in connected cliques are statistically similar

Similar to the previous section, we want to give a more statistical overview of our findings using CD diagrams. To do so, we expand them into two-dimensional CD diagrams where we apply memory constraints for each level on the y-axis. In the first level, we apply very restrictive constraints, only allowing for models below 256 KB, and plot the average rank of each method. This will likely result in small ensembles of small trees. On the next level, we double the amount of memory allowed to 512 KB and again plot the average rank similar to a ‘regular’ CD diagram. We repeat this process for all constraints and plot 5 levels with \(\{256, 512, 768, 1024, 2048\}\) KB constraints. Figure 4 shows the CD diagram for the accuracy. As indicated by the previous discussion, all methods are relatively close to each other if only limited memory is available. LR is the best method, followed by IC, COMP, RE, IE, L1 + LR, DREP, LMD, RF, and L1 for 256 KB. As discussed previously, L1 on its own is the worst method for all memory constraints. Going to 512 KB constraints, we see that the methods begin to differentiate more but keep their relative ranking. For 768 KB constraints, L1 + LR starts to move up the ranks, now ranking second place, and for 1024 KB and 2048 KB constraints, it becomes the best method, ranking first. Figure 5 shows the CD diagram for the \(F_1\) score. The overall picture is similar to Fig. 4: If limited memory is available, then it becomes more difficult to distinguish the performance of single methods, whereas, with more memory available, the average ranks seem to differentiate more. Moreover, L1 is the worst method overall, whereas LR is the best method for 256 and 512 KB constraints, and L1 + LR is the best method for 1024–2048 KB constraints. For 768 KB constraints, there is no clear winner, although LR seems to rank slightly better than L1 + LR.

Fig. 5 Two-dimensional CD-diagram for the \(F_1\) score and \(\{256, 512, 768, 1024, 2048\}\) KB memory constraints. For all statistical tests, \(p=0.95\) was used. More to the right (lower rank) is better. The methods in connected cliques are statistically similar

4.3 Case-study for the PhyNetLab

To showcase the effectiveness of our approach, we will now compare the performance of ensemble pruning and leaf-refinement in the context of the PhyNetLab warehouse (Masoudinejad et al. 2018). The PhyNetLab is a hardware test platform for the evaluation and analysis of IoT-based warehouses. It consists of small, ultra-low power, energy-neutral devices called PhyNodes that are placed on storage boxes inside the warehouse. The nodes are connected to various access points and form a wireless sensor network. Each node measures the current light intensity, the current temperature, its acceleration, as well as the WiFi signal strength to the access points in the warehouse. The goal is to estimate the current position of each node and, thereby, allow for efficient routing and detection of storage boxes in the warehouse. While machine learning is ideally suited for such a task, the challenge lies in the deployment of models. The PhyNode has an MSP430 MCU with a total of 64 KB of Ferroelectric Random Access Memory (FRAM) available, of which 48 KB are accessible by the compact instruction set. Roughly one-third of this memory is already used for the operating system and drivers, leaving about 30 KB of memory for the top-level application, including the model. Subtracting an additional top-level application code of around 10 KB leaves roughly 20 KB for the localization model (Masoudinejad et al. 2018). Therefore, our goal is to find the best localization model that still fits into the remaining 20 KB.

During 42 experiments conducted at various light and temperature levels, a total of 41,431 measurements at 31 different locations inside the warehouse have been taken. Each measurement consists of the acceleration (X, Y, Z) of the box, the current temperature, the current light intensity, as well as the WiFi signal strength to 3 different access points and a unique identifier for each box. During earlier experiments, we noted that the acceleration can have a huge impact on the performance because, in some experiments, the boxes would not be leveled, introducing biases into the acceleration. Hence, the model would overfit on this feature, although by design the acceleration of a (standing) box should not impact the performance of the classification. Therefore, we ignore the acceleration in this experiment. To further reduce overfitting against specific environmental properties (e.g., a particularly shiny or warm day), we train the models on the data from the first 41 experiments and test them on the last experiment. The resulting training data has \(N_{train} = 40{,}444\) samples with \(d = 6\) features and \(C = 31\) classes, and the test set contains \(N_{test} = 987\) test samples.

Recall that the model must fit into 20 KB of memory. The size of the implemented model is highly dependent on the specific implementation and can vary across models, MCUs, and implementations. Hence, we perform a two-step process to find good models that fit into 20 KB of RAM: First, we train small models that approximately fit into the memory of the PhyNode. To do so, we estimate the size of the model during training by again counting the total number of nodes \(n_{total}\) in all trees inside the forest and then by computing the size via \((17+4\cdot C ) \cdot n_{total}\) as outlined above. In the second step, we use FastInference to generate the implementation of these models, automatically compile them, and remove all models that result in an overflow during the compilation, thereby leaving only models that can actually be deployed to the PhyNode.

FastInference is a model compiler that generates model- and CPU-specific inference code for various machine-learning models such as Decision Trees and Random Forests. To do so, FastInference runs in four steps (c.f. Fig. 6): In the first step, the model is loaded from a file into an internal representation. Then, the model is optimized, e.g., by converting floats to a fixed-point quantization, by pruning decision trees, etc. Third, the user chooses a backend that determines the target CPU’s properties, such as cache size, available memory, etc., as well as the desired implementation. With this information, FastInference re-structures the optimized model such that it can be expressed through a combination of different code templates, which is then realized by a template engine in step 4. The output of this operation is the C++ inferencing code of an optimized model that can be easily integrated into the compilation toolchain for deployment. FastInference offers two different types of tree implementations, namely native trees that iterate over a static array of nodes using a while-loop and if-else trees that decompose the DT into its if-else structure (c.f. Buschjäger et al. 2018). Unfortunately, if-else trees result in very large code sizes and hence would use too much memory during compilation. Therefore, we chose native trees for this experiment.

Some additional pre-processing was required to make the models fit into 20 KB: Recall that there are \(C=31\) classes and, hence, a tree with 16 leaf nodes requires 2 KB to store the class probabilities in the leaf nodes if a float variable is used. To reduce the memory consumption, we therefore employed a fixed-point quantization that scales each probability by a factor of 10,000 and rounds it down towards the next integer. In this way, the probabilities in each leaf node can be stored within a 2 Byte short variable, effectively halving the size. This operation is also implemented in FastInference, and we could not detect any change in the accuracy with this quantization.
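The following sketch illustrates this fixed-point quantization on the leaf values of a scikit-learn forest (the actual transformation is performed inside FastInference): scaling by 10,000 keeps all values well within the range of a 16-bit integer.

```python
import numpy as np

def quantize_leaves(forest, scale=10_000):
    """Scale each leaf probability by 10,000 and round down so that it fits
    into a 2-byte integer (sketch of the quantization used for the PhyNode)."""
    quantized = []
    for t in forest.estimators_:
        v = t.tree_.value.squeeze(1)
        p = v / v.sum(axis=1, keepdims=True)        # leaf class probabilities
        quantized.append(np.floor(p * scale).astype(np.int16))
    return quantized
```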

Fig. 6 Workflow of the FastInference model compiler

In a series of pre-experiments, we determined reasonable ranges for the hyperparameters of each algorithm so that the estimated model size is below 24 KB. Similar to before, we train a base Random Forest with \(M=256\) trees and \(n_l\in \{4,8,12\}\) leaf nodes. Each pruning method is tasked to select \(K \in \{2,4,8\}\) trees. For DREP, we used \(\rho \in \{0.25,0.3,\dots ,0.5\}\). For L1 and L1 + LR, we minimized the MSE over 20 epochs with the Adam optimizer using \(\alpha = 0.01\), \(|{\mathcal {B}} |= 1024\) and \(\lambda \in \{1.0465, 1.0466, \dots , 1.047\}\). Table 9 shows the accuracy and \(F_1\) score for the best models that could still fit in the PhyNode. As can be seen, L1 + LR offers the best predictive accuracy as well as the best \(F_1\) score, highlighting the usefulness of our approach. Moreover, we found that the accuracy seems to vary a lot between the different methods. For example, DREP is the worst method with \(51.32 \%\) accuracy, whereas L1 + LR is nearly 20 percentage points better with an accuracy of \(71.04 \%\). Given that all models are derived from the RF, these large differences seem surprising to us, but we could not find any errors in our evaluation pipeline. In particular, we made sure that all methods receive the same base forest so that no re-training of the forest would occur.

Table 9 Accuracy (rounded to the second decimal digit) and \(F_1\) score (rounded to the fourth decimal digit) of the best model per method that can still fit into the memory of the PhyNode

5 Conclusion

Ensemble algorithms are among the state-of-the-art in many machine learning applications. With the ongoing integration of ML models into everyday life, the deployment and continuous application of models become more and more important issues. By today’s standards, large Random Forests are trained for the best performance, which can challenge the resources of small devices and sometimes make deployment impossible. Various techniques have been proposed in the literature that try to reduce the memory consumption of tree ensembles while potentially increasing their performance. In this paper, we studied two common techniques, namely ensemble pruning and leaf-refinement. Ensemble pruning removes unnecessary classifiers from the ensemble to reduce the overall resource consumption while potentially improving its accuracy. Leaf-refinement, on the other hand, refines the probability estimates of the trees inside the ensemble by minimizing a global loss. In this paper, we combined both approaches into a single objective and presented an efficient algorithm to optimize it. Our L1 + LR method performs pruning by minimizing an \(L_1\)-regularized loss via proximal gradient descent while refining the probability estimates in the leaf nodes at the same time. In a series of 13,200 experiments on 20 publicly available datasets, we showed that L1 + LR has the statistically significant best accuracy and \(F_1\) score compared with 8 state-of-the-art methods. Moreover, we detailed how these algorithms behave under different memory constraints. We found that if only a very limited amount of memory is available, then L1 + LR and leaf-refinement behave similarly, offering better performance than ensemble pruning. If more memory is available, then L1 + LR seems to dominate over vanilla LR, making it the overall best choice. Last, we highlighted the usefulness of our approach in a case study using the PhyNetLab. We discussed how to train, prune, and implement small tree ensembles using the FastInference tool and showed how to effectively deploy our models to ultra-low power devices such as the PhyNode.