Neural Random Forest Imitation

We present Neural Random Forest Imitation - a novel approach for transforming random forests into neural networks. Existing methods propose a direct mapping and produce very inefficient architectures. In this work, we introduce an imitation learning approach by generating training data from a random forest and learning a neural network that imitates its behavior. This implicit transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is differentiable, can be used as a warm start for fine-tuning, and enables end-to-end optimization. Experiments on several real-world benchmark datasets demonstrate superior performance, especially when training with very few training examples. Compared to state-of-the-art methods, we significantly reduce the number of network parameters while achieving the same or even improved accuracy due to better generalization.


Introduction
Neural networks have become very popular in many areas, such as computer vision (Krizhevsky et al., 2012; Reinders et al., 2022; Ren et al., 2015; Simonyan & Zisserman, 2015; Zhao et al., 2017; Qiao et al., 2021; Rudolph et al., 2022; Sun et al., 2021), speech recognition (Graves et al., 2013; Park et al., 2019; Sun et al., 2021), automated game-playing (Mnih et al., 2015; Dockhorn et al., 2017), or natural language processing (Collobert et al., 2011; Sutskever et al., 2014; Otter et al., 2021). Researchers have published many datasets for training neural networks and put enormous effort into providing labels for each data sample. For real-world applications, the dependency on large amounts of labeled data represents a significant limitation (Breiman et al., 1984; Hekler et al., 2019; Barz & Denzler, 2020; Qi & Luo, 2020; Phoo & Hariharan, 2021; Wang et al., 2021). Frequently, there is little or even no labeled data for a particular task and hundreds or thousands of examples have to be collected and annotated. This particularly affects new applications and rare labels (e.g., detecting rare diseases or defects in manufacturing). Transfer learning and regularization methods are usually applied to reduce overfitting. However, for training with little data, the networks still have a considerable number of parameters that have to be fine-tuned, even if just the last layers are trained.
In contrast to neural networks, random forests are very robust to overfitting due to their ensemble of multiple decision trees. Each decision tree is trained on randomly selected features and samples. Random forests have demonstrated remarkable performance in many domains (Fernández-Delgado et al., 2014). While the generated decision rules are simple and interpretable, the orthogonal separation of the feature space can also be disadvantageous on other datasets, especially with correlated features (Menze et al., 2011). Additionally, random forests are not differentiable and cannot be fine-tuned with gradient-based optimization.
The combination of neural networks and random forests brings both worlds together. Neural networks have demonstrated excellent performance in complex data modeling but require large amounts of training data. Random forests are very good at learning with very little data without overfitting. The advantages of our approach for mapping random forests into neural networks are threefold: (1) We enable the generation of neural networks with very few training examples. (2) The resulting network can be used as a warm start, is fully differentiable, and allows further end-to-end fine-tuning. (3) The generated network can be easily integrated into any trainable pipeline (e.g., jointly with feature extraction) and existing high-performance deep learning frameworks can be used directly. This accelerates the process and enables parallelization via GPUs.
Mapping random forests into neural networks is already used in many applications such as network initialization (Humbird et al., 2019), camera localization (Massiceti et al., 2017), object detection (Reinders et al., 2018; 2019), or semantic segmentation (Richmond et al., 2016). State-of-the-art methods (Massiceti et al., 2017; Sethi, 1990; Welbl, 2014) create a two-hidden-layer neural network by adding a neuron for each split node and each leaf node of the decision trees. The number of parameters of the networks becomes enormous as the number of nodes grows exponentially with the increasing depth of the decision trees. Additionally, many weights are set to zero so that an inefficient representation is created. For both reasons, the mappings do not scale and are only applicable to simple random forests.
In this work, we present an imitation learning approach to generate neural networks from random forests, which results in very efficient models. We introduce a method for generating training data from a random forest that creates any amount of input-target pairs. With this data, a neural network is trained to imitate the random forest. Experiments demonstrate that the accuracy of the imitating neural network is equal to the original accuracy or even slightly better than the random forest due to better generalization while being significantly smaller. To summarize, our contributions are as follows:
• We propose a novel method for implicitly transforming random forests into neural networks by generating data from a random forest and training a random-forest-imitating neural network. Labeled data samples are created by evaluating the decision boundaries and guided routing to selected leaf nodes.
• In contrast to direct mappings, our imitation learning approach is scalable to complex classifiers and deep random forests.
• We enable learning and initialization of neural networks with very little data.
• Neural networks and random forests can be combined in a fully differentiable, end-to-end pipeline for acceleration and further fine-tuning.

Related Work
Random forests and neural networks share some similar characteristics, such as the ability to learn arbitrary decision boundaries; however, both methods have different advantages. Random forests are based on decision trees. Various tree models have been presented; the most well-known are C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984). Decision trees learn rules by splitting the data. The rules are easy to interpret and additionally provide an importance score of the features. Random forests (Breiman, 2001) are an ensemble method consisting of multiple decision trees, with each decision tree being trained using a random subset of samples and features. Fernández-Delgado et al. (2014) conduct extensive experiments comparing 179 classifiers on 121 UCI datasets (Dua & Graff, 2017). The authors show that random forests perform best, followed by support vector machines with a radial basis function kernel. Therefore, random forests are often considered as a reference for new classifiers.
Neural networks are universal function approximators. The generalization performance has been widely studied. [...] 2014) and find that neural networks achieve good results but are not as strong as random forests.
Sethi (1990) presents a mapping of decision trees to two-hidden-layer neural networks. In the first hidden layer, the number of neurons equals the number of split nodes in the decision tree. Each of these neurons implements the decision function of its split node and determines the routing to the left or right child node. The second hidden layer has a neuron per leaf node in the decision tree. Each of these neurons is connected to all split nodes on the path from the root node to the leaf node to evaluate whether the data is routed to the respective leaf node. Finally, the output layer is connected to all leaf neurons and aggregates the results by implementing the leaf votes. By using hyperbolic tangent and sigmoid functions, respectively, as activation functions between the layers, the generated network is differentiable and, thus, trainable with gradient-based optimization algorithms. The method can be easily extended to random forests by mapping all trees.
Welbl (2014) and Biau et al. (2019) propose a method that maps random forests into neural networks as a smart initialization and then fine-tunes the networks by backpropagation. Two training modes are introduced: independent and joint. Independent training fits all networks one after the other and creates an ensemble of networks as a final classifier. Joint training concatenates all tree networks into one single network so that the output layer is connected to all leaf neurons in the second hidden layer from all decision trees and all parameters are optimized together. Additionally, the authors evaluate sparse and full connectivity. Sparse connectivity maintains the tree structures and has fewer weights to train. In practice, sparse weights require a special differentiable implementation, which can drastically decrease performance, especially when training on a GPU. Full connectivity optimizes all parameters of the fully connected network. Massiceti et al. (2017) extend this approach and introduce a network splitting strategy by dividing each decision tree into multiple subtrees. The subtrees are mapped individually and share common neurons for evaluating the split decision.

Background and Notation
In this section, we briefly describe decision trees (Breiman et al., 1984), random forests (Breiman, 2001), and the notation used throughout this work. Decision trees consist of split nodes $N_{\text{split}}$ and leaf nodes $N_{\text{leaf}}$. Each split node $s \in N_{\text{split}}$ performs a split decision and routes a data sample $x$ to the left or right child node, denoted as $c_{\text{left}}(s)$ and $c_{\text{right}}(s)$, respectively. When using binary, axis-aligned split decisions, a single feature $f(s) \in \{1, \ldots, N\}$ and a threshold $\theta(s) \in \mathbb{R}$ are the basis for the split, where $N$ is the number of features. If the value of feature $f(s)$ is smaller than $\theta(s)$, the data sample is routed to the left child node and otherwise to the right child node, denoted as

$$x \in c_{\text{left}}(s) \iff x_{f(s)} < \theta(s), \tag{1}$$

$$x \in c_{\text{right}}(s) \iff x_{f(s)} \geq \theta(s). \tag{2}$$

Data samples are routed through a decision tree until a leaf node $l \in N_{\text{leaf}}$ is reached which stores the target value. For the classification task, these are the estimated class probabilities $P_{\text{leaf}}(l) = (p_1^l, \ldots, p_C^l)$, where $C$ is the number of classes. Decision trees are trained by creating a root node and consecutively finding the best split of the data based on a criterion. The resulting subsets are assigned to the left and right child node, and the subtrees are processed further. Commonly used criteria are the Gini Impurity or Entropy.
A single decision tree is very fast and operates on high-dimensional data. However, it tends to overfit the training data by constructing a deep tree that perfectly separates all training examples. While having a very small training error, this easily results in a large test error. Random forests address this problem by learning an ensemble of $n_T$ decision trees. Each tree is trained with a random subset of training examples and features. The prediction $RF(x)$ of a random forest is calculated by averaging the predictions of all decision trees.
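The routing rule and the averaging step can be illustrated with a minimal sketch. The nested-dictionary tree encoding and the function names `predict_tree` and `predict_forest` are assumptions made for this illustration, not part of the described method.

```python
# Hypothetical minimal tree encoding (an assumption for illustration):
# a split node is {"feature", "threshold", "left", "right"}, a leaf is {"probs": [...]}.

def predict_tree(node, x):
    """Route x to a leaf and return the class probabilities stored there."""
    while "probs" not in node:
        # Left child if x[f(s)] < theta(s), otherwise right child.
        if x[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["probs"]


def predict_forest(trees, x):
    """Average the class probabilities predicted by all trees."""
    per_tree = [predict_tree(tree, x) for tree in trees]
    n_classes = len(per_tree[0])
    return [sum(p[c] for p in per_tree) / len(per_tree) for c in range(n_classes)]


tree_a = {"feature": 0, "threshold": 0.5,
          "left": {"probs": [1.0, 0.0]},
          "right": {"probs": [0.0, 1.0]}}
tree_b = {"feature": 1, "threshold": 2.0,
          "left": {"probs": [0.5, 0.5]},
          "right": {"probs": [0.0, 1.0]}}

print(predict_forest([tree_a, tree_b], [0.2, 1.0]))  # [0.75, 0.25]
```

Both trees route the sample to their left leaf, so the forest output is the mean of (1.0, 0.0) and (0.5, 0.5).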

Neural Random Forest Imitation
Our proposed approach, called Neural Random Forest Imitation (NRFI), implicitly transforms random forests into neural networks. The main concept includes (1) generating training data from decision trees and random forests, (2) adding strategies for reducing conflicts and increasing the variety of the generated examples, and (3) training a neural network that imitates the random forest by learning the decision boundaries. As a result, NRFI enables the transformation of random forests into efficient neural networks. An overview of the proposed method is shown in Figure 1.

Data Generation
First, we propose a method for generating data from a given random forest. A data sample $x \in \mathbb{R}^N$ is an $N$-dimensional vector, where $N$ is the number of features. We select a target class $t \in \{1, \ldots, C\}$ from $C$ classes and generate a data sample for the selected class.

DATA INITIALIZATION
A data sample $x$ is initialized randomly. In the following, the feature-wise minimum and maximum of the training samples will be denoted as $f_{\min}, f_{\max} \in \mathbb{R}^N$. To initialize $x$, we sample $x \sim U(f_{\min}, f_{\max})$. In the next step, we will present a method for adapting the data sample to obtain characteristics of the target class.
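As a small illustration of this initialization, the following sketch draws each feature independently from a uniform distribution between its minimum and maximum. The function name `init_sample` and the list-based representation are assumptions for this example.

```python
import random


def init_sample(f_min, f_max, rng=random):
    """Draw x ~ U(f_min, f_max) feature by feature."""
    return [rng.uniform(lo, hi) for lo, hi in zip(f_min, f_max)]


rng = random.Random(0)  # fixed seed only to make the example repeatable
x = init_sample([0.0, -1.0, 10.0], [1.0, 1.0, 20.0], rng)
print(x)
```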
Figure 2: Overview of the data generation process from a decision tree. First, the class distribution information is propagated from the leaf nodes to the split nodes (a). Afterward, data samples are generated by guided routing (Section 4.1.2) and modifying the data based on the split decisions (b). The weights for sampling the left or right child node are highlighted in orange.

DATA GENERATION FROM DECISION TREES
A decision tree processes an input vector $x$ by routing the data through the tree until a leaf is reached. At each node, a split decision is evaluated, and the input is passed to the left child node or the right child node. Finally, a leaf $l$ is reached which stores the estimated probabilities $P_{\text{leaf}}(l) = (p_1^l, \ldots, p_C^l)$ for each class. We reverse this process and present a method for generating training data from a decision tree. An overview of the proposed data generation process is shown in Figure 2. First, the class distribution information is propagated bottom-up from the leaf nodes to the split nodes (see Figure 2a) and we define the class weights $W(n) = (w_1^n, \ldots, w_C^n)$ for every node $n$ as follows:

$$W(n) = \begin{cases} P_{\text{leaf}}(n) & \text{if } n \in N_{\text{leaf}}, \\ W(c_{\text{left}}(n)) + W(c_{\text{right}}(n)) & \text{if } n \in N_{\text{split}}. \end{cases} \tag{3}$$

For every leaf node, the class weights are equal to the stored probabilities in the leaf. For every split node, the class weights in the child nodes are summed up.
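The bottom-up propagation of class weights can be sketched recursively. The nested-dictionary tree encoding, the key `"w"`, and the function name `propagate_class_weights` are assumptions chosen for this illustration.

```python
def propagate_class_weights(node):
    """Attach class weights W(n) to every node and return the weights of `node`.

    Leaf: weights equal the stored class probabilities.
    Split node: weights are the element-wise sum of both children.
    """
    if "probs" in node:
        node["w"] = list(node["probs"])
    else:
        w_left = propagate_class_weights(node["left"])
        w_right = propagate_class_weights(node["right"])
        node["w"] = [a + b for a, b in zip(w_left, w_right)]
    return node["w"]


# Hypothetical two-class tree with three leaves.
tree = {"feature": 0, "threshold": 0.5,
        "left": {"probs": [1.0, 0.0]},
        "right": {"feature": 1, "threshold": 2.0,
                  "left": {"probs": [0.5, 0.5]},
                  "right": {"probs": [0.0, 1.0]}}}

print(propagate_class_weights(tree))  # [1.5, 1.5]
print(tree["right"]["w"])             # [0.5, 1.5]
```

In this toy tree, class 0 is supported mostly by the left subtree of the root, which is exactly the information the guided routing uses later.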
After preparation, data samples for a target class $t$ are generated (see Figure 2b). For that, characteristics of the target class are successively added to the data sample. Starting at the root node, we modify the input data so that it is routed through selected split nodes until a leaf node is reached. The pseudocode is presented in Algorithm 1. The routing is guided by the class weights of the target class $t$ in the left and right child node, denoted as $w_{\text{left}}$ and $w_{\text{right}}$. The weights are normalized by their L2-norm, denoted as $\hat{w}_{\text{left}}$ and $\hat{w}_{\text{right}}$. Afterward, the left or right child node is randomly selected as next child node $n_{\text{next}}$ depending on $\hat{w}_{\text{left}}$ and $\hat{w}_{\text{right}}$.
In the next step, the data sample is updated. We verify that the data sample is routed to the selected child node by evaluating the split decision. A split node $s$ routes the data to the left or right child node based on a split feature $f(s)$ and a threshold $\theta(s)$. If the value of the split feature $x_{f(s)}$ is smaller than $\theta(s)$, the data sample is routed to the left child node and otherwise to the right child node. The data sample is modified if it is not already routed to the selected child node by assigning a new value. If the selected child node is the left child node, the value has to be smaller than the threshold $\theta(s)$ and a new value between the minimum feature value $f_{\min,f(s)}$ and $\theta(s)$ is randomly sampled:

$$x_{f(s)} \sim U(f_{\min,f(s)}, \theta(s)). \tag{4}$$

If the data sample is supposed to be routed to the right child node, the new value is randomly sampled between $\theta(s)$ and the maximum feature value $f_{\max,f(s)}$:

$$x_{f(s)} \sim U(\theta(s), f_{\max,f(s)}). \tag{5}$$

This process is repeated until a leaf node is reached. In each node, characteristics are added that classify the data sample as the target class.

Algorithm 1 DATAGENERATIONFROMTREE: Generate data samples from a decision tree
Input: Decision tree split features $f(n)$ and thresholds $\theta(n)$, target class $t$, feature minimum $f_{\min}$ and maximum $f_{\max}$, class weights $W(n) = (w_1^n, \ldots, w_C^n)$ for all nodes $n \in N_{\text{split}} \cup N_{\text{leaf}}$
Output: Data sample $x$ for target class $t$
[...]
if feature $f(n)$ is already used then
[...]
$n \leftarrow n_{\text{next}}$
end while
return $x$
During this process, modifications can conflict with previous decisions because features are used multiple times within a decision tree or across multiple decision trees.Therefore, the current routing is weighted with a factor w path ≥ 1 to prioritize the path and not change the data sample if possible.
Overall, the presented method enables the generation of data samples and corresponding labels from a decision tree without adding any further data.
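The guided routing described in this section can be sketched as follows. This is an illustrative reading, not a reference implementation: the nested-dictionary tree encoding, the function names, sampling the child proportionally to the normalized weights, and applying $w_{\text{path}}$ only when the split feature was already used are all assumptions made for this example.

```python
import math
import random


def add_class_weights(node):
    """Bottom-up class weights: leaf -> stored probabilities, split -> sum of children."""
    if "probs" in node:
        node["w"] = list(node["probs"])
    else:
        left = add_class_weights(node["left"])
        right = add_class_weights(node["right"])
        node["w"] = [a + b for a, b in zip(left, right)]
    return node["w"]


def generate_from_tree(root, x, t, f_min, f_max, used, w_path=1.0, rng=random):
    """Modify x in place so that it is routed towards leaves supporting class t."""
    node = root
    while "probs" not in node:
        f, theta = node["feature"], node["threshold"]
        goes_left = x[f] < theta
        w_left, w_right = node["left"]["w"][t], node["right"]["w"][t]
        if f in used:
            # Assumed reading of w_path: favour the branch x already takes.
            if goes_left:
                w_left *= w_path
            else:
                w_right *= w_path
        norm = math.hypot(w_left, w_right)  # L2-norm of the two weights
        if norm == 0.0:
            pick_left = goes_left  # no support for class t on either side
        else:
            w_left_hat, w_right_hat = w_left / norm, w_right / norm
            pick_left = rng.random() * (w_left_hat + w_right_hat) < w_left_hat
        if pick_left and not goes_left:
            x[f] = rng.uniform(f_min[f], theta)  # resample below the threshold
        elif not pick_left and goes_left:
            x[f] = rng.uniform(theta, f_max[f])  # resample at or above the threshold
        used.add(f)
        node = node["left"] if pick_left else node["right"]
    return x


tree = {"feature": 0, "threshold": 0.5,
        "left": {"probs": [1.0, 0.0]},
        "right": {"probs": [0.0, 1.0]}}
add_class_weights(tree)
x = generate_from_tree(tree, [0.1], t=1, f_min=[0.0], f_max=[1.0], used=set())
print(x)  # x[0] is resampled into [0.5, 1.0], so the sample reaches the class-1 leaf
```

In this toy tree only the right leaf supports class 1, so the routing is deterministic; with mixed leaves the chosen branch varies from sample to sample, which is what produces diverse examples.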

DATA GENERATION FROM RANDOM FORESTS
In the next step, we extend the method to generate data from random forests. Random forests consist of $n_T$ decision trees $RF = \{T_1, \ldots, T_{n_T}\}$. For generating a data sample $x$, the presented method for a single decision tree is applied to multiple decision trees consecutively. The initialization is performed only once and the visited features are shared. In each decision tree, the data sample is modified and routed to selected nodes based on the target class $t$. When using all decision trees, data samples are created where all trees agree with a high probability. For generating examples with varying confidence, i.e., the predictions of the individual decision trees diverge, we select a subset of $n_{\text{sub}}$ decision trees $RF_{\text{sub}} \subseteq RF$.
All decision trees in $RF_{\text{sub}}$ are processed in random order to generate a data sample. For each decision tree, the presented method modifies the data sample based on the target class.
Finally, the output of the random forest $y = RF(x)$ is predicted. In most cases, the prediction matches the target class. Due to factors such as the stochastic process, a small subset size, or varying predictions of the decision trees, it can be different occasionally. Thus, an input-target pair $(x, y)$ has been created, showing similar characteristics as the target class, and any amount of data can be generated by repeating this process.
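The forest-level procedure can be outlined in a few lines. In this sketch, `route` and `predict` are hypothetical callables standing in for the per-tree guided routing and for the random forest prediction; their signatures are assumptions made for illustration.

```python
import random


def generate_from_forest(trees, t, f_min, f_max, n_sub, route, predict, rng=random):
    """Create one input-target pair (x, y) for target class t.

    route(tree, x, t, used) -- hypothetical: modifies x in place for one tree
    predict(trees, x)       -- hypothetical: returns the forest output RF(x)
    """
    # The initialization is performed only once.
    x = [rng.uniform(lo, hi) for lo, hi in zip(f_min, f_max)]
    used = set()  # visited features are shared across all trees
    # Random subset RF_sub of size n_sub, returned in random order.
    for tree in rng.sample(trees, n_sub):
        route(tree, x, t, used)
    # The label comes from the full forest and may differ from t.
    y = predict(trees, x)
    return x, y


# Toy usage with stub callables, only to show the control flow.
visited = []
x, y = generate_from_forest(
    trees=["T1", "T2", "T3"], t=0, f_min=[0.0, 0.0], f_max=[1.0, 1.0], n_sub=2,
    route=lambda tree, x, t, used: visited.append(tree),
    predict=lambda trees, x: [0.5, 0.5],
)
print(visited, x, y)
```

Keeping the label as `predict(trees, x)` rather than `t` mirrors the text: the pair records what the forest actually outputs for the generated sample.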

AUTOMATIC CONFIDENCE DISTRIBUTION
The number of decision trees $n_{\text{sub}}$ can be set to a fixed value or sampled uniformly. Alternatively, we present an automatic process for determining an optimal distribution of the confidences for generating a wide variety of different examples. The strategy is motivated by importance weighting (Fang et al., 2020). We generate $n$ data samples ($n$ is empirically set to 1000) for each number of decision trees $j \in [1, n_T]$. The respective generated datasets will be denoted as $D_j$.
An optimal sampling process generates highly diverse data samples with different confidences. To achieve that, an automated balancing of the distributions is determined. A histogram with $H$ bins is calculated for each $D_j$, where $h_i^j$ denotes the number of generated examples in the $i$-th interval (equally distributed) from the distribution with $j$ decision trees. In the next step, a weight $w_D^j$ is defined for each distribution, and we optimize $w_D$ as follows: [...] where $w_D \in \mathbb{R}^{n_T}$. This optimization finds a weighting of the number of decision trees so that the generated confidences cover the full range equally. For that, the number of samples per bin $h_i^j$ is summed up, weighted over all numbers of decision trees. After determining $w_D$, the number of decision trees can be sampled depending on $w_D^j$. An analysis of different sampling methods will be presented in Section 5.5.1. Automatically balancing the number of decision trees generates data samples whose confidences are evenly distributed between low and high. The process does not require training data and provides a universal solution.
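One way to realize such a balancing is sketched below. The objective is an assumption: the sketch minimizes the squared deviation of the weighted, normalized histograms from a uniform target, using a simple heuristic (gradient step, clipping at zero, renormalization). The exact objective used by the method may differ, and the function name `balance_tree_count_weights` is hypothetical.

```python
import random


def balance_tree_count_weights(hist, steps=500, lr=0.1):
    """hist[j][i]: number of samples from dataset D_(j+1) falling into confidence bin i.

    Returns non-negative weights that sum to one (assumed least-squares objective).
    """
    n, n_bins = len(hist), len(hist[0])
    # Relative bin frequencies for every number of decision trees.
    freq = [[h / sum(row) for h in row] for row in hist]
    w = [1.0 / n] * n
    target = 1.0 / n_bins  # uniform coverage of the confidence range
    for _ in range(steps):
        mix = [sum(w[j] * freq[j][i] for j in range(n)) for i in range(n_bins)]
        grad = [2.0 * sum((mix[i] - target) * freq[j][i] for i in range(n_bins))
                for j in range(n)]
        w = [max(0.0, wj - lr * g) for wj, g in zip(w, grad)]  # step and clip
        total = sum(w)
        w = [wj / total for wj in w]                           # renormalize
    return w


# Toy example: one setting fills only the low bin, two settings only the high bin.
hist = [[10, 0], [0, 10], [0, 10]]
w_d = balance_tree_count_weights(hist)
print([round(v, 3) for v in w_d])  # [0.5, 0.25, 0.25]

# Sample the number of decision trees (1-based) according to the weights.
n_sub = random.choices(range(1, len(w_d) + 1), weights=w_d)[0]
print(n_sub)
```

In the toy example the low-confidence setting receives half of the probability mass, so both bins end up equally populated.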

Imitation Learning
Finally, a neural network that imitates the random forest is trained. The network learns the decision boundaries from the generated data and approximates the same function as the random forest. The network architecture is based on a fully connected network with one or multiple hidden layers. The data dimensions are the same as those of the random forest, i.e., an $N$-dimensional input and $C$-dimensional output. Each hidden layer is followed by a ReLU activation (Nair & Hinton, 2010). The last fully connected layer uses a softmax activation.
For training, we generate input-target pairs $(x, y)$ as described in the last section. These training examples are fed into the training process to teach the network to predict the same results as the random forest. To avoid overfitting, the data is generated on-the-fly so that each training example is unique. In this way, we learn an efficient representation of the decision boundaries and are able to transform random forests into neural networks implicitly. In addition to that, the training is performed end-to-end on the generated data, and we can easily integrate the original training data.
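To make the architecture concrete, the following sketch shows the forward pass of a network with two hidden layers, ReLU activations, and a softmax output, together with a cross-entropy loss against a probability vector. It uses plain Python for self-containment; the initialization scheme, the helper names, and the use of the forest's class probabilities as soft targets are assumptions for this illustration. In practice, a deep learning framework would handle this together with the gradient updates.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero bias (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(params, x):
    """N-dimensional input -> hidden (ReLU) -> hidden (ReLU) -> C-dimensional softmax."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = relu(dense(x, w1, b1))
    h = relu(dense(h, w2, b2))
    return softmax(dense(h, w3, b3))


def cross_entropy(pred, target):
    """Loss between the network output and a target probability vector."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))


rng = random.Random(0)
n_features, n_hidden, n_classes = 4, 16, 3
params = [init_layer(n_features, n_hidden, rng),
          init_layer(n_hidden, n_hidden, rng),
          init_layer(n_hidden, n_classes, rng)]

x = [0.2, -0.5, 0.9, 0.0]   # a generated sample, normalized to [-1, 1]
y = [0.7, 0.2, 0.1]         # hypothetical forest output RF(x)
pred = forward(params, x)
print(pred, cross_entropy(pred, y))
```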

Experiments
In this section, we perform several experiments to analyze the performance of neural random forest imitation and compare our method to state-of-the-art methods. In the following, we evaluate on standard benchmark datasets to present a general approach for various domains. While we focus on classification tasks in this work, NRFI can be simply adapted for regression tasks.

Datasets
The experiments are evaluated on nine classification datasets from the UCI Machine Learning Repository (Dua & Graff, 2017) (Car, Connect-4, Covertype, German Credit, Haberman, Image Segmentation, Iris, Soybean, and Wisconsin Breast Cancer (Original)). The datasets cover many real-world problems in different areas, such as finance, computer vision, games, or medicine. [...] 2014), we extract validation sets from the training set (e.g., for hyperparameter tuning). This ensures that the training and validation data are not mixed with the test data. For some datasets which provide a separate test set, the test accuracy is evaluated on the respective set. Missing values are set to the mean value of the feature. All experiments are repeated ten times with randomly sampled splits. The methods are additionally repeated four times with different seeds on each split.

Implementation Details
In all our experiments, stochastic gradient descent with Nesterov momentum as optimizer and cross-entropy loss are used. The initial learning rate is set to 0.1, momentum to 0.9, and weight decay to 0.0005. The batch size is set to 128 and 512, respectively, for generated data. The input data is normalized to $[-1, 1]$. For generating a wide variety of data, the prioritization of the current path $w_{\text{path}} \sim 1 + |\mathcal{N}(0, 5)|$ is sampled for each data sample individually. A new random forest is trained every 100 epochs to average the influence of the stochastic process, and the generated data samples are mixed. In the following, training on generated data will be denoted as NRFI (gen) and training on generated and original data as NRFI (gen+ori). The fraction of NRFI data is set to 0.9. Random forests are trained with 500 decision trees, which are commonly used in practice (Fernández-Delgado et al., 2014; Olson et al., 2018). The decision trees are constructed up to a maximum depth of ten. For splitting, the Gini Impurity is used and $\sqrt{N}$ features are randomly selected, where $N$ is the number of features.

Results
The proposed method generates data from a random forest and trains a neural network that imitates the random forest. The goal is that the neural network approximates the same function as the random forest. This also implies that the network reaches the same accuracy if successful.
We analyze the performance by training random forests for each dataset and evaluating neural random forest imitation with different network architectures. A variety of network architectures with different depths, widths, and additional layers such as Dropout have been studied. In this work, we focus on two-hidden-layer networks with an equal number of neurons in both layers for clarity. The results are shown in Figure 3 exemplarily for the Car, Covertype, and Wisconsin Breast Cancer (Original) datasets. The other datasets show similar characteristics. The overall evaluation on all datasets is presented in the next section. The number of training examples per class is shown in parentheses and increases in each row from left to right. For each setting, the test accuracy of the random forest is indicated by a red dashed line. The average test accuracy and standard deviation depending on the network architecture, i.e., the number of neurons in the first and second hidden layer, are plotted for different architectures. NRFI (gen), which is trained with generated data only, is shown in orange and NRFI (gen+ori), which is trained with generated and original data, is shown in blue.
The analysis shows that the accuracy of the neural networks trained by NRFI reaches the accuracy of the random forest for all datasets. Only very small networks do not have the required capacity. The proposed method for generating labeled data from random forests by analyzing the decision boundaries enables training neural networks that imitate the random forests. For instance, in the case of 5 training examples per class, a two-hidden-layer network with 16 neurons in both layers already achieves the same accuracy as the random forest across all three datasets in Figure 3. Additionally, the experiment shows that the training is very robust to overfitting even when the number of parameters in the network increases. When combining the generated data and original data, the accuracy on Car and Covertype improves with an increasing number of training examples.
Overall, the experiment shows that the accuracy increases with an increasing number of neurons in both layers and NRFI is robust to different network architectures. NRFI is capable of generating a large variety of unique examples from random forests which have been initially trained on a limited amount of data.

Comparison to State of the Art
We now compare the proposed method to state-of-the-art methods for mapping random forests into neural networks and classical machine learning classifiers such as random forests and support vector machines with a radial basis function kernel that have been shown to be the best two classifiers across all UCI datasets (Fernández-Delgado et al., 2014). In detail, we will evaluate the following methods:
• DT: A decision tree (Breiman et al., 1984) learns simple and interpretable split decisions to classify data. The Gini Impurity is used for splitting.
• SVM: Support vector machine (Chang & Lin, 2011) is a popular classifier that tries to find the best hyperplane that maximizes the margin between the classes. As evaluated by Fernández-Delgado et al. (2014), the best performance is achieved with a radial basis function kernel.
Figure 4: Comparison of the state-of-the-art and our proposed method for transforming random forests into neural networks. The closer a method is to the lower-left corner, the better it is (fewer network parameters and lower test error). For neural random forest imitation, different network architectures are shown. Note that the number of network parameters is shown on a logarithmic scale.
• RF: Random forest (Breiman, 2001) is an ensemble-based method consisting of multiple decision trees. Each decision tree is trained on a different randomly selected subset of features and samples. The classifier follows the same overall setup, i.e., 500 decision trees and a maximum depth of ten.
• Sethi: The method proposed by Sethi (1990) maps a random forest into a two-hidden-layer neural network by adding a neuron for each split node and each leaf node. The weights are set corresponding to the split decisions.
• Welbl: Welbl (2014) and Biau et al. (2019) present a similar mapping with subsequent fine-tuning. The authors introduce two training modes: independent and joint. The first optimizes each small network individually, while the latter joins all mapped decision trees into one network. Additionally, the authors evaluate a network with sparse connections and regular fully connected networks (denoted as sparse and full).
For each method, the average number of parameters of the generated networks across all datasets is plotted depending on the test error. That means that the methods aim for the lower-left corner (smaller number of network parameters and higher accuracy). Please note that the y-axis is shown on a logarithmic scale. The average performance of the random forests is indicated by a red dashed line.
The analysis shows that Sethi, Welbl (ind-full), and Welbl (joint-full) generate the largest networks. Network splitting (Massiceti et al., 2017) slightly reduces the number of parameters of the networks. Using a sparse network architecture reduces the number of parameters. However, it should be noted that this requires special operations. NRFI with and without the original data is shown for different network architectures. The smallest architecture has 2 neurons in both hidden layers and the largest 128. For NRFI (gen+ori), we can see that a network with 16 neurons in both hidden layers (NN-16-16) is already sufficient to learn the decision boundaries of the random forest and achieve the same accuracy. When fewer training samples are available, NN-8-8 already has the required capacity. In the following, we will further analyze the accuracy and number of network parameters.

ACCURACY
The average test accuracy and standard deviation for all methods are shown in Table 1. Here, we additionally include decision trees, support vector machines, random forests, and neural networks in the comparison. The evaluation is performed on all nine datasets, and results for different numbers of training examples are shown (increasing from left to right). The overall performance of each method is summarized in the last column. For neural random forest imitation, a network architecture with 128 neurons in both hidden layers is used. From the analysis, we can make the following observations: (1) When training neural random forest imitation with generated data only, the method achieves 99.18% of the random forest accuracy (71.44% compared to 72.03%). This shows that NRFI is capable of learning the decision boundaries. (2) [...]
NRFI introduces imitation instead of direct mapping. In the following, a network architecture with 32 neurons in both hidden layers is selected. The previous analysis has shown that this architecture is capable of imitating the random forests (see Figure 4 for details) across all datasets and different numbers of training examples. Our method significantly reduces the number of parameters of the generated networks while reaching the same or even slightly better accuracy. The current best-performing methods generate networks with an average number of parameters of either 142 000, if sparse processing is available, or 748 000 when using usual fully connected neural networks. In comparison, neural random forest imitation requires only 2676 parameters. Another advantage is that the proposed method does not create a predefined architecture but enables arbitrary network architectures. As a result, NRFI enables the transformation of very complex classifiers into neural networks.
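The size of such an imitation network follows directly from the layer dimensions. The sketch below counts weights and biases of a fully connected network with 32 neurons in both hidden layers; the function name `mlp_param_count` is an assumption, and the feature and class counts used here are those of Soybean (35 features, 19 classes) and Iris (4 features, 3 classes).

```python
def mlp_param_count(n_features, n_classes, hidden=(32, 32)):
    """Number of weights and biases of a fully connected network."""
    sizes = [n_features, *hidden, n_classes]
    # Each layer contributes n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print(mlp_param_count(35, 19))  # Soybean: 2835
print(mlp_param_count(4, 3))    # Iris: 1315
```

For Soybean, the three layers contribute 1152, 1056, and 627 parameters. Because only the input and output layers depend on the dataset, the count stays in the low thousands regardless of the size of the random forest.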

Analysis of the Generated Data
To study the sampling process, we analyze the variability of the generated data as well as different sampling modes in the next experiment.Subsequently, we investigate the impact of combining original and generated data.

CONFIDENCE DISTRIBUTION
The data generation process aims to produce a wide variety of data samples. This includes data samples that are classified with high confidence and data samples that are classified with low confidence, so that the full range of prediction uncertainties is covered. The following analyses use the Soybean dataset as an example. This dataset has 35 features and 19 classes. First, we analyze the generated data with a fixed number of decision trees, i.e., the number of sampled decision trees in $RF_{sub}$. The resulting confidence distributions for different numbers of decision trees are shown in the first column of Figure 5. When a data sample is adapted to only a few decision trees, the confidence of the generated samples is lower (around 0.2 for 5 samples per class). Using more decision trees for generating data samples increases the confidence on average. NRFI uniform samples the number of decision trees for each data point uniformly, whereas NRFI dynamic optimizes it via automatic confidence distribution (see Section 4.1.4). The confidence distributions for both sampling modes are visualized in the second column of Figure 5. Additionally, sampling random data points without generating data from the random forest is included as a baseline. The analysis shows that random data samples and uniform sampling are biased toward data samples that are classified with high confidence. NRFI dynamic automatically balances the number of decision trees and achieves evenly distributed data, i.e., it generates the most diverse data samples.
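The notion of confidence used in this analysis can be illustrated with a small sketch. It assumes that the confidence of a sample is the largest class probability after averaging the per-tree class distributions; this definition, as well as the helper names `forest_confidence` and `sample_tree_subset`, are assumptions for illustration and are not taken from the method description.

```python
import random


def forest_confidence(tree_probs):
    """Return (predicted class, confidence) from per-tree class probabilities.

    Assumption: the forest averages the tree outputs and the confidence
    is the maximum of the averaged class probabilities.
    """
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    mean = [sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean[best]


def sample_tree_subset(n_trees, rng, n_selected=None):
    """Pick tree indices without replacement.

    If n_selected is None, the subset size is drawn uniformly from
    1..n_trees, loosely mirroring the uniform sampling mode.
    """
    if n_selected is None:
        n_selected = rng.randint(1, n_trees)
    return rng.sample(range(n_trees), n_selected)


# Toy forest output: three trees, three classes, two trees vote for class 0.
toy_probs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
predicted, confidence = forest_confidence(toy_probs)
subset = sample_tree_subset(n_trees=10, rng=random.Random(0))
```

In this toy setting, a sample that only a few trees agree on receives a low averaged confidence, while agreement among many trees pushes the confidence toward 1.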
In the next step, the imitation learning performance of the sampling modes is evaluated. The results are shown in Table 3. Random data generation reaches a mean accuracy of 63.80%, while NRFI uniform and NRFI dynamic achieve 87.46% and 88.14%, respectively. This shows that neural random forest imitation is able to generate significantly better data samples based on the knowledge in the random forest. NRFI dynamic improves the performance by automatically optimizing the decision tree sampling and generating the largest variation in the data.

ORIGINAL AND GENERATED DATA
In the next experiment, we study the effects of training with original data, NRFI data, and combinations of both. For that, the fraction of NRFI data $w_{gen}$, which weights the loss of the generated data, is varied. Accordingly, the weight for the original data is set to $w_{ori} = 1 - w_{gen}$. The average accuracy over all datasets for different numbers of samples per class is shown in Figure 6. When the fraction of NRFI data is set to 0%, the network is trained with only the original data. When the fraction is set to 100%, the network is trained completely with the generated data. The study shows that training with NRFI data performs better than training with original data, except for 50 samples per class, where training with original data is slightly better. Combining original and NRFI data improves the performance. The best result is achieved when using mainly NRFI data with a small fraction of original data.
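The weighting of both data sources can be sketched as follows. The sketch assumes a cross-entropy loss that is averaged separately over the original and the generated batch before the two terms are combined; the choice of loss and the names `cross_entropy` and `combined_loss` are assumptions, and only the relation $w_{ori} = 1 - w_{gen}$ is taken from the text.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class (clipped for stability)."""
    return -math.log(max(probs[target], 1e-12))


def combined_loss(ori_batch, gen_batch, w_gen):
    """Weighted sum of the mean losses on original and generated data.

    Each batch is a list of (predicted_probabilities, target_class) pairs.
    """
    if not 0.0 <= w_gen <= 1.0:
        raise ValueError("w_gen must lie in [0, 1]")
    w_ori = 1.0 - w_gen

    def mean_loss(batch):
        if not batch:
            return 0.0
        return sum(cross_entropy(p, t) for p, t in batch) / len(batch)

    return w_ori * mean_loss(ori_batch) + w_gen * mean_loss(gen_batch)


# Toy batches: one uncertain prediction on original data, one perfect on generated data.
ori = [([0.5, 0.5], 0)]
gen = [([1.0, 0.0], 0)]
loss_mixed = combined_loss(ori, gen, w_gen=0.75)
```

Setting `w_gen=0.0` reduces the expression to training on the original data only, and `w_gen=1.0` to training on the generated data only.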

Conclusion
In this work, we presented a novel method for transforming random forests into neural networks. Instead of a direct mapping, we introduced a process for generating data from random forests by analyzing the decision boundaries and guiding the routing of data samples to selected leaf nodes. Based on the generated data and corresponding labels, a network is trained that imitates the random forest. Experiments on several real-world benchmark datasets demonstrate that NRFI is capable of learning the decision boundaries very efficiently. Compared to state-of-the-art methods, the presented implicit transformation significantly reduces the number of parameters of the networks while achieving the same or even slightly improved accuracy due to better generalization. Our approach scales very well and is able to imitate highly complex classifiers.
The routing is guided based on the weights for the target class in the left child node, $w_{left} = w_t^{c_{left}(n)}$, and in the right child node, $w_{right} = w_t^{c_{right}(n)}$:
$\hat{w}_{left}, \hat{w}_{right} \leftarrow$ normalize $w_{left}$ and $w_{right}$
$n_{next} \leftarrow$ randomly select the left or right child node with probability $\hat{w}_{left}$ and $\hat{w}_{right}$, respectively
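The routing step can be sketched in a few lines. The function name `choose_child` and the fallback to equal probabilities when both weights are zero are assumptions for illustration; the text only specifies the normalization and the random selection.

```python
import random


def choose_child(w_left, w_right, rng):
    """Select 'left' or 'right' with probability proportional to the weights."""
    total = w_left + w_right
    if total <= 0:
        # Assumption: without any weight for the target class, pick a side at random.
        p_left = 0.5
    else:
        # Normalization: the two probabilities sum to one.
        p_left = w_left / total
    return "left" if rng.random() < p_left else "right"


rng = random.Random(42)
# With weights 3 and 1, roughly three out of four draws should go left.
draws = [choose_child(3.0, 1.0, rng) for _ in range(10000)]
fraction_left = draws.count("left") / len(draws)
```

Repeating this choice from the root down to a leaf yields a path that favors branches in which the target class is well represented.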

Following Fernández-Delgado et al. (2014), each dataset is split into a training and a test set using a 50/50 split while maintaining the label distribution. Afterward, the number of training examples is limited to $n_{limit}$ examples per class. We evaluate the training with 5, 10, 20, and 50 examples per class. In contrast to Fernández-Delgado et al. (
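The split and the per-class limit described in this protocol can be sketched with the standard library only. The helper names `stratified_half_split` and `limit_per_class` are hypothetical, and the handling of classes with an odd number of examples (the extra example goes to the test set) is an assumption.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, rng):
    """Split indices 50/50 per class so the label distribution is maintained."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        half = len(idxs) // 2  # assumption: an odd remainder goes to the test set
        train.extend(idxs[:half])
        test.extend(idxs[half:])
    return train, test


def limit_per_class(indices, labels, n_limit):
    """Keep at most n_limit indices of every class, in the given order."""
    counts = defaultdict(int)
    kept = []
    for idx in indices:
        if counts[labels[idx]] < n_limit:
            counts[labels[idx]] += 1
            kept.append(idx)
    return kept


# Toy labels: 40 examples of class 0 and 20 examples of class 1.
toy_labels = [0] * 40 + [1] * 20
train_idx, test_idx = stratified_half_split(toy_labels, random.Random(0))
small_train = limit_per_class(train_idx, toy_labels, n_limit=5)
```

The limited training set is then used to fit the classifiers, while the test set stays at its full size.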

Figure 3: Test accuracy depending on the network architecture (i.e., number of neurons in both hidden layers). Different datasets are shown per row, with an increasing number of training examples per class from left to right (indicated in parentheses). The red dashed line shows the accuracy of the random forest. NRFI with generated data is shown in orange and NRFI with generated and original data in blue. With increasing network capacity, NRFI is capable of imitating and even outperforming the random forest.
Overall, NRFI trained with generated and original data reaches state-of-the-art performance (50 samples per class) or outperforms the other methods (5, 10, and 20 samples per class).

NETWORK PARAMETERS
Finally, we analyze the number of parameters of the generated networks in detail. The results are shown in Table 2. Current state-of-the-art methods directly map random forests into neural networks. The number of parameters of the resulting network is evaluated on all datasets with different numbers of training examples. The overall performance is shown in the last column. Due to the stochastic process when training the random forests, the results can vary marginally.

Figure 5: Probability distribution of the predicted confidences for different data generation settings on Soybean with 5 (top) and 50 samples per class (bottom). Generating data with different numbers of decision trees is visualized in the left column. Additionally, a comparison between random sampling (red), NRFI uniform (orange), and NRFI dynamic (green) is shown in the right column. By optimizing the decision tree sampling, NRFI dynamic automatically balances the confidences and generates the most diverse and evenly distributed data.

Figure 6: Analyzing the influence of training with original data, NRFI data, and combinations of both for different numbers of samples per class. Using only NRFI data ($w_{gen}$ = 100%) achieves better results than using only the original data ($w_{gen}$ = 0%) for fewer than 50 samples per class. Combining the original data and generated data improves the performance.
Deep neural networks have been shown to be capable of fitting random labels and memorizing the training data. Bornschein et al. (2020) analyze the performance across different dataset sizes. Olson et al. (2018) evaluate the performance of modern neural networks using the same test strategy as Fernández-Delgado et al. (

Neural random forest imitation enables an implicit transformation of random forests into neural networks. Usually, data samples are propagated through the individual decision trees and the split decisions are evaluated during inference. We propose a method for generating input-target pairs by reversing this process and training a neural network that imitates the random forest. The resulting network is much smaller than those produced by current state-of-the-art methods, which directly map the random forest.
These techniques, however, are only applicable to trees of limited depth. As the number of nodes grows exponentially with the increasing depth of the trees, inefficient representations are created, causing extremely high memory consumption. In this work, we address this issue by proposing an imitation learning-based method that results in much more

Table 1: Average test accuracy [%] and standard deviation on all nine datasets for different numbers of training examples per class. The overall performance of each method is summarized in the last column. The best methods are highlighted in bold.

Massiceti et al. (2017), and Welbl (joint-full) generate networks with around 980 000 parameters on average. Of the four variants proposed by Welbl, joint training has a slightly smaller number of parameters compared to independent training because of shared neurons in the output layer. Network splitting proposed by Massiceti et al. (2017) maps multiple subtrees while sharing common split nodes and reduces the average number of network parameters to 748 000. Using sparse network architectures further reduces the number of network parameters to about 142 000; however, this requires a special implementation for sparse matrix multiplication. All of the methods show a drastic increase with the growing complexity of the classifiers. Sethi, for example, generates networks with 374 000 parameters when training with 5 examples per class. The average number of network parameters increases to 1.9 million when training with 50 examples per class.

Table 2: Comparison to state-of-the-art methods. For each method, the average number of parameters of the generated neural networks is shown. While achieving the same or even slightly better accuracy, neural random forest imitation generates much smaller models, enabling the mapping of complex random forests.

Table 3: Imitation learning performance (in accuracy [%]) of different data sampling modes on Soybean. NRFI achieves better results than random data generation. When optimizing the selection of the decision trees, the performance is improved due to more diverse sampling.