
1 Introduction

Predictive modelling tasks are often addressed in both the batch and online learning settings, as predictive models provide the highly desirable ability to predict values for new, unseen examples, potentially bypassing the need to perform time-consuming and/or costly measurements. Less often considered, particularly in the online learning setting, are the related feature ranking tasks. In feature ranking for a predictive modelling task, such as classification, regression or more complex structured output prediction tasks, we wish to determine which of the descriptive variables, i.e., features, are most important for the predictive modelling task at hand. By including only the most informative features, we can reduce the need for computational resources as well as eliminate the need for and cost of measuring less informative features.

A feature ranking method is thus closely related to the underlying predictive task. In predictive modelling, a model is learned using incoming examples to best predict one or more target values, i.e., to generalize the dependence between descriptive and target values. In feature ranking, however, the model that is learned is tasked with ranking the descriptive features in terms of their importance for accurate prediction in the underlying predictive modelling task. Ideally, features that have a higher impact on the prediction should be ranked higher than those with lesser impact. Thus a feature ranking method is a learning procedure that produces a ranking based on the available data examples; in the online learning setting, this process is continuous and the ranking can change with time as potential drift occurs.

Formally, a feature ranking is a list of all features ordered according to their informativeness (importance for predictive modelling), i.e., starting with the most informative feature and ending with the least informative. However, methods for feature ranking often produce a more informative result, where each feature is assigned a numeric score estimating its importance. A ranking can be trivially obtained by sorting the features according to their scores.
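For illustration, a minimal Python sketch of this sorting step, with entirely hypothetical feature names and scores:

```python
# Hypothetical per-feature importance scores produced by a feature ranking method.
scores = {"hr": 2.31, "temp": 1.78, "hum": 1.05, "holiday": 0.12}

# A ranking is obtained by sorting the features by decreasing score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['hr', 'temp', 'hum', 'holiday']
```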

In the batch learning setting, a plethora of methods for feature ranking is available across a variety of predictive modelling tasks, both simple, like classification and regression, and structured output prediction tasks, such as multi-label classification, multi-target regression, etc. In the online learning setting, however, fewer methods for feature ranking exist, all of which focus exclusively on simple predictive modelling tasks.

A common approach in structured output prediction tasks is to decompose the problem into multiple simple (single-target) sub-problems, e.g., multiple binary classification sub-problems in multi-label classification or multiple single-target regression sub-problems in multi-target regression. Each sub-problem is then addressed using a simple single-target predictor and the predictions of all of these models are then used to solve the original structured problem.

Methods that address the structured problem in its entirety have been shown to have various advantages over this local decomposition approach, but in feature ranking they provide an additional benefit. Applying the local approach to feature ranking would yield a separate ranking for each of the targets. While these could be combined using various aggregation approaches, e.g., averaging, this introduces a non-trivial facet into the feature ranking procedure. Thus, we focus on approaches that consider the complex task as a monolith, i.e., without attempting to decompose it into smaller sub-problems.

In this paper we introduce the symbolic random forests with iSOUP-Trees (iSOUP-SymRF) feature ranking method, which utilizes the structure of a random forest of trees to produce the ranks of the observed features. While initially targeted at feature ranking for online multi-target regression, due to the versatility of the base iSOUP-Tree [15] method, the method we propose can be easily extended to other online structured output prediction tasks, such as online multi-label classification [14] or hierarchical multi-target regression [16], as well as to other learning contexts, such as semi-supervised learning [17]. To the best of our knowledge, this paper is the first effort towards online feature ranking for any structured output prediction task.

The rest of this paper is structured as follows. Section 2 presents relevant related work and Sect. 3 introduces the symbolic random forest approach for online feature ranking. Section 4 continues by describing the experimental setup that we use to evaluate the proposed method, while Sect. 5 presents the experiments’ results. Finally, Sect. 6 concludes the paper with a summary of the findings and presents avenues for further work.

2 Related Work

In the batch learning setting, there is a variety of feature ranking methods for the classification and regression tasks [26]. Methods for structured output prediction tasks, such as multi-target regression [21], are rarer. However, for the related task of feature selection, where a set of features needs to be selected, but not necessarily ranked, several methods for multi-label classification are available [19].

Feature ranking is not often addressed as a standalone task in the online setting. It is commonly subsumed under the name of feature weighting, as part of a classification method that weighs the input features [3, 4, 8, 23, 28]. Perkins et al. [20] introduced the grafting method, which combines multiple types of regularization to estimate the importances of features and uses a logistic function with the binomial negative log-likelihood loss to calculate the probabilities of class presence. More recently, Razmjoo et al. [22] rank features based on a sensitivity analysis of the performance of a classifier under potential feature removal.

Several methods that do address online feature ranking have been proposed. Katakis et al. [11, 12] introduce a feature-based classifier that uses a system for incremental feature selection (IFS) and explore how IFS impacts the predictive performance of simple online classification methods, such as naïve Bayes. Another method that specifically addresses online feature ranking is I-RELIEF [27], which stands for iterative RELIEF and is an adaptation of the Relief [13] method for batch feature ranking to the online learning setting. Both of these methods operate in the online predictive modelling scenario.

On the other hand, Yoon et al. [29] introduced a method for online feature selection that is unsupervised, i.e., it is not directly tied to a predictive modelling scenario. Their method utilizes the CLeVer method for principal component analysis. Recently, Duarte et al. [5] introduced methods for online feature ranking, designed specifically for methods that use the Hoeffding inequality and used them with AMRules [6], while Karax et al. [10] address the feature ranking for online classification by exploiting heuristic information of decision trees.

Other examples of online feature ranking come from related fields, such as computer vision [2] and online image retrieval [9].

3 Symbolic Feature Ranking with Random Forests

In this paper, we adapt the symbolic approach to feature ranking with random forests to the online learning setting. This method was first introduced in the batch learning setting by Petković et al. [21], who introduce several feature ranking methods for multi-target regression based on tree ensembles. In addition to the symbolic ranking with random forests, the authors introduce the Genie3 ranking method, which calculates the feature importance scores based on the heuristic scores produced by the split nodes in the ensemble members, as well as the random forest score feature ranking method, which calculates the scores of the features by looking at the out-of-bag errors and feature value permutations. Note that the Genie3 method employs a scoring approach similar to that of Karax et al. [10].

Of these three approaches, only the symbolic random forest ranking method is directly applicable to the online learning scenario. To calculate the Genie3 feature importance scores, we need access to the splitting heuristic scores, which are easily accessed in the batch scenario. In online learning, the heuristic score of a split is calculated only on a small sample of the data, and is only partially indicative of the feature importance scores on the entire dataset. The random forest scoring method permutes the values of out-of-bag examples for each tree and observes how the error changes from the original, unpermuted example. This requires the permutation of many example values, after which many predictions must be calculated to estimate the error. While this approach could technically be applied to online learning, it would incur high consumption of computational resources, particularly in terms of processing time.

Symbolic random forest feature ranking with iSOUP-Trees (iSOUP-SymRF), however, calculates the feature importance scores using only the structure of trees which are the members of the ensemble. As we are targeting the task of feature ranking for online multi-target regression, we use iSOUP-Trees [15] as a base ensemble model. iSOUP-Tree is a state-of-the-art online learning method that has been applied to a variety of online structured output prediction tasks in addition to multi-target regression, such as multi-label classification [14] and hierarchical multi-target regression [16], thus extending the possible coverage of iSOUP-SymRF to feature ranking for these predictive modelling tasks as well.

Ensemble Construction. As in random forests utilized in batch learning, the main idea is to induce an ensemble of diverse randomized trees. Tree randomization is achieved in two ways, the first of which is example sampling as commonly used in online bagging [18], where each member of the ensemble is updated using a given example for a random number of times drawn from the Poisson distribution. The second way in which the trees are randomized is the selection of feature subspaces in each node when growing the tree. In particular, whenever a new leaf node is constructed, i.e., at the beginning of the learning procedure or when a leaf node is split into two new leaf nodes, a subset of the input features is randomly selected. The new leaf node only considers those input features for ranking split candidates; thus, statistics are recorded only for the selected input features.
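To make the two randomization mechanisms concrete, the following is a minimal Python sketch of the ensemble update step; the `tree.update(example, weight)` call and the leaf-construction hook are hypothetical stand-ins for the corresponding iSOUP-Tree operations:

```python
import math
import random

def poisson(lam=1.0):
    """Sample from a Poisson distribution (Knuth's algorithm)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def leaf_subspace(all_features, subspace_size):
    """Random feature subset considered by a newly constructed leaf; statistics
    are recorded, and split candidates ranked, only for these features."""
    return random.sample(all_features, subspace_size)

def update_ensemble(trees, example):
    """Online bagging: each tree sees the example a Poisson(1)-distributed number of times."""
    for tree in trees:
        weight = poisson(1.0)
        if weight > 0:
            tree.update(example, weight)  # hypothetical incremental iSOUP-Tree update
```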

Feature Score Calculation. iSOUP-SymRF is based on the following observation: if a feature’s values are important for accurate prediction, the feature will get selected in splits often. More accurately, it will get selected often when it can be, as it is not always considered for candidate splits due to the random forest learning process. If the base ensemble were constructed using a regular online bagging approach, the variety among the models would be considerably smaller, as the trees would always be able to select the best feature(s). This would concentrate the scores on only the top features. In random forest tree construction, the best features are sometimes left out of the candidate feature pool, thus allowing the estimation of the importance scores of the remaining (less important) features.

To estimate the importance of a feature we make two observations: (a) the more often a feature appears in the split nodes of the random forest, the more important it is, and (b) the closer to the root of the tree the feature appears, the higher its importance. The first observation follows from the reasoning that, despite the random selection, a feature appearing more often means that splits on this feature increase the predictive performance of the tree, according to standard tree-learning methodology. The second observation is based on the fact that split nodes closer to the tree root affect more examples than those positioned further from the root. For example, a split at the root node will affect all incoming examples, while a split in one of its children will (on average) only affect half of the examples, i.e., the ones for which the root split node directed them toward this particular child and not the other.

Quantitatively, to calculate a feature importance score of a feature A, we first calculate the feature importance score of A for a given ensemble member T, which is defined as

$$ {\text {I}}(A, T) = \sum _{\mathcal {N} \in T(A)} w^{{\text {depth}}(\mathcal {N})}\text {,} $$

where w is a predefined weight and T(A) is the set of all split nodes of tree T which have splits on feature A. The total feature importance of feature A is then

$$ {\text {I}}(A, E) = \frac{1}{|E|}\sum _{T \in E} {\text {I}}(A, T) = \frac{1}{|E |} \sum _{T \in E} \sum _{\mathcal {N} \in T(A)} w^{{\text {depth}}(\mathcal {N})}\text {,} $$

where E is the ensemble of trees. We adapt this method to the online learning setting, as the calculation of the scores is quick: it requires only a traversal of each tree in the ensemble.
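A minimal sketch of this calculation, under the assumption that each split node exposes the feature it splits on and its child nodes (the attribute names below are hypothetical); the default w = 0.5 anticipates the derivation that follows:

```python
def tree_importance(root, feature, w=0.5):
    """I(A, T): the sum of w**depth(N) over all split nodes N of tree T that split on feature A."""
    total, stack = 0.0, [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.is_split:                      # hypothetical: True for internal (split) nodes
            if node.split_feature == feature:  # hypothetical: the feature used in the split
                total += w ** depth
            stack.extend((child, depth + 1) for child in node.children)
    return total

def ensemble_importance(trees, feature, w=0.5):
    """I(A, E): the average of I(A, T) over all trees T in the ensemble E."""
    return sum(tree_importance(tree.root, feature, w) for tree in trees) / len(trees)
```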

As the scores iSOUP-SymRF calculates are exclusively dependent on the structure of the trees of the random forest, no predictions are needed, which significantly reduces the operational time of the method. Notably, the process of calculating the feature ranking is fairly quick and can be executed at any time during the learning process, so the ranking is always available.

What remains is the choice of the weight factor w. When considering its possible values, we note that \(w < 1\) gives higher scores to features which appear closer to the root and, consequently, affect larger parts of the input space. To settle on a particular value of w, we consider the following example. Take a leaf in which the two best features \(A_1\) and \(A_2\) have the exact same heuristic score. In the first case, we split on the first feature, and likewise in the second case, we split on the second feature. Afterwards, in the first case we split both resulting leaves on \(A_2\), and in the second case on \(A_1\) (see Figs. 1a and 1b, respectively).

Fig. 1. Sample trees motivating the selection of the weight parameter w.

In both cases, all example traversal paths include splits on \(A_1\) and on \(A_2\). This implies that \(A_1\) and \(A_2\) should have equal importance scores, as they affect the same sets of examples. Under this assumption it follows that

$$\begin{aligned} {\text {I}}(A_1, T_1)&= {\text {I}}(A_1, T_2) \\ w^d&= w^{d+1} + w^{d+1} \\ 1&= 2 w \\ 0.5&= w\text {,} \end{aligned}$$

where d is the depth of the initial twice-split leaf. Hence, we choose \(w = 0.5\).

In choosing \(w=0.5\), we note that the total contribution of any fully populated level of split nodes in the tree equals 1 and, consequently, the total of all scores in a tree will be roughly equal to its average depth. Thus, the total scores of a tree (and of the random forest) will increase over time as the trees grow.

Parameters. As is standard practice with random forest methods, we define two method parameters for iSOUP-SymRF, ensemble size and subspace size. The first determines the number of base models included in the ensemble, while the second determines how many features are considered as split candidates in each leaf. Ensemble size commonly ranges between 10 and 100, while we use two selection methods for subspace size, either randomly selecting \(1 + \lceil \log {N} \rceil \) or \(1 + \lceil \sqrt{N} \rceil \) features in each leaf, where N is the number of all features.
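As a small sketch, the two subspace-size rules can be written as follows; the natural logarithm is assumed, which is consistent with the subspace sizes reported for the datasets in Sect. 5:

```python
import math

def subspace_size(n_features, mode="sqrt"):
    """Number of features considered as split candidates in each new leaf."""
    if mode == "log":
        return 1 + math.ceil(math.log(n_features))  # 1 + ceil(log N)
    return 1 + math.ceil(math.sqrt(n_features))     # 1 + ceil(sqrt N)

# For example, with N = 12 (the Bicycles dataset) this gives 4 and 5 features, respectively.
```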

Learning Context. iSOUP-SymRF notably does not have an explicit change detection and adaptation mechanism. While trees have a small innate change adaptation ability, by just growing additional nodes that adhere to the new concept, this is likely not enough to capture the drift in a reasonable time frame. Thus, in the context of this paper we consider learning (and experiments) in the static context, i.e., we assume no concept drift in the data stream.

4 Experimental Setup

4.1 Datasets

In the interest of brevity, we have selected three multi-target regression datasets, based on their size, primarily looking for diversity in the number of input features. A summary of the datasets and their properties is shown in Table 1, while brief descriptions of the datasets are provided below.

Table 1. Datasets used for online feature ranking for multi-target regression.

The Bicycles dataset is concerned with the prediction of demand for rental bicycles on an hour-by-hour basis [7]. The three targets represent the number of casual (non-registered) users, the number of registered users and the total number of users for a given hour, respectively.

The SCM1d and SCM20d datasets are derived from the Trading Agent Competition in Supply Chain Management (TAC SCM) conducted in July 2010 [25]. The data examples correspond to daily updates in a tournament – there are 220 days in each game and 18 games per tournament. The 16 targets are the predicted next-day mean prices (SCM1d) and the predicted 20-day mean prices (SCM20d) of the 16 products in the simulation.

The Bicycles dataset is available at the UCI Machine Learning Repository, while the SCM1d and SCM20d datasets are available at the Mulan multi-target regression dataset repository.

Note that these datasets do contain drift, even though iSOUP-SymRF is not equipped with explicit change detection and adaptation mechanisms. As this poses an additional challenge to properly estimating the feature importances, the obtained results are pessimistic estimates of the method’s performance in the static context we presuppose in our experiments. Performance on static data streams (or, more realistically, in periods without drift) would thus likely be better than what is shown in the results of our experiments.

4.2 Experiment: Parameter Stability

In this experiment, we explore which parameter settings produce good rankings; in particular, we are interested in parameter configurations that produce stable rankings that “converge” fairly quickly. Notably, this is not a concern in the batch learning scenario, where the learned ranking is static. We perform a small-scale grid search on the two parameters, ensemble size and subspace size. We consider ensembles of sizes 10, 20, 50 and 100, as well as subspaces of logarithmic and square root size. In terms of resource consumption, smaller ensemble and subspace sizes are naturally preferable; thus, the smallest parameter configuration (10 models with logarithmic subspaces) serves as the baseline in this experiment.

To evaluate the rate of convergence, we define the time to final ranking (TTF), i.e., the number of examples it takes for the ranking to reach the same order as the final ranking. Thus, lower values of TTF are more desirable. Notably, TTF considers the ranking only in terms of the feature ranks, ignoring the finer detail of the scores themselves. Furthermore, TTF is not particularly well suited to evaluating rankings on data streams that exhibit drift, as the ranking is likely to fluctuate any time drift occurs. As TTF considers all features equally (according to their rank), it can take large values due to changes in the tail end of the feature ranking. As we are generally more interested in the top ranked features, we also define TTF\(_{n}\), which only considers the top n features. We will then observe TTF\(_5\) to determine how quickly the top five features settle.

Notably, these measures only make sense in a static context. Were drift to occur, the actual importances would possibly get rearranged and the estimated importances would again take time to converge. In the drifting scenario, observation of these two measures would only make sense between drift points.
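The following sketch shows how TTF and TTF\(_n\) can be computed offline from a recorded sequence of rankings, one per processed example (the list `rankings` is a hypothetical record of those rankings); we read the definition as the earliest point from which the ranking, or its top-n prefix, matches the final ranking and no longer changes:

```python
def time_to_final(rankings, top_n=None):
    """TTF (or TTF_n when top_n is given): the number of examples needed for the
    ranking (or its top-n prefix) to settle into the final ranking."""
    prefix = slice(None) if top_n is None else slice(top_n)
    final = rankings[-1][prefix]
    ttf = len(rankings)
    # Walk backwards from the end; stop at the first deviation from the final ranking.
    for i in range(len(rankings) - 1, -1, -1):
        if rankings[i][prefix] != final:
            break
        ttf = i
    return ttf

# ttf, ttf5 = time_to_final(rankings), time_to_final(rankings, top_n=5)
```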

4.3 Experiment: Ranking Utility

To estimate the utility of the rankings obtained by iSOUP-SymRF, we use forward feature addition (FFA) [24]. Forward feature addition is performed by observing the performance of a predictive model, while adding features from best to worst. In particular, we first observe the performance of a model learned only using the best ranked feature, then using the two best features, then the three best features, etc.

In the batch learning scenario, observing the performances of the models in these settings is fairly simple, as the performance of a model can be easily expressed as a single number, e.g., the root mean squared error over a test set. In the online learning setting, such a single-number summary is less informative. Thus, in this paper we examine the progression of the performance (in terms of error) of the observed models.

Furthermore, even though iSOUP-SymRF can produce feature rankings at any point in the learning process (rankings can change throughout the process), for FFA we need to consider only one static ranking. While this is not ideal, demonstrating the utility of the rankings is notoriously difficult, and using an imperfect method to directly show the impact on the learning process still provides considerable insight into the applicability of the proposed method. To this end, we take the final feature ranking obtained by iSOUP-SymRF, i.e., the ranking after all of the examples have been processed. This biases the results towards optimism, as in a practical use scenario this information would not be available during learning.
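A sketch of the FFA loop, assuming a hypothetical `evaluate_prequential(model, stream, features)` helper that trains the given online model on the stream restricted to the selected features and returns its error progression:

```python
def forward_feature_addition(ranking, stream, model_factory, evaluate_prequential, k=5):
    """Evaluate models trained on the top-1, top-2, ..., top-k ranked features."""
    curves = {}
    for n in range(1, k + 1):
        top_features = ranking[:n]  # top-n features from the final iSOUP-SymRF ranking
        model = model_factory()     # a fresh iSOUP-Tree or AMRules model
        curves[n] = evaluate_prequential(model, stream, top_features)
    return curves
```

The complementary models trained on all but the n lowest-ranked features (also reported in Sect. 5) can be obtained analogously with `ranking[:-n]`.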

In our case, we select two methods for online multi-target regression to evaluate the rankings obtained by iSOUP-SymRF. In particular, we use a single iSOUP-Tree [15] and AMRules [6], with default learning parameters. As iSOUP-Tree is the method that iSOUP-SymRF is based on, it is natural to expect that this combination will yield better results than when combining iSOUP-SymRF with AMRules. To estimate the error of these models we use the average relative mean absolute error (\(\overline{{\text {RMAE}}}\) or \({\text {aRMAE}}\)) [15]:

$$\begin{aligned} \overline{{\text {RMAE}}} = \frac{1}{M} \sum _{j=1}^M {\text {RMAE}}_j \end{aligned}$$

where M is the number of targets and \({\text {RMAE}}_j\) is the relative mean absolute error of the j-th target, defined as

$$\begin{aligned} {\text {RMAE}}_j = \frac{\sum _{i=1}^n |y_i^j - \hat{y}_i^j|}{\sum _{i=1}^n |y_i^j - \bar{y}^j(i)|} \end{aligned}$$

where \(y_i^j\) and \(\hat{y}_i^j\) are the values of target j for data example i, real and predicted by the evaluated model, respectively, while \(\bar{y}^j(i)\) is the average of the seen values of the j-th target so far.
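A sketch of the \(\overline{{\text {RMAE}}}\) computation over a completed run, assuming `y_true` and `y_pred` are lists of per-example target vectors; taking the running mean to be 0 for the very first example is a simplifying assumption:

```python
def armae(y_true, y_pred):
    """Average RMAE over targets: RMAE_j = sum_i |y_i^j - yhat_i^j| / sum_i |y_i^j - ybar^j(i)|,
    where ybar^j(i) is the mean of the previously seen values of target j."""
    n, m = len(y_true), len(y_true[0])
    num, den, running = [0.0] * m, [0.0] * m, [0.0] * m
    for i in range(n):
        for j in range(m):
            mean_so_far = running[j] / i if i > 0 else 0.0  # running-mean baseline
            num[j] += abs(y_true[i][j] - y_pred[i][j])
            den[j] += abs(y_true[i][j] - mean_so_far)
            running[j] += y_true[i][j]
    return sum(num[j] / den[j] for j in range(m)) / m
```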

As some datasets have many features, we limit ourselves to reporting FFA plots only for the 5 top and 5 bottom features. Using a good ranking, adding the top features in FFA should considerably increase model performance, while adding bottom features should barely affect performance (or even worsen it).

Table 2. The results in terms of TTF and TTF\(_5\).

5 Results

5.1 Parameter Stability

The results of the parameter stability experiment are presented in Table 2. The winning results in terms of ensemble size with regard to each of TTF and TTF\(_5\) are presented in bold text, while the winning results in terms of subspace size are underlined (a dotted underline indicates a tie).

In terms of TTF, a square root subspace size is generally preferred, while in terms of TTF\(_5\) no particular generalization can be made, as the logarithmic and square root subspace sizes each win 6 of the contests. For the Bicycles dataset, we observe that the TTF\(_5\) values are generally considerably lower than the TTF values, indicating that the top features converge faster, while the ranks of the remaining features continue to get perturbed. Even though the Bicycles dataset exhibits strong seasonal effects (i.e., drift), the top ranked features appear not to change between seasons, as they stabilize early in the data.

On the other hand, for the SCM1d and SCM20d datasets the TTF\(_5\) values are much closer to the TTF values. This indicates that the rankings remain turbulent throughout the entirety of these datasets. Any obtained final rankings are thus possibly not close to converging to the actual underlying feature importances.

The results are even less clear regarding the preferred ensemble size. Larger ensembles produce better results on the Bicycles dataset, while smaller ensembles perform better on the SCM1d and SCM20d datasets, with the exception of TTF\(_5\) for 100 trees with a logarithmic subspace size.

A key factor that may influence these results is the total number of features. In the Bicycles dataset, which is smaller with a total of 12 features, any feature is likely to get selected as a split candidate, as the subspace size is relatively large compared to the total number of features (4 vs 12 features for logarithmic, and 5 vs 12 for square root size). This makes it easier to identify the key features in this case, resulting in low TTF\(_5\) values. On the other end, SCM1d has 280 features, and the subspaces are of sizes 7 and 18 for logarithmic and square root size, respectively. While such small subspaces are desirable in the context of random forests to reduce resource consumption, they make it more difficult to identify the top features using a random forest based feature ranking method such as iSOUP-SymRF.

Fig. 2. Feature score progression on the Bicycles dataset in terms of ensemble size. All plots show square root subspace size.

In this context, we can also examine the actual scores and their progression through the learning process. With only 12 features, we can observe the feature scores on the Bicycles dataset directly, as seen in Fig. 2. In addition to the score progression, the plots also show the TTF and TTF\(_5\) points. Over all ensemble sizes, iSOUP-SymRF identifies ‘hr’, ‘hum’, ‘workingday’, ‘temp’, ‘atemp’, ‘weekday’ and ‘weathersit’ as the top features, though there is some disagreement about their final ranks. Here, we can see the decreased disambiguation power of smaller ensemble sizes, particularly in the case of 10 and 20 trees. In these cases, the scores indicate the best features, but their ordering takes longer to establish. Larger ensembles, on the other hand, establish the order fairly quickly, and it (for the most informative features) then remains stable. Notably, the number of top features observed, i.e., the 5 in TTF\(_5\), also impacts the results. Clearly, choosing a lower number lowers the value of the measure, but in some cases we could also have observed a larger number of top features while losing little confidence, i.e., TTF\(_7\) would be the same as TTF\(_5\) in the case of 100 trees.

Ultimately, this experiment does not conclusively indicate which parameter choices to make. While square root, i.e., larger, subspace sizes seem to be preferred, no clear statement can be made about ensemble size. Thus, motivated by the Bicycles dataset example above, we choose the square root subspace size with an ensemble size of 100 for our further experiments.

5.2 Ranking Utility

The results using the FFA methodology are presented in Fig. 3. The plots are interpreted in the following way: lines labeled with positive numbers \(n \in \{1, 2, 3, 4, 5\}\) depict the \(\overline{{\text {RMAE}}}\) of models trained on the top n features; lines labeled with negative numbers \(n \in \{-5, -4, -3, -2, -1\}\) depict the \(\overline{{\text {RMAE}}}\) of models trained with all but the last |n| features.

Fig. 3. \(\overline{{\text {RMAE}}}\) of iSOUP-Tree and AMRules using FFA.

In terms of the top features, we wish to see the largest increase when adding higher ranked features, e.g., the increase in performance should be larger when we add the second best feature than when we add the third best. All iSOUP-Tree models exhibit this behaviour, as do the AMRules models, with the exception of the Bicycles dataset. This exception is most likely due to the change detection and adaptation mechanism of AMRules, which significantly modifies the model during learning.

Conversely, regarding the bottom ranked features, the desired effect is either a trivial improvement or even a decrease in performance. This is shown for all datasets and methods. In the case of the iSOUP-Tree models, for example, we can see that, on both the Bicycles and SCM1d datasets, adding features toward the bottom actually hurts the overall performance of the model, i.e., all negatively labeled lines are above (have higher errors than) the highest positively labeled line. On the SCM20d dataset, the addition of the bottom features has only a marginal effect. These results are qualitatively mirrored by the AMRules models, though the effect sizes are considerably different. On the Bicycles dataset, adding the lowest ranked features is either detrimental or has little to no effect on the error of the model, while on the SCM1d and SCM20d datasets we observe the same behaviour as with iSOUP-Tree, except that the decrease in performance on the SCM1d dataset is considerably larger, whereas on the SCM20d dataset only slight increases are observed compared to those of the iSOUP-Tree models.

Naturally, using a tree-based feature ranking method such as iSOUP-SymRF provides good results when used alongside another tree-based method such as iSOUP-Tree. However, iSOUP-SymRF also shows encouraging (if worse) results with the different learning paradigm of AMRules. Notably, these experiments also indicate that some of the features included in these datasets can be actively detrimental when learning with these two methods.

6 Conclusions and Further Work

In this paper, we have introduced a novel method for feature ranking for online multi-target regression called iSOUP-SymRF. It utilizes a random forest of iSOUP-Trees to determine the feature importance scores (and consequently the feature ranking), based on the features’ appearance in the split nodes of the trees in the forest. We have conducted experiments on a collection of multi-target regression datasets, aiming to (a) determine the method’s stability with respect to the values of its parameters and (b) show that the obtained rankings have utility for increasing the predictive performance of online multi-target regressors.

Our experiments first focused on determining the method’s stability over various parameter values. While the experiments were not fully conclusive, we suggest the use of larger subspace sizes (such as square root), while the choice of random forest ensemble size should be further analyzed. The experiments that seek to show the utility of feature rankings obtained with iSOUP-SymRF show promising results using two different learning methods for online multi-target regression, iSOUP-Tree and AMRules. As expected, the results were better for the related tree-based iSOUP-Tree method, though the feature ranking obtained by iSOUP-SymRF still provided ample utility in terms of predictive performance improvement when used in combination with AMRules.

We identify three key avenues for further work. The main effort will be to equip iSOUP-SymRF with a change detection and adaptation mechanism, e.g., ADWIN [1]. This will allow us to more accurately capture the evolution of the feature rankings, especially in the presence of concept drift. Another avenue is the adaptation of iSOUP-SymRF to feature ranking for other online structured output prediction tasks, such as online multi-label classification and/or hierarchical multi-target regression, by utilizing the broad coverage of the base iSOUP-Tree method [14, 16]. This would also allow a comparison of a wider variety of learners, as methods for online multi-label classification are quite plentiful. Finally, we wish to explore and improve the experimental setup, focusing on better and more concise approaches for the evaluation of feature rankings in the online learning setting.