Abstract
We address the task of multi-target regression, where we generate global models that simultaneously predict multiple continuous variables. We use ensembles of generalized decision trees, called predictive clustering trees (PCTs), in particular bagging and random forests (RF) of PCTs and extremely randomized PCTs (extra-PCTs). We add another dimension of randomization to these ensemble methods by learning individual base models that consider random subsets of target variables, while leaving the input space randomizations (in RF PCTs and extra-PCTs) intact. Moreover, we propose a new ensemble prediction aggregation function, where the final ensemble prediction for a given target is influenced only by those base models that considered it during learning. An extensive experimental evaluation on a range of benchmark datasets has been conducted, where the extended ensemble methods were compared to the original ensemble methods, individual multi-target regression trees, and ensembles of single-target regression trees in terms of predictive performance, running times and model sizes. The results show that the proposed ensemble extension can yield better predictive performance, reduce learning time or both, without a considerable change in model size. The newly proposed aggregation function gives the best results when used with extremely randomized PCTs. We also include a comparison with three competing methods, namely random linear target combinations and two variants of random projections.
1 Introduction
Supervised learning is a highly active and researched area of machine learning. Its goal is to produce a model that can take a previously unseen example and predict the value of a variable of interest, typically called a target variable. If the target variable is of a discrete type, the task at hand is classification. If the target variable is of a numeric data type, the task is called regression. Such single-target (ST) prediction scenarios are very common.
A number of challenges from various domains require a more complex representation of the data. In those cases, we need to move away from generating models that make predictions for one target variable to models that make predictions for multiple targets simultaneously, i.e., address the task of multi-target (MT) prediction. In general, MT prediction falls under the scope of structured output prediction (SOP). SOP, as the name suggests, is concerned with predicting values of structured data types, which are composed of values of primitive data types, e.g., boolean, real numbers, or discrete values (Panov et al. 2016). Examples of structured data types are tuples, sequences, sets, tree-shaped hierarchies, directed acyclic graphs, etc. (Džeroski 2007). Examples of SOP tasks are multi-target regression (MTR), (hierarchical) multi-label classification (MLC) and time series prediction. Solving SOP tasks has great potential and importance in many domains, and has been listed as one of the most challenging problems in machine learning by Yang and Wu (2006) and Kriegel et al. (2007).
This work considers the task of MTR—predicting multiple continuous variables. Many real-life scenarios exist where one is interested in predicting multiple numerical values, e.g., in ecology (Demšar et al. 2006; Stojanova et al. 2010) and life sciences (Jančič et al. 2016). MT prediction methods differ mostly in the way they exploit the target space structure while learning a predictive model. The most natural and simple starting point is to make a model for each component of the structure separately. The models predicting the individual components are then combined to make predictions for the whole structure. Such methods are called local, because they learn a local model for only one component at a time, whilst ignoring the other components (i.e., the global context). Hence, local methods cannot exploit information hidden in the combination of multiple components and the relationships between them. In contrast to local methods, global methods take into account all the structural components and their relations and then make predictions for all of them simultaneously. In general, this makes global models more interpretable. Global models (as well as the process of learning them) are also more computationally efficient as compared to local ones. This becomes especially evident when the predicted structure consists of many components. Learning global models can therefore yield better predictive performance while consuming fewer resources.
In this paper, we propose a new ensemble extension method for the task of MTR, called Random Output Selections (ROS). The method uses predictive clustering trees (PCTs) as base models in the ensembles. PCTs, a generalization of decision trees, are global models for SOP able to solve MTR and MLC tasks, among others (Blockeel et al. 1998). The proposed method can be coupled with any ensemble learning method that employs global decision trees as base learners. The method learns each base model on a random subset of all target variables (each model has its own subset of variables). In this work, we apply ROS to three different ensemble learning methods: bagging (Bag) (Breiman 1996; Kocev et al. 2013), random forests (RF) (Breiman 2001; Kocev et al. 2013) and extremely randomized trees (ET) (Geurts et al. 2006; Kocev and Ceci 2015). Analogously, we refer to the extended methods as Bag-ROS, RF-ROS and ET-ROS.
The main focus of this study is to determine whether the proposed method can improve the predictive performance and shorten the learning times of the considered ensemble methods. An extensive empirical evaluation over a variety of benchmark datasets is performed in order to determine the effects of using ROS on predictive performance. In addition, we perform an analysis with respect to time and space complexity.
We summarize the main contributions of this work as follows:

A novel global ensemble extension approach for addressing the MTR task. It randomly selects subsets of targets for learning individual base models. This can yield better predictive performance and shorter learning times.

A novel function for aggregating the predictions of the base models in the extended ensembles. By default, each tree in the ensemble predicts all targets. Here, only targets that were considered during the learning of an individual tree contribute to the final predictions.

An extensive empirical evaluation of the three different ensemble methods on 17 benchmark datasets, which provides a performance assessment for the original and extended ensemble methods, as well as individual multi-target regression trees and ensembles of single-target regression trees (all over a range of ensemble sizes). The study also includes parameter setting recommendations for the proposed method. Moreover, we compare the performance to other competing methods that also transform the output space.

Theoretical computational complexity analysis of the proposed ensemble extension method, linked to the empirical evidence of the aforementioned evaluation.
The remainder of this paper is organized as follows. Section 2 outlines the task definition and related work. Next, Sect. 3 presents the proposed method ROS. Section 4 then provides details about the experimental design and evaluation, the results of which are presented and discussed in Sect. 5. Finally, we conclude the paper and provide directions for further work.
2 Background and related work
2.1 Task definition
In this section, we formally describe the machine learning task of MTR. Given:

An input space X, with tuples of dimension d, containing values of primitive data types, i.e., \(\forall x_{i} \in X, x_{i} = (x_{i_{1}},x_{i_{2}}, \ldots ,x_{i_{d}})\),

An output (target) space Y, with tuples of dimension t, containing real values, i.e., \(\forall y_{i} \in Y, y_{i} = (y_{i_{1}},y_{i_{2}}, \ldots ,y_{i_{t}})\), where \(y_{i_{k}} \in \mathbb {R}\) and \(1 \le k \le t\)

A set of examples S, where each example is a pair of tuples from the input and the output space, i.e., \(S = \{(x_{i}, y_{i}) \mid x_{i} \in X, y_{i} \in Y, 1 \le i \le N \}\) and N is the number of examples in S (\(N = |S|\)),

A quality criterion c, which rewards models with high predictive accuracy and low complexity.
Find: A function \(f:X \rightarrow Y\) such that f maximizes c.
In this work, the function f is represented by an ensemble (set) of PCTs. It will be learned by the approaches of bagging (Bag), random forests (RF) and extra-PCTs (ET) and their ROS counterparts Bag-ROS, RF-ROS and ET-ROS.
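To make the data shapes in the task definition concrete, the MTR setting can be sketched as follows. This is a minimal illustration assuming NumPy; the least-squares linear predictor merely stands in for the ensembles of PCTs used in this paper, and all names and sizes are ours:

```python
import numpy as np

# Hypothetical MTR setup: N examples, d-dimensional inputs,
# t real-valued targets (all sizes are illustrative).
N, d, t = 100, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                 # input space: tuples of dimension d
W = rng.normal(size=(d, t))
Y = X @ W + 0.1 * rng.normal(size=(N, t))   # output space: tuples of dimension t

def f(x):
    """A global model f: X -> Y that predicts all t targets at once;
    here a least-squares linear map, standing in for an ensemble of PCTs."""
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return x @ coef

print(f(X).shape)  # -> (100, 3): one prediction tuple of dimension t per example
```

A local approach would instead fit t separate single-target models, one per column of Y.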
2.2 Related work
Our work relates to three main areas: solving the task of MTR, ensemble learning and output space decomposition. Multi-target regression, also referred to as multivariate or multi-response regression, is a machine learning task where the goal is to predict multiple real-valued variables simultaneously. Borchani et al. (2015) divide multi-target regression methods into two categories: problem transformation methods and algorithm adaptation methods.
Problem transformation methods include the work of Spyromitros-Xioufis et al. (2016) on single-target approaches, multi-target regressor stacking and regressor chains, the work of Zhang et al. (2012) on multi-output support vector regression, and the work of Tsoumakas et al. (2014) on random linear target combinations. These methods transform the output space in such a way that existing methods can be applied to solve the task at hand. The transformation process usually converts a multi-target problem into several single-target ones, thus approaching the MTR problem locally. However, some transformation methods solve the MTR problem locally but still include multiple targets in the learning process, which makes them neither fully local nor fully global, e.g., Tsoumakas et al. (2014).
Algorithm adaptation methods, on the other hand, can handle multi-target tasks naturally, i.e., no transformation of the data is needed. Such methods are global. They can exploit the potential relatedness of the targets to learn models with better predictive performance faster than problem transformation methods. Algorithm adaptation methods include statistical methods, such as those by Abraham et al. (2013), Breiman and Friedman (1997) and Izenman (1975), multi-output support vector regression by Xu et al. (2013), Han et al. (2012) and Deger et al. (2012), kernel methods by Alvarez et al. (2012) and Micchelli and Pontil (2004), multi-target regression trees (Kocev and Ceci 2015; Levatić et al. 2014; Appice and Malerba 2014; Kocev et al. 2013; Stojanova et al. 2012; Ikonomovska et al. 2011; Appice and Džeroski 2007) and rule-based methods for MTR by Aho et al. (2012). Due to the plethora of existing methods, we do not discuss all of them here, but briefly describe the ones most closely related to our work.
Predictive clustering trees (PCTs) have been introduced by Blockeel et al. (1998). A PCT is a generalization of a standard decision tree, which can be instantiated to support different tasks of SOP, one of which is the task of MTR. PCTs are decision tree based models that belong to the algorithm adaptation methods group, because they handle the task without transforming the instance space. PCTs are global models, since they give predictions for all targets simultaneously. PCTs are instantiated by two required parameters: the variance and prototype functions. Technically, PCTs perform divisive hierarchical clustering by using the provided variance function. The variance function is used to calculate heuristic scores that guide the learning process until a stopping criterion is met. This eliminates the need for arbitrarily selecting the number of clusters beforehand, as required by traditional clustering methods. When instances are clustered, the prototype function is used to calculate the predictions on all leaf nodes (terminal clusters of the hierarchy). A detailed explanation of PCTs can be found in Sects. 3.1 and 3.2.
Ensembles of PCTs were introduced by Kocev et al. (2007, 2013), specifically bagging and random forests. Kocev and Ceci (2015) later extended extremely randomized trees, initially introduced by Geurts et al. (2006), to structured outputs. They called them extra-PCTs, because they based the implementation on PCTs. Extremely randomized trees select one random split point for each of k predictive attributes at each split. The best performing split point is selected and the process recursively continues. Extremely randomized trees and their multi-output variant (extra-PCTs) are very unstable models, so it only makes sense to use them in an ensemble setting.
Several multi-target prediction approaches that transform the output space do exist, but they mainly focus on the task of MLC. Joly et al. (2014) reduce the dimensionality of the output space by making random projections of it. They make the projections in such a way that the original distances are preserved in the projected space. Their approach relies on the Johnson-Lindenstrauss lemma: if the output space projection matrix satisfies the lemma, the variance computations in the projected space are \(\epsilon \)-approximations of the variance in the original output space. They employ Gaussian, Rademacher, Hadamard and Achlioptas projections to compress the output space. Only the variance calculations are made in the projected space, while the predictions are made directly in the original output space (i.e., no decoding is needed). They use multi-output regression trees to calculate the variances in the projected space and then apply thresholding to obtain predictions for labels (i.e., the MLC setting); they do not report any results for the MTR setting. Our approach is simpler and more straightforward, as it does not perform a transformation of the output space but only takes a subset of it. Joly (2017) proposes a gradient boosting method for MTR that uses random projections of the output space to automatically adapt to the output correlation structure.
Tsoumakas and Vlahavas (2007) propose an ensemble method, RAkEL (Random k-labelsets), for the task of MLC, which is also transformation-based. RAkEL is an ensemble-like wrapper method for solving multi-label classification tasks with existing algorithms for multi-class classification. The ensemble is constructed by providing a small random subset of k labels (organized as a label powerset) to each base model, learned by a multi-class classifier. This introduces an additional step in the prediction phase, because predictions need to be decoded. In addition, RAkEL's computational complexity is high, because the generated output spaces are label powersets and the underlying classification algorithm is a parameter, which can considerably change/worsen the training times (e.g., if SVMs are used instead of ordinary decision trees). This approach has been extended by Szymański et al. (2016), who propose not to use the original random partitioning of subsets as performed by RAkEL, but rather a data-driven approach: they use community detection approaches from social networks to partition the label space, which can find better subspaces than random search.
Madjarov et al. (2016) also use a data-driven approach to solve the task of MLC. They use label hierarchies obtained from hierarchical clustering of flat label sets, using annotations that appear in the training data. Finally, the work of Tsoumakas et al. (2014) considers the MTR task. They use random linear target combinations to enrich the output space by constructing many new target variables: each random combination uses a predefined number of original target variables, and the original output space matrix is transformed by multiplying it with the coefficient matrix consisting of the new combinations.
3 Ensembles for multitarget regression with random output selections
This section introduces the ROS ensemble extension method. We consider the proposed approach a global method that belongs to the algorithm adaptation group of methods. Although ROS uses subsets of the output space during learning, the learned ensemble provides predictions for all target variables simultaneously. We first describe the predictive clustering paradigm and then explain the process of learning a single predictive clustering tree (PCT). Next, we present the proposed method for learning ROS tree ensembles. Finally, we provide a computational complexity analysis of the proposed approach.
3.1 Predictive clustering
The predictive clustering (PC) framework has been introduced by Blockeel (1998). It can be seen as a generalization of supervised and unsupervised learning. The two learning approaches are traditionally considered as two separate machine learning tasks. However, there are supervised methods (e.g., decision trees and rules) that partition the instance space into subsets, which makes it possible to interpret them as clustering approaches. Unsupervised learning groups/clusters examples that are similar according to some distance measure. In supervised learning, the primary goal is to make predictions. The PC framework combines these two approaches.
The PC framework is implemented in the context of decision trees. From the PC point of view, each decision tree is a hierarchy of clusters. The root node of the tree holds all the examples. When traversing the tree (from the root to a leaf), each intermediate node contains fewer examples than its parent node. The connections between nodes represent the available paths that each example can take. The decision on which path a new example takes is made at traversal time and is based on the example's values of the predictive variables. The bottommost nodes of the decision tree are called leaf nodes and hold the examples most similar to each other. The examples in a leaf are used for calculating the prediction of the leaf.
A decision tree within the PC framework is called a predictive clustering tree (PCT). A PCT is predictive as it is able to make predictions. A PCT is a clustering, i.e., a hierarchy of clusters, represented by the tree’s structure. Each node in the tree represents a cluster, which can be explained/described by the conditions/tests that appear in the tree. Each node holds a test and if we combine all the tests from the root node to the selected node, we get the description of the cluster at the selected node. Several different predictive clustering methods (Blockeel and Struyf 2002; Struyf and Džeroski 2006; Kocev 2011; Ženko 2007; Vens et al. 2008; Slavkov et al. 2010) are already implemented in the CLUS software package and are available at http://sourceforge.net/projects/clus/.
3.2 Learning a single PCT
The induction of a PCT is similar to the induction of a standard decision tree and follows the TDIDT (top-down induction of decision trees) algorithm. Algorithm 1 shows the pseudocode for PCT induction. Considering the MTR task in the context of ROS tree ensembles, the PCT induction algorithm takes three inputs:
(i) a dataset S, (ii) a function \(\delta _c(X)\) that randomly samples c predictive variables from dataset X without replacement and (iii) a set of target variables \(R_t\) that the learning process should consider.
The typical PCT considers all predictive and target attributes and is induced by selecting \(\delta _c(X) = \delta _{D}(X) = D\) and \(R_t=T\), where D and T are sets of predictive and target variables respectively. The \(\delta _c\) and \(R_t\) parameters are needed when inducing PCTs in the scope of ensemble learning with ROS, which we describe in detail in Sect. 3.3.
There are many ways (i.e., heuristics) to select the best possible split in a decision tree. PCTs use the reduction of variance as a measure of split quality. Intuitively, the heuristic function guides the data partitioning in such a way that the homogeneity of the clusters increases from the root towards the leaf nodes, and the resulting tree model contains the most similar examples in the smallest clusters (the leaf nodes of the tree). The reduction in cluster variance is a direct result of the partitioning \(\mathcal {P}\) according to test t (see line 5 of Algorithm 2). A part of the proposed ROS extension is visible in line 2 of the same algorithm: \(\varPi (S,R_d,R_t)\) is a projection function that reduces the original dataset S by only considering predictive attributes from the set \(R_d\) and target variables from the set \(R_t\).
The reduction of variance calculation is instantiated based on the type of machine learning task addressed. In this paper, we focus on multi-target regression, where the variance (Eq. 1) is calculated as the sum of the variances of the individual target variables: \(\mathrm{Var}(S) = \sum _{i=1}^{t} \mathrm{Var}(Y_{i})\), where \(Y_i\) is the vector of continuous values of the ith target variable in the set of examples S. The variance is calculated using the standard deviation of the values in each vector \(Y_i\). The variances of the individual target attributes are normalized in order to put them on the same scale: when target attributes span different ranges, the variance of one variable could have a much greater effect than the variance of another variable with a smaller range. Normalization ensures that each target attribute contributes equally to the heuristic score.
Regular PCTs make predictions based on the examples in the leaf nodes. Specifically, the Prototype function (line 9 of Algorithm 1) calculates the arithmetic mean of every target variable over all the examples in a given node. If needed, the prototype calculation function can easily be adapted to better address a specific task. The BestTest function calculates the heuristic value only on \(S_R\) (the reduced subset of S). This, however, does not restrict the Prototype function, which can make predictions for all output variables, even if some of them did not contribute to the calculation of the heuristic value \(h^*\). We discuss this further in Sect. 3.3.3.
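The variance-reduction heuristic and the mean prototype can be sketched as follows. This is our own minimal rendering, assuming NumPy; the function names `variance`, `variance_reduction` and `prototype` are ours, and the normalization follows our reading of the text (each target's variance is divided by its variance over the full training set):

```python
import numpy as np

def variance(Y, Y_full):
    """Sum of per-target variances, each normalized by the variance of
    that target over the full training set (our reading of Eq. 1 plus
    the normalization described in the text)."""
    norm = np.var(Y_full, axis=0)
    norm[norm == 0] = 1.0          # guard against constant targets
    return float(np.sum(np.var(Y, axis=0) / norm))

def variance_reduction(Y, mask, Y_full):
    """Heuristic score of a binary split, given as a boolean mask:
    variance of the node minus the weighted variances of its children."""
    n = len(Y)
    left, right = Y[mask], Y[~mask]
    return (variance(Y, Y_full)
            - (len(left) / n) * variance(left, Y_full)
            - (len(right) / n) * variance(right, Y_full))

def prototype(Y):
    """Leaf prediction: arithmetic mean of every target variable."""
    return Y.mean(axis=0)
```

A split that cleanly separates two clusters of target values yields a large positive `variance_reduction`; with ROS, these functions would be applied only to the columns of Y listed in the subspace \(R_t\).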
In addition to regular PCTs, we also consider extra-PCTs (Kocev and Ceci 2015). These PCTs are induced in exactly the same way as described in Algorithm 1. However, an extra-PCT finds the split points in a different manner (see Algorithms 3 and 4): the split point is randomly selected for each considered predictive attribute. The evaluation of splits with random split points is performed using the same procedure as for regular PCTs.
3.3 Ensembles with ROS
An ensemble is a set of models, called base predictive models. Ensemble models are not considered interpretable, but they generally achieve better predictive performance than individual models, which is usually the reason for using them. The downside of using ensembles is their computational complexity: the cost of learning and using an ensemble model is the sum of the corresponding costs for all of its base models. Predictions for new examples are made by querying the base models and combining their predictions.
3.3.1 Generating output space partitions
The proposed ensemble approach introduces randomization in the output space. Whereas regular PCTs simultaneously consider the whole target space in the heuristic used for tree construction, ROS considers a different random subset of it for each base model in the ensemble. Each base model is consequently learned by considering only those targets that are included in the randomly generated partition provided to it (see the call of function \(\varPi \) in Algorithm 2, line 2).
ROS creates the output space partitions (subspaces) in advance, i.e., the partitions are independent of the algorithm for learning a single model. ROS generates a different subspace for every ensemble constituent. Thus, the number of generated subspaces (base models) equals b. Algorithm 5 constructs the subspaces. The algorithm has the following parameters: (i) the number of subspaces b to generate, (ii) a function \(\theta (X,v)\) that uniformly at random samples a subset from the set X without replacement and (iii) the set/space of target attributes T, from which the subspaces are created.
In the first step, we create an empty list that will contain all the subspaces, i.e., \(G = [G_1, G_2, \dots , G_b]\). The first generated subspace is T and includes all target variables, i.e., the corresponding PCT considers all target attributes. This is needed to ensure that all targets are considered at least once. We generate the remaining \(b-1\) subspaces with the \(\theta \) function, which has a parameter v. An example of a \(\theta \) function could return a random selection of 25% (\(v=\frac{1}{4}\)) of the items in the set provided as input. If one defines \(\theta (X) = X\), then all ensemble constituents will always consider all targets, which is what a regular ensemble of PCTs does. This function is a parameter of our overall ensemble learning algorithm, and we investigate its influence in Sect. 5.
Algorithm 6 describes the random sampling function \(\theta \) used in our experiments. The function \(\theta (X,v)\) uniformly at random samples \(\lceil v \cdot |X| \rceil \) items from the set X, where v represents the percentage of X we want to sample. This algorithm thus always samples a fixed number of attributes.
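The subspace generation just described (Algorithms 5 and 6, as we read them) can be sketched in a few lines of Python. The names `gen_subspaces` and `theta` are ours; the actual implementation resides in the CLUS package and may differ in detail:

```python
import math
import random

def theta(targets, v, rng):
    """Uniformly sample ceil(v * |targets|) targets without replacement."""
    k = math.ceil(v * len(targets))
    return set(rng.sample(sorted(targets), k))

def gen_subspaces(b, v, targets, seed=0):
    """Build b output subspaces G = [G_1, ..., G_b]. The first subspace
    keeps all targets, so every target is considered at least once;
    the remaining b - 1 are drawn by theta."""
    rng = random.Random(seed)
    G = [set(targets)]                               # G_1 = T
    G += [theta(targets, v, rng) for _ in range(b - 1)]
    return G
```

For instance, `gen_subspaces(4, 0.25, range(8))` returns one subspace with all 8 targets and three subspaces of \(\lceil 0.25 \cdot 8 \rceil = 2\) targets each.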
3.3.2 Building the ensembles
With all preliminaries laid out, we can now describe our overall process for learning an ensemble of PCTs for MTR. We use three ensemble building methods (bagging, random forests and extremely randomized trees) that have been extended to support multi-target outputs and use PCTs (Bag and RF) and extra-PCTs (ET) as base models. Algorithm 7 is generic and shows how the ensembles are built. We use the values in Table 1 to describe its custom initialization for the ensembles for multi-target regression with random output selections. All methods we consider use the same input parameters: (i) the dataset S, (ii) the function \(\gamma (X)\), which samples the dataset X, (iii) the function \(\delta _c(D)\), which randomly selects c predictive attributes to be considered at each node during learning, and (iv) the list G of subspaces generated by the GenSubspaces function, where \(|G| = b\) is the number of ensemble constituents.
Bagging (Breiman 1996) is short for bootstrap aggregating. It is an ensemble method that uses bootstrap replication of the training data to introduce randomization into the learning dataset. Such perturbations of the learning set have proven useful for unstable base models, such as decision trees, but can in general be used with any model type. A bootstrap replicate \(S^*\) of a dataset S is again a dataset that has been randomly sampled from S. Sampling with replacement is repeated until both datasets are of equal size (i.e., \(|S| = |S^*|\)).
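A bootstrap replicate as used by bagging can be sketched as follows (an illustrative snippet; the name `bootstrap` is ours):

```python
import random

def bootstrap(S, rng):
    """Bootstrap replicate S*: draw |S| examples from S uniformly with
    replacement, so that |S*| = |S|. On average, about 63.2% of the
    distinct examples of S appear in S*."""
    return [rng.choice(S) for _ in range(len(S))]

replicate = bootstrap(list(range(10)), random.Random(42))
```

Each base model of the bagging ensemble is then trained on its own replicate.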
Random forests (Breiman 2001) work in a similar fashion to bagging. This ensemble method also starts with bootstrap replicates, which introduce randomization in the instance space. However, it additionally introduces randomization in the predictive attribute space by randomizing the algorithm for the base predictive models. The \(\delta _c\) parameter of the PCT induction algorithm (see Algorithm 1) is instantiated as shown in Table 1. This causes the random forest ensemble method to only consider a subset of randomly selected predictive attributes from the set D of all predictive attributes, while searching for the best split for a node. This process of random selection of predictive attributes is then repeated afresh at each node, yielding different subsets of predictive attributes. The function \(\delta _c(D)\) can be defined to return any number of items from the set D between 1 and \(|D|\), but the recommended setting by Breiman (2001) is \(\sqrt{|D|}\), which is what we use.
Extremely randomized trees (Geurts et al. 2006) are very unstable decision trees, so it only makes sense to use them in an ensemble setting. The method has two distinctive properties with respect to the other two methods: (i) the dataset is not perturbed by applying bootstrapping and (ii) the BestSplit method used by extra-PCTs is shown in Algorithm 3. Extra trees select k predictive attributes at random and, for each of them, randomly select a split point. Each split is then evaluated and the one with the best heuristic value \(h^*\) is selected. Algorithm 4 shows how the random split points are determined. The recommended number (Geurts et al. 2006) of predictive attributes to be considered at every split is \(|D|\), which is reflected in our ensemble initialization for extra trees (see Table 1).
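The extra-tree split selection (Algorithms 3 and 4, as we read them) can be sketched as follows. The names `random_split`, `best_extra_split` and the `score` callback are ours; `score` stands in for the variance-reduction heuristic, and the uniform threshold between the observed minimum and maximum is an assumption of this sketch:

```python
import random

def random_split(values, rng):
    """Pick one random threshold for a numeric attribute, uniformly
    between its minimum and maximum observed value (our assumption
    about the split-point scheme)."""
    lo, hi = min(values), max(values)
    return rng.uniform(lo, hi)

def best_extra_split(columns, score, rng, k):
    """Draw one random split point for each of k randomly chosen
    attributes and keep the candidate with the best heuristic score.
    `columns[a]` holds the values of attribute a; `score(a, threshold)`
    is the caller-supplied heuristic (e.g., variance reduction)."""
    attrs = rng.sample(range(len(columns)), k)
    candidates = [(a, random_split(columns[a], rng)) for a in attrs]
    return max(candidates, key=lambda c: score(*c))
```

In the ET instantiation, k would equal the number of predictive attributes; combined with ROS, `score` would be computed only over the target subspace \(R_t\).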
3.3.3 Making predictions
An ensemble makes predictions by combining the predictions of its base models. Each base predictive model gives its predictions to the aggregation function, which takes all the votes and decides on the final prediction of the ensemble. In general, the aggregation function is a parameter and there are many ways to combine the votes of the base predictive models: averaging the predictions, majority vote, introducing weights for individual models, introducing preferences based on domain knowledge, and so on. In this paper, we propose two different aggregation functions used in conjunction with the proposed method: (i) total averaging and (ii) subspace averaging.
Total averaging takes the predictions of all base models and averages them. Each base model gives predictions for all t targets. We calculate the average as the arithmetic mean: the final prediction for the ith target attribute (\(\widehat{y}_i\)) is computed as \(\widehat{y}_i=\frac{1}{b}\sum _{j=1}^{b}y_{i}^{j}\), where \(y_{i}^{j}\) represents the prediction of the jth base model for the ith target attribute.
Subspace averaging considers only the predictions made by the base predictive models for the targets used to learn them. In other words, the prediction for a given target is averaged only over those base models for which that target was considered in the heuristic during learning (see the input parameter \(R_t\) in Algorithm 2). The final prediction for the ith target attribute is computed as \(\widehat{y}_i = \frac{\sum _{j=1}^{b} \mathbb {1}(a_i \in G_j)\, y_{i}^{j}}{\sum _{j=1}^{b} \mathbb {1}(a_i \in G_j)},\)
where \(\mathbb {1}(X)\) is an indicator function, which returns 1 if the argument X is true and 0 otherwise; \(G_j\) (see Sect. 3.3.1) is the subspace of target attributes considered when learning the jth ensemble constituent, and \(a_i\) denotes the ith target. The denominator is the number of ensemble constituents for which \(a_i\) was considered during learning; it is nonzero because every target attribute is considered at least in the subspace \(G_1\).
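The two aggregation functions can be sketched side by side (an illustrative snippet with our own names; each base model's predictions are represented as a dict mapping target index to predicted value):

```python
def total_average(predictions, target):
    """Default aggregation: arithmetic mean over all b base models."""
    return sum(p[target] for p in predictions) / len(predictions)

def subspace_average(predictions, subspaces, target):
    """ROS subspace averaging: average the predictions for `target` over
    only those base models whose subspace G_j contains it. The vote list
    is never empty, because G_1 covers all targets."""
    votes = [p[target] for p, G in zip(predictions, subspaces) if target in G]
    return sum(votes) / len(votes)
```

For example, with two base models predicting `{0: 1.0, 1: 2.0}` and `{0: 3.0, 1: 0.0}` and subspaces `{0, 1}` and `{0}`, the two functions agree on target 0 but differ on target 1, where subspace averaging ignores the second model's vote.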
3.4 Computational complexity analysis
From the work of Kocev et al. (2013) and the assumption that the decision trees are balanced and bushy (Witten and Frank 2005), it follows that the computational complexity of learning a single multi-target PCT is \(\mathcal {O}(dN\log ^2 N) + \mathcal {O}(dtN\log N) + \mathcal {O}(N\log N)\), where N is the number of instances, d is the number of predictive attributes and t is the number of target attributes in the dataset. Similarly, from the work of Kocev and Ceci (2015) and under the same assumption, it follows that the computational complexity of learning a single multi-target extra-PCT is \(\mathcal {O}(ktN\log N) + \mathcal {O}(N\log N)\), where k is the number of randomly sampled predictive attributes at each split. In general, learning an ensemble of b base models has the complexity of learning all of its constituents. In our case, that amounts to \(b(\mathcal {O}(dN\log ^2 N) + \mathcal {O}(dtN\log N) + \mathcal {O}(N\log N))\) for bagging and random forests of PCTs and \(b(\mathcal {O}(ktN\log N) + \mathcal {O}(N\log N))\) for ensembles of extra-PCTs.
The computational complexity also depends on the use of bootstrapping and on the number of predictive and/or target attributes considered by each base model. The computational cost of bootstrapping is \(\mathcal {O}(N)\), and the expected number of distinct instances considered in that case is \(N'=0.632 \cdot N\) (Breiman 1996). Bootstrapping is not used for learning extra-PCTs.
Taking into account the fact that random forests also sample the input space (through the sampling function \(\delta _c(D)\)), the number of predictive variables actually considered by the base models is \(d'=c\) (see the definition of \(\delta _c\) in Sect. 3.2). The sampling of predictive variables happens at every node split, so the complexity of data subsampling is \(\mathcal {O}(d'\log N')\).
ROS uses additional sampling of the target space. The function \(\theta (X,v)\) (see Algorithm 6) is used to sample from the target space. The sampled subsets are always of equal cardinality, which is controlled by the parameter \(v \in (0.0,1.0)\). However, the first subset always includes all target attributes (see line 2 in Algorithm 5). We define \(t'\), the average target subspace cardinality considered by the base models, as \(t'= \frac{1}{b}\big ((b-1)\cdot \lceil v \cdot t\rceil + t \big )\). The complexity of the sampling function \(\theta (X,v)\) is low: all operations of the sampling algorithm have complexity \(\mathcal {O}(1)\), and the while loop in line 2 is executed \(\lceil v \cdot |X|\rceil \) times. Each randomly sampled attribute a has equal probability of being included in the resulting set Q. Thus, the complexity of the sampling procedure, which is invoked \(b-1\) times, is linear and proportional to \((b-1)\cdot \mathcal {O}(v \cdot t)\). Considering all of the above, the complexity of the ROS ensembles is
\(b\big (\mathcal {O}(dN\log ^2 N) + \mathcal {O}(dt'N\log N) + \mathcal {O}(N\log N)\big )\) for bagging and random forests of PCTs and \(b\big (\mathcal {O}(kt'N\log N) + \mathcal {O}(N\log N)\big )\) for random forests of extraPCTs.
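The subspace-generation step can be sketched as follows (a simplification of Algorithms 5 and 6 based on the description above; function and parameter names are our own):

```python
import math
import random

def gen_subspaces(targets, b, v, seed=0):
    """Generate b target subspaces: the first contains all targets,
    while the remaining b - 1 are random subsets of ceil(v * t) targets."""
    rng = random.Random(seed)
    targets = list(targets)
    size = math.ceil(v * len(targets))
    subspaces = [list(targets)]               # G_1 = T (full output space)
    for _ in range(b - 1):
        subspaces.append(rng.sample(targets, size))
    return subspaces

G = gen_subspaces(range(8), b=4, v=0.25)
print([len(g) for g in G])  # [8, 2, 2, 2]
```

Each of the b - 1 sampled subspaces has the same fixed cardinality, matching the constant subset size controlled by v in the text.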
The ratio between the full output space size and the one considered by ROS is (up to rounding) \(\frac{t'}{t} = \frac{(b-1)\cdot v \cdot t + t}{b\cdot t} = \frac{(b-1)\cdot v + 1}{b}\), which tends to v as \(b \rightarrow \infty \). The overall complexity of ROS is consequently reduced in the terms that correspond to the selection of the best split; we expect a linear decrease in complexity in those terms. Otherwise, the overall complexity is still as described by Kocev et al. (2013) and Kocev and Ceci (2015).
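This convergence is easy to verify numerically; the sketch below (our own check) computes \(t'/t\) for growing ensemble sizes with \(v=0.25\) and \(t=16\):

```python
import math

def subspace_ratio(b, v, t):
    """Ratio t'/t, with t' the average target-subspace cardinality."""
    t_prime = ((b - 1) * math.ceil(v * t) + t) / b
    return t_prime / t

for b in (10, 100, 1000):
    print(b, subspace_ratio(b, v=0.25, t=16))
# the ratio shrinks towards v = 0.25 as b grows
```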
Ensembles usually contain many base models, which results in longer prediction times. Therefore, we also address the complexity of making predictions. Under the previously mentioned assumption that decision trees are balanced and bushy, the average depth of a decision tree is the average length of the path that an instance traverses in order to reach a prediction. The complexity of making a prediction with a singletarget decision tree is therefore \(\mathcal {O}(\log N)\). In a global MTR scenario, all target variables are predicted simultaneously with the same complexity as that of making a prediction with a singletarget tree. When we switch to the ensemble setting, the complexity increases linearly with the number of base models in the ensemble: \(b\cdot \mathcal {O}(\log N)\). If we approach the problem of MTR locally, each target is predicted with its own ensemble, which additionally increases the complexity in proportion to the number of target variables: \(b\cdot t\cdot \mathcal {O}(\log N)\).
4 Experimental design
To evaluate the performance of the ROS ensembles for MTR, we performed extensive experiments on benchmark datasets. This section presents: (i) the experimental questions addressed, (ii) the evaluation measures used, (iii) the benchmark datasets and (iv) the experimental setup (including the parameter instantiations for the methods used in the experiments).
4.1 Experimental questions
In our experiments, we construct PCT ensembles for MTR by using the described ensemble extension method ROS. In order to better understand the effects of ROS, we investigate the resulting ensemble models across three dimensions.
First, we are interested in the convergence of their predictive performance as we increase the number of PCTs in the ensemble. We want to establish the number of base models needed in an ensemble to reach the point of performance saturation. We consider an ensemble saturated when adding further base models would not bring a statistically significant improvement in predictive power.
Next, we are interested in whether the proposed extension can improve the predictive performance over that of the original ensembles. Learning on subsets of targets could exploit additional structural relations that may be overlooked by the original ensemble approaches.
Finally, as we have theoretically derived in Sect. 3.4, we expect that the dimensionality reduction of the output space will yield improvements in terms of computational efficiency. Specifically, we are interested in the running times of the ROS ensemble approaches and the sizes of the resulting models.
The specific experimental questions we pose relate to the above three dimensions we are interested in. The experiments and their evaluation have been designed with the following research questions in mind:

1.
How many base models do we need in ROS ensembles in order to reach the point of performance saturation?

2.
What is the best value for the portion of target space to be used within such ensembles? Is this portion equal for all evaluated ensemble methods?

3.
Does it make sense to change the default aggregation function of the ensemble that uses the prediction for all targets? Can this improve predictive performance?

4.
Considering predictive performance, how do ROS ensemble methods compare to the original ensemble methods?

5.
Is ROS helpful in terms of time efficiency?

6.
Do ROS models use less memory than the models trained with the original ensemble methods?

7.
How do ROS models compare to other output transformation methods?
4.2 Evaluation measures
In order to understand the effects that ROS has on the learning process, we first need to evaluate the models induced by the ROS ensemble approaches. In machine learning, this is most commonly done through empirical evaluation, which assesses the performance of a given model in terms of evaluation measures. Below, we describe the measures that we use for assessing predictive power, time and space complexity.
The predictive performance of a MTR model is assessed by using the average relative root mean squared error (aRRMSE), which averages the relative root mean squared errors (RRMSE) for the individual target variables. RRMSE is a relative measure calculated against the baseline model that predicts the arithmetic mean of all values of a given target in the learning set. Specifically, the value \(\overline{y}_{i}\) in Eq. 5 is the prediction of the baseline model for the ith target variable, while the value \(\hat{y}^{(e)}_i\) represents the predicted value for the ith target variable of the example e.
We also monitor how much our models overfit the training data by calculating their relative decrease of performance on the testing data with respect to that on the training data. Smaller values mean less overfitting, with zero being the ideal score. We calculate the overfitting score with Eq. 6.
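Since Eqs. 5 and 6 are not reproduced here, the two measures can be sketched as follows; this is a minimal implementation based on the textual definitions above (the exact form of Eq. 6 is our assumption):

```python
import numpy as np

def arrmse(y_true, y_pred, y_mean_train):
    """Average RRMSE over targets: the model's per-target RMSE divided by
    the RMSE of the baseline predicting the training-set mean (Eq. 5)."""
    num = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    den = np.sqrt(np.mean((y_true - y_mean_train) ** 2, axis=0))
    return np.mean(num / den)

def overfitting_score(err_test, err_train):
    """Relative performance decrease on test data w.r.t. train data;
    zero is the ideal score (assumed form of Eq. 6)."""
    return (err_test - err_train) / err_train
```

Under this definition, a perfect model scores 0 and a model no better than the per-target mean scores 1, so aRRMSE values below 1 indicate an improvement over the baseline.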
The efficiency is measured in terms of execution times and sizes of the induced models. Time efficiency is measured with the CPU time needed to induce (train) the model (i.e., learning time). For ROS ensembles, this includes the target space decomposition. We also measure the average time needed to make a prediction (i.e., prediction time). Space efficiency is measured with the total number of nodes in the tree model (intermediate and leaf nodes): the smaller the better. Induction times and model sizes are summed over all ensemble constituents.
4.3 Data description
To evaluate the proposed method, we use 17 benchmark datasets that contain multiple continuous target attributes and mainly come from the domain of ecological modeling. Table 2 shows the main characteristics of the considered datasets. In order to make the evaluation as general as possible, we use datasets of different sizes in terms of the number of instances and the numbers of predictive and target attributes.
4.4 Experimental setup
We designed the experimental setup according to the experimental questions posed in Sect. 4.1. First, we describe all parameter settings of the ROS ensemble methods. We then outline the procedures for statistical analysis of the results.
We consider three types of ensembles: bagging and random forests of PCTs and extraPCTs. In order for Algorithm 7 to simulate these three ensemble methods, we set its parameters to the values given in Table 1. Following the recommendations from Bauer and Kohavi (1999), the trees in the ensembles are unpruned. Our experimental study considers different ensemble sizes, i.e., different numbers of base models (PCTs) in the ensemble, in order to investigate the saturation of ensembles and to select the saturation point.
First, we construct ensembles without ROS (Bag, RF, ET) that use the full output space for learning the base predictive models. This means that the list G contains b sets, each containing all the attributes from the set T of target attributes, i.e., \(G = \left\{ T,T,\ldots ,T\right\} \), where \(|G| = b\).
The second part of our experiments is concerned with the proposed extension—Random Output Selections. We start with the parametrization of the GenSubspaces function (Algorithm 5), which takes as input b, T and the sampling function \(\theta (X,v)\) (see Algorithm 6). We consider four values for v in the allowed range (0.0, 1.0), namely \(\frac{1}{\sqrt{|T|}}, \frac{1}{4},\frac{1}{2}, \frac{3}{4}\). Additionally, we use two ensemble prediction aggregation functions: total averaging and subspace averaging. Table 3 summarizes the parameter values considered in our experiments.
The third part of our study focuses on the comparison of our ROS ensemble methods to baseline methods. To that end, we also train multitarget PCTs and ensembles of singletarget PCTs on each of the 17 benchmark datasets. F-test pruning is applied to the single multitarget PCTs; the F value is selected using internal 3fold crossvalidation. For the singletarget case, we build one ensemble for each target variable; these ensembles contain 100 base models per target and are built using the same parameters as the original ensembles.
We estimate the predictive performance of the considered methods by using 10fold crossvalidation. All methods use the same folds. For statistical evaluation of the obtained results, we follow the recommendations from Demšar (2006). The Friedman test (Friedman 1940), with the correction by Iman and Davenport (1980), is used to determine statistical significance. In order to detect statistically significant differences, we calculate the critical distances (CD) by applying the Nemenyi (1963) or Bonferroni–Dunn (Dunn 1961) posthoc statistical tests. Both posthoc tests compute a critical distance between the ranks of the considered algorithms. The difference is that the Nemenyi posthoc test compares the relative performance of all considered methods (all vs. all), whereas the Bonferroni–Dunn posthoc test compares the performance of a single method to the other methods (one vs. all). The results of these tests are presented with average rank diagrams (Demšar 2006), where methods connected with a line have results that are not statistically significantly different. All statistical tests were conducted at the significance level \(\alpha = 0.05\). Statistical tests were calculated for two variants of the results: per dataset (using the aRRMSE value for each dataset) and per target (using the RRMSE values for all targets of all datasets). We use the Bonferroni–Dunn posthoc test (CD is shown as a dotted blue line) to present results in Sect. 5.4 and the Nemenyi posthoc test (CD is shown as a solid red line) otherwise.
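For reference, the critical distance in these diagrams follows Demšar (2006): \(CD = q_\alpha \sqrt{k(k+1)/(6N)}\) for k methods compared over N datasets, where \(q_\alpha \) is the critical value of the corresponding posthoc test. A small sketch (the \(q_{0.05}\) value for six methods is taken from Demšar's tables):

```python
import math

def critical_distance(k, n, q_alpha):
    """Critical distance between average ranks of k methods over n datasets
    (Demsar, 2006); q_alpha depends on the post-hoc test and on k."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# e.g. six methods over the 17 datasets, Nemenyi q_0.05 = 2.850 for k = 6
print(round(critical_distance(6, 17, 2.850), 3))  # 1.829
```

Two methods whose average ranks differ by more than this distance are declared significantly different.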
The experiments were executed on a heterogeneous computing infrastructure, i.e., the SLING grid, which can affect timesensitive evaluations. To avoid having incomparable measurements of running times, we run timesensitive experiments separately by using a single computational node.
5 Results and discussion
Here we present the results of our comprehensive experimental study. Given the large number of datasets (17) and several ensemble methods, we present the results in terms of predictive performance (aRRMSE, overfitting score—OS), time complexity (learning and prediction time) and space complexity (model size). In the presentation of time complexity results, we focus on two datasets, Forestry Kras and OES 10, which have relatively large output spaces (11 and 16 targets, respectively). The selected datasets also differ in the number of examples: Forestry Kras has many, whereas OES 10 has few. For reference, all other results are available in "Appendix".
The presentation and discussion of the results follows the experimental questions from Sect. 4.1. First, we examine the convergence of the original and ROS ensembles. Next, we focus on selecting the output space size. We experiment with four different output space sizes (see Table 3). This parameter is crucial because it introduces an additional point of randomization into all three considered ensemble methods. In that sense, ROS can also be seen as a localization process: the constructed base models are tailored to a specific output subspace. We recommend values for this parameter for each ensemble learning method. Furthermore, we show the effects of changing the aggregation functions in our ensembles. Finally, we use the recommended parameters for ROS and provide an overall evaluation by comparing the extended ensembles to the original ones. We also compare the ROS methods to the baseline methods in terms of predictive power, running times and model sizes.
5.1 Ensemble convergence
The saturation points of the original ensembles (Bag, RF and ET) are located between 50 and 75 base models (Fig. 1). These findings are in line with the work of Kocev et al. (2013), where Bag and RF saturate with 50 base models, and of Kocev and Ceci (2015), where RF and ET saturate with 100 and 75 base models, respectively. We attribute this difference to two factors: (i) we use a different and larger number of datasets and (ii) the number of target attributes per dataset in our study is considerably larger. All in all, we consider the original ensembles with 75 base models saturated.
We next investigate the saturation of the ROS ensembles. A subset of the results, for ensembles with 50, 100, 150 and 250 models, is reported in Fig. 2 and illustrates the saturation of the ROS ensembles for all three considered ensemble methods. Lines on the plots represent different output space sizes; values in brackets indicate the value of the v parameter. The left and right sides of the plots depict voting with total averaging and subspace averaging, respectively. The y axis shows aRRMSE values averaged over all considered datasets. The results show that BagROS and ETROS ensembles saturate between 50 and 100 base models, while RFROS ensembles saturate a bit later, between 75 and 100 base models. Figure 2 suggests that ROS ensembles saturate at a larger number of base models when subspace averaging is used to aggregate the predictions of the base models. The performance, in terms of aRRMSE and overfitting scores, of all discussed ensembles with 100 trees (multitarget and singletarget variants) and of single multitarget regression trees is presented in "Appendix B".
5.2 ROS parameter selection
This section describes the selection of the best performing output subspace size and aggregation function. The considered ensemble methods introduce different randomizations in their learning processes, so we cannot assume that ROS has the same influence on all three types of ensemble methods. Figure 2 also suggests that the choice of the aggregation function has a direct effect on performance. We therefore analyze the effect of output subspace size and aggregation function for each ensemble type separately.
We selected candidate values for the ROS parameters based on the curves given in Fig. 2. With both aggregation functions, the candidate parameter values for subspace sizes are \(v=\frac{1}{2}\) for BagROS and \(v=\frac{3}{4}\) for RFROS and ETROS. We selected these values because they exhibit the lowest aRRMSE averaged over all datasets used in this study. The averaged saturation curves in Fig. 2 sometimes intertwine, which makes this decision difficult; in those cases, we selected the parameter values based on the averaged performance of ensembles with 100 trees. Next, we performed a simple analysis comparing the wins of the two considered aggregation functions at the candidate output space size. For BagROS, total averaging had the most wins, whereas subspace averaging was dominant for RFROS and ETROS. Our final parameter recommendation is therefore to use total averaging with \(v=\frac{1}{2}\) for BagROS and subspace averaging with \(v=\frac{3}{4}\) for RFROS and ETROS.
5.3 Predictive performance and computational efficiency
Here, we compare the original ensemble methods (Bag, RF, ET) to the ones that use the ROS extension. In addition, ROS is also compared to multitarget PCTs and ensembles of singletarget PCTs. We show the relative performance of the different methods by using the average rank diagrams shown in Fig. 3.
Figure 3 depicts two average rank diagrams: one per dataset and one per target. The per dataset diagram is based on the aRRMSE values, one per dataset. Both analyses show that the ensembles statistically significantly outperform individual multitarget PCTs. The per dataset analysis shows no statistically significant differences in predictive performance among the other methods. We can, however, note that BagROS and ETROS outperform their original counterparts and the ensembles of singletarget PCTs. RFROS performs on par with the original bagging and random forest ensembles, but worse than the other ROS ensembles (BagROS and ETROS). The best performing method overall is ETROS.
The per target analysis detects two statistically significant differences in performance. First, with the exception of ETST, ETROS outperforms all other methods with statistical significance. Second, BagROS outperforms RFROS, which performs worst of all the ensemble methods. The original ensembles (Bag, RF and ET) show no statistically significant differences in performance among themselves. All in all, ROS ensembles generally perform better than their original counterparts, with the exception of random forests.
In Table 4, we show the predictive performance (aRRMSE) for two highlighted datasets: Forestry Kras and OES 10. The table contains results for the baseline ensembles (Bag, RF and ET) as well as the extended ROS ensembles, individual multitarget PCTs and ensembles of singletarget PCTs. All ensembles contain 100 base models. The ensembles of singletarget PCTs contain 100 base models per target.
For the Forestry Kras dataset, the proposed ROS methods do not have a notable effect on the predictive performance (aRRMSE) of the three ensemble methods. Similar findings are observed when calculating the overfitting score (OS): the ROS ensembles overfit the training data to the same extent as their original counterparts. Next, multitarget PCTs and ensembles of singletarget PCTs have the worst predictive performance. The difference in predictive performance between ensembles of singletarget PCTs and the other ensembles is minimal. However, notable differences exist in the time needed to learn the model and make predictions: the multitarget ensembles have significantly lower learning and prediction times than the singletarget ensembles. The ROS extension trains the ensembles faster (for bagging and extra trees), but still within the same order of magnitude as the original methods. Not surprisingly, single multitarget PCTs have the shortest learning times, at the cost of the lowest predictive performance. Similar findings are observed when considering model complexity (measured as the total number of nodes in all of the trees in an ensemble or in a multitarget PCT). Average prediction times per instance do not differ across the different approaches. This is expected, since all base models are trees and no additional computational overhead is needed to calculate the predictions. Ensembles of singletarget PCTs always have an order of magnitude higher learning and prediction times, as well as model complexity, since a separate ensemble is learned for predicting each target.
For the OES 10 dataset, improvements in predictive performance are present. The proposed ROS ensembles outperform their original counterparts. Furthermore, the original ensembles were outperformed by the ensembles of singletarget PCTs. The predictive performance gain with ETROS w.r.t. ET is substantial. This is an interesting observation and suggests that ROS could lift predictive performance on smaller datasets with larger output spaces, especially for heavily randomized methods such as extra trees. One possible explanation is that the sampling of input variables in ET, coupled with the small number of examples in the dataset and absence of bootstrapping, introduces a relatively high level of noise in the learning process. The ROS ensemble then actually reduces the effect of this noise at the level of individual base models by specializing them for a smaller output space. This can also explain the small gains for bagging and random forests with the ROS extension on this dataset, because the bootstrapping actually negatively impacts the overall ensemble performance. By inspecting the overfitting score, we note that ROS ensembles consistently exhibit a decreased score w.r.t. ensembles of singletarget PCTs and perform comparably w.r.t. ensembles of multitarget PCTs. Learning and prediction times, as well as model complexity, follow similar patterns as for the Forestry Kras dataset.
5.4 Comparison with other output space transformation methods
In order to put ROS in the broader context of MTR methods with output space transformations, we compare the predictive performance of ROS ensembles and ensembles built with the competing methods proposed in Joly et al. (2014) and Tsoumakas et al. (2014). We have selected these specific methods because they all specialize individual models in the ensemble to a subset of target variables.
Joly et al. (2014) propose ensembles of multioutput regression trees, where each individual tree is built by using a projected output space. Gaussian, Rademacher, Hadamard and Achlioptas projections are used. The goal is to truncate the output space in order to reduce the number of calculations needed to find the best split, which is the main computational burden when building a decision tree. While learning the ensemble, each tree is given a different output space projection. They use two different ensemble methods: random forests and extra trees. We dub their method Random projections and its two variants RPRF and RPET. Note that Random projections cannot handle nominal attributes and missing values; hence, nominal attributes were converted to numeric scales and missing values were imputed with the arithmetic mean of the corresponding feature.
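As an illustration of the idea (our own sketch, not Joly et al.'s implementation), a Rademacher projection of the output space can be written as:

```python
import numpy as np

def rademacher_projection(Y, r, seed=0):
    """Project the m x q output matrix Y onto r random directions whose
    coefficients are drawn uniformly from {-1, +1}."""
    rng = np.random.default_rng(seed)
    P = rng.choice([-1.0, 1.0], size=(Y.shape[1], r))
    return Y @ P / np.sqrt(r)  # scaling preserves norms in expectation

Y = np.random.default_rng(1).normal(size=(100, 16))  # 100 examples, 16 targets
print(rademacher_projection(Y, r=4).shape)  # (100, 4)
```

Each tree in the ensemble would receive a projection with a different seed, so its split heuristic operates on r columns instead of the original q.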
Tsoumakas et al. (2014) propose an ensemble method called Random Linear Target Combinations for MTR (RLC). They construct new target variables via random linear combinations of existing ones. The data must be normalized in order for the linear combinations to make sense, i.e., to prevent targets on larger scales from dominating those on smaller scales and thus deteriorating the learning process. The output space is transformed in such a way that each linear combination involves k original output features. Each combination is then considered for learning one ensemble member. The transformation of the output space matrix \(\varvec{Y}\) (\(m \times q\)) is achieved via a coefficient matrix \(\varvec{C}\) of size \(q \times r\), filled with random values chosen uniformly from [0, 1]. The columns of the matrix \(\varvec{C}\) represent the coefficients of the linear combinations of the target variables. By multiplying the two matrices, we get the transformed output space \(\varvec{Y'}=\varvec{Y}\varvec{C}\) (\(m \times r\)), which is then used for training. A user-selected regression algorithm can then be applied to the transformed data.
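The transformation itself is straightforward to sketch (a hedged illustration following the description above; restricting each column of \(\varvec{C}\) to exactly k nonzero coefficients is our reading of the text):

```python
import numpy as np

def rlc_transform(Y, r, k, seed=0):
    """Build Y' = Y C: each of the r columns of C combines exactly k
    randomly chosen original targets with weights drawn from U[0, 1]."""
    rng = np.random.default_rng(seed)
    q = Y.shape[1]
    C = np.zeros((q, r))
    for j in range(r):
        chosen = rng.choice(q, size=k, replace=False)
        C[chosen, j] = rng.uniform(0.0, 1.0, size=k)
    return Y @ C

Y = np.random.default_rng(2).uniform(size=(50, 6))  # 50 normalized examples
print(rlc_transform(Y, r=10, k=2).shape)  # (50, 10)
```

One regressor is then trained per transformed column, and predictions for the original targets are recovered from the linear system defined by \(\varvec{C}\).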
We present the results using average rank diagrams in Fig. 4, while the complete experimental results are available in "Appendix C" (Table 11). Figure 4 depicts two average rank diagrams: one per dataset and one per target. The per dataset diagram is based on the aRRMSE values, one per dataset; the per target diagram is based on the RRMSE values, with multiple targets per dataset. The per dataset analysis shows no statistically significant differences between the predictive performances of the considered ensemble methods. We can, however, note that the performances of the ETROS and RPET ensembles are on par (with a minimal advantage for ETROS). The per target analysis detects two statistically significant differences. First, ETROS statistically significantly outperforms all other methods, with the exception of RPET. Second, the RLC and RFROS ensembles are on par, and both are statistically significantly outperformed by the other methods. In addition, BagROS, RPRF and RPET perform equally well. All in all, ETROS ensembles generally perform better than the other considered ensemble methods.
5.5 Summary of the results
We summarize the main findings of the extensive experimental work presented in the paper by answering the experimental questions posed in Sect. 4.1.

1.
How many base models do we need in ROS ensembles in order to reach the point of performance saturation?
The saturation point of the original PCT ensembles is between 50 and 75 base models. BagROS and ETROS ensembles saturate between 50 and 100 base models, while RFROS ensembles saturate a bit later, at 75 to 100 base models. In the comparative analysis of performance, we consider ensembles with 100 base models (in order to make the comparison fair for all considered methods).

2.
What is the best value for the portion of target space to be used within such ensembles? Is this portion equal for all evaluated ensemble methods?
The most appropriate size of the portion of target space to be used varies with the ensemble method. The results suggest to use \(v=\frac{1}{2}\) for BagROS and \(v = \frac{3}{4}\) for RFROS and ETROS.

3.
Does it make sense to change the default aggregation function of the ensemble that uses the prediction for all targets? Can this improve predictive performance?
Changing the aggregation function changes the behaviour of the ROS ensembles. For BagROS, it can even decrease the predictive performance, so we recommend using the standard aggregation function, i.e., total averaging. For RFROS and ETROS we recommend making predictions with subspace averaging.

4.
Considering predictive performance, how do ROS ensemble methods compare to the original ensemble methods?
Using ROS can improve the predictive performance of PCT ensembles. This is especially notable when using ETROS with small datasets with larger output spaces.

5.
Is ROS helpful in terms of time efficiency?
The observed learning times for ROS methods can be substantially lower than the ones of their original counterparts. This especially holds for large datasets. Prediction times, however, do not change.

6.
Do ROS models use less memory than the models trained with the original ensemble methods?
Ensemble models obtained with ROS have sizes comparable to those of the models produced by the original ensemble methods.

7.
How do ROS models compare to other output transformation methods?
ETROS ensembles generally perform better than ensembles of other competing output transformation methods.
6 Conclusions
This work has addressed the task of learning predictive models that predict the values of multiple continuous variables for a given input tuple, referred to as multitarget regression (MTR). MTR is a task of predicting structured outputs, i.e., structured output prediction. There are two general approaches to solving tasks of this nature. The first, the local approach, learns a separate model for every component of the predicted structure, whereas the second, the global approach, learns one model capable of predicting all components of the structure simultaneously.
We have proposed novel ensemble methods for MTR. An ensemble is a set of predictive models whose predictions are combined to yield the model output. The proposed methods build on well-known ensemble learning methods that have been extended to structured outputs. The base models we have considered are predictive clustering trees (PCTs) for MTR. The methods we have proposed are based on the ensemble extension method ROS—Random Output Selections. For each ensemble constituent (PCT), the proposed extension randomly selects the targets that are considered while learning that particular base model. We perform an extensive experimental evaluation of three ensemble methods extended with ROS, i.e., bagging, random forests and extraPCTs. The performance has been evaluated on 17 benchmark datasets of varying sizes in terms of the number of examples, the number of predictive attributes and the number of target attributes.
The results show that the proposed extension has a favorable effect, yielding lower error rates and shorter running times. ROS coupled with bagging and extra trees can outperform the original ensemble methods. Random forests do not benefit from ROS in terms of predictive power, but do benefit in terms of shorter learning times. ETROS (extra trees with ROS) statistically significantly outperforms all original ensemble methods and their ROS counterparts (when analyzing predictive performance on a per target basis). We also conducted experiments with three competing methods, showing that the proposed method yields the best performance.
We have also provided a computational complexity analysis for the proposed ensemble extension. Our experiments confirm the results of the theoretical analysis. Ensembles with ROS can yield better predictive performance, as well as reduce learning times, whereas the sizes of the induced models do not change notably.
We plan future work along several possible directions. To begin with, the aggregation function has an effect on the ensemble predictive performance, as the present work demonstrates. We plan to design a new aggregation function by combining total averaging and subspace averaging, hoping to achieve better performance and better understanding of the effect of subspace averaging. Additionally, we can use outofbag errors to derive aggregation weights, such that ensemble constituents with higher error rates would make a smaller contribution to the final prediction. Furthermore, we could perform biasvariance decomposition of the error of all the investigated methods and investigate the sources of errors.
Following an alternative direction, the process of generating target subspaces could also be adapted. The current approach generates target subspaces at random, which is not necessarily the best approach. The relations between target variables could be exploited in order to generate a smaller set of more sensible subspaces.
The final direction we intend to follow is the adaptation of the proposed approaches to other structured output prediction tasks, such as multitarget classification, (hierarchical) multilabel classification, and timeseries prediction. For all of these tasks, global random forests are already being used to obtain feature rankings (in the context of predicting structured outputs). ROS could improve such rankings by considering subsets of the set of target attributes in the process of producing the ranks.
Notes
This is in contrast to the normal procedure of considering multiple/all possible split points for each attribute, before selecting the best one.
References
Abraham, Z., Tan, P. N., Winkler, J., Zhong, S., Liszewska, M., et al. (2013). Position preserving multi-output prediction. In Joint European conference on machine learning and knowledge discovery in databases (pp. 320–335). Springer.
Aho, T., Ženko, B., Džeroski, S., & Elomaa, T. (2012). Multi-target regression with rule ensembles. Journal of Machine Learning Research, 13, 2367–2407.
Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. (2012). Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3), 195–266.
Appice, A., & Džeroski, S. (2007). Stepwise induction of multi-target model trees. In Machine Learning: ECML 2007, LNCS (Vol. 4701, pp. 502–509). Springer.
Appice, A., & Malerba, D. (2014). Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering. Data Mining and Knowledge Discovery, 28(5–6), 1266–1313.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1), 105–139.
Blockeel, H. (1998). Top-down induction of first order logical decision trees. Ph.D. thesis, Katholieke Universiteit Leuven, Leuven, Belgium.
Blockeel, H., Džeroski, S., & Grbović, J. (1999). Simultaneous prediction of multiple chemical parameters of river water quality with TILDE. In Proceedings of the 3rd European conference on PKDD—LNAI (Vol. 1704, pp. 32–40). Springer.
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In Proceedings of the 15th international conference on machine learning (pp. 55–63). Morgan Kaufmann.
Blockeel, H., & Struyf, J. (2002). Efficient algorithms for decision tree cross-validation. Journal of Machine Learning Research, 3, 621–650.
Borchani, H., Varando, G., Bielza, C., & Larrañaga, P. (2015). A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5), 216–233.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., & Friedman, J. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1), 3–54.
Debeljak, M., Kocev, D., Towers, W., Jones, M., Griffiths, B., & Hallett, P. (2009). Potential of multi-objective models for risk-based mapping of the resilience characteristics of soils: Demonstration at a national level. Soil Use and Management, 25(1), 66–77.
Deger, F., Mansouri, A., Pedersen, M., Hardeberg, J. Y., & Voisin, Y. (2012). Multi- and single-output support vector regression for spectral reflectance recovery. In 2012 eighth international conference on signal image technology and internet based systems (SITIS) (pp. 805–810). IEEE.
Demšar, D., Džeroski, S., Larsen, T., Struyf, J., Axelsen, J., Bruns-Pedersen, M., et al. (2006). Using multi-objective classification to model communities of soil. Ecological Modelling, 191(1), 131–143.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52–64.
Džeroski, S., Demšar, D., & Grbović, J. (2000). Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13(1), 7–17.
Džeroski, S., Kobler, A., Gjorgjioski, V., & Panov, P. (2006). Using decision trees to predict forest stand height and canopy cover from LANDSAT and LIDAR data. In Managing environmental knowledge: EnviroInfo 2006: Proceedings of the 20th international conference on informatics for environmental protection (pp. 125–133). Aachen: Shaker Verlag.
Džeroski, S. (2007). Towards a general framework for data mining (pp. 259–300). Berlin: Springer. https://doi.org/10.1007/9783540755494_16.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.
Gamberger, D., Ženko, B., Mitelpunkt, A., Shachar, N., & Lavrač, N. (2016). Clusters of male and female Alzheimer's disease patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Brain Informatics, 3(3), 169–179.
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
Gjorgjioski, V., Džeroski, S., & White, M. (2008). Clustering analysis of vegetation data. Technical report 10065, Jožef Stefan Institute.
Han, Z., Liu, Y., Zhao, J., & Wang, W. (2012). Real time prediction for converter gas tank levels based on multi-output least square support vector regressor. Control Engineering Practice, 20(12), 1400–1409.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Incremental multi-target model trees for data streams. In Proceedings of the 2011 ACM symposium on applied computing (pp. 988–993). ACM.
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics: Theory and Methods, 9(6), 571–595.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2), 248–264.
Jančič, S., Frisvad, J. C., Kocev, D., Gostinčar, C., Džeroski, S., & Gunde-Cimerman, N. (2016). Production of secondary metabolites in extreme environments: Food and airborne Wallemia spp. produce toxic metabolites at hypersaline conditions. PLoS ONE, 11(12), e0169116.
Joly, A. (2017). Exploiting random projections and sparsity with random forests and gradient boosting methods—Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity. arXiv preprint arXiv:1704.08067
Joly, A., Geurts, P., & Wehenkel, L. (2014). Random forests with random projections of the output space for high dimensional multi-label classification. In Joint European conference on machine learning and knowledge discovery in databases (pp. 607–622). Springer.
Kaggle. (2008). Kaggle competition: Online product sales. https://www.kaggle.com/c/onlinesales/data. Accessed July 19, 2017.
Kocev, D. (2011). Ensembles for predicting structured outputs. Ph.D. thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.
Kocev, D., & Ceci, M. (2015). Ensembles of extremely randomized trees for multi-target regression. In Discovery science: 18th international conference (DS 2015), LNCS (Vol. 9356, pp. 86–100).
Kocev, D., Džeroski, S., White, M., Newell, G., & Griffioen, P. (2009). Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling, 220(8), 1159–1168.
Kocev, D., Naumoski, A., Mitreski, K., Krstić, S., & Džeroski, S. (2010). Learning habitat models for the diatom community in Lake Prespa. Ecological Modelling, 221(2), 330–337.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In ECML '07: Proceedings of the 18th European conference on machine learning—LNCS (Vol. 4701, pp. 624–631). Springer.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3), 817–833.
Kriegel, H. P., Borgwardt, K., Kröger, P., Pryakhin, A., Schubert, M., & Zimek, A. (2007). Future trends in data mining. Data Mining and Knowledge Discovery, 15, 87–97.
Levatić, J., Ceci, M., Kocev, D., & Džeroski, S. (2014). Semi-supervised learning for multi-target regression. In International workshop on new frontiers in mining complex patterns (pp. 3–18). Springer.
Madjarov, G., Gjorgjevikj, D., Dimitrovski, I., & Džeroski, S. (2016). The use of data-derived label hierarchies in multi-label classification. Journal of Intelligent Information Systems, 47(1), 57–90.
Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., et al. (2011). The Parkinson Progression Marker Initiative (PPMI). Progress in Neurobiology, 95(4), 629–635.
Micchelli, C. A., & Pontil, M. (2004). Kernels for multi-task learning. In Advances in neural information processing systems 17—Proceedings of the 2004 conference (pp. 921–928).
Nemenyi, P. B. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University, Princeton, NJ, USA.
Panov, P., Soldatova, L. N., & Džeroski, S. (2016). Generic ontology of datatypes. Information Sciences, 329, 900–920.
Slavkov, I., Gjorgjioski, V., Struyf, J., & Džeroski, S. (2010). Finding explained groups of time-course gene expression profiles with predictive clustering trees. Molecular BioSystems, 6(4), 729–740.
Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., & Vlahavas, I. (2016). Multi-target regression via input space expansion: Treating targets as inputs. Machine Learning, 104(1), 55–98.
Stojanova, D., Ceci, M., Appice, A., & Džeroski, S. (2012). Network regression with predictive clustering trees. In Data mining and knowledge discovery (pp. 1–36).
Stojanova, D., Panov, P., Gjorgjioski, V., Kobler, A., & Džeroski, S. (2010). Estimating vegetation height and canopy cover from remotely sensed data with machine learning. Ecological Informatics, 5(4), 256–266.
Struyf, J., & Džeroski, S. (2006). Constraint based induction of multi-objective regression trees. In Proceedings of the 4th international workshop on knowledge discovery in inductive databases KDID—LNCS (Vol. 3933, pp. 222–233). Springer.
Szymański, P., Kajdanowicz, T., & Kersting, K. (2016). How is a data-driven approach better than random choice in label space division for multi-label classification? Entropy, 18(8), 282.
Tsoumakas, G., Spyromitros-Xioufis, E., Vrekou, A., & Vlahavas, I. (2014). Multi-target regression via random linear target combinations. In Machine learning and knowledge discovery in databases: ECML-PKDD 2014, LNCS (Vol. 8726, pp. 225–240).
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multi-label classification. In Proceedings of the 18th European conference on machine learning (pp. 406–417).
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., & Blockeel, H. (2008). Decision trees for hierarchical multi-label classification. Machine Learning, 73(2), 185–214.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Los Altos: Morgan Kaufmann.
Xu, S., An, X., Qiao, X., Zhu, L., & Li, L. (2013). Multi-output least-squares support vector regression machines. Pattern Recognition Letters, 34(9), 1078–1084.
Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4), 597–604.
Ženko, B. (2007). Learning predictive clustering rules. Ph.D. thesis, Faculty of Computer Science, University of Ljubljana, Ljubljana, Slovenia.
Zhang, W., Liu, X., Ding, Y., & Shi, D. (2012). Multi-output LS-SVR machine in extended feature space. In 2012 IEEE international conference on computational intelligence for measurement systems and applications (CIMSA) (pp. 130–134). IEEE.
Acknowledgements
We acknowledge the financial support of the Slovenian Research Agency via the grant P2-0103 and a young researcher grant to MB, as well as the European Commission, through the grants MAESTRA (Learning from Massive, Incompletely annotated, and Structured Data) and HBP (The Human Brain Project), SGA1 and SGA2. SD also acknowledges support by the Slovenian Research Agency (via grants J4-7362, L2-7509, and N2-0056), the European Commission (project LANDMARK) and ARVALIS (project BIODIV). The computational experiments presented here were executed on a computing infrastructure from the Slovenian Grid (SLING) initiative.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Michelangelo Ceci and Toon Calders.
Appendices
Appendix A: Average rank diagrams for ROS variants
Below we provide average rank diagrams for all considered ROS variants (BagROS, RFROS, ETROS) for all considered values of the parameter \(v \in \left\{ \frac{1}{4},\frac{1}{2}, \frac{3}{4},\frac{1}{\sqrt{T}}\right\} \) and the two types of prediction averaging functions. Each average rank diagram corresponds to one combination of values of the above parameters, for which it compares ensembles of different sizes (10, 25, 50, 75, 100, 150 and 250 base models). The saturation point is the lowest number of trees in the ensemble for which the performance is not significantly different from that of the best. Saturation points are shown in brackets next to the value of the parameter v.
1.1 BagROS variants saturation
The average rank diagrams for the BagROS variants are given in Figs. 5 and 6.
1.2 RFROS variants saturation
The average rank diagrams for the RFROS variants are given in Figs. 7 and 8.
1.3 ETROS variants saturation
The average rank diagrams for the ETROS variants are given in Figs. 9 and 10.
Appendix B: Performance results
The complete performance results for the different ensemble types (bagging, RF, ET) and variants (ST, MT, ROS) are given below. Tables 5, 6 and 7 contain results in terms of aRRMSE, while Tables 8, 9 and 10 contain results in terms of overfitting scores.
Appendix C: Performance results compared to other output space transformation methods
This section includes the results of the comparison of the performance of ROS ensembles with that of three competing methods: two variants of the random projections method proposed by Joly (2017) and RLC ensembles proposed by Tsoumakas et al. (2014). The predictive performance is measured in terms of aRRMSE.
Method parameters All ensembles contain 100 base models. ROS ensembles were parametrized as described in Sect. 5.2. The random projections variants (RPRF, RPET) use \(m=\log (T)\) components in the projected output space, where T is the number of target attributes. In addition, Rademacher random projections were used for the output space transformations. RPRF used \(k=\sqrt{X}\) randomly chosen input features to calculate splits, where X is the number of input features, while RPET used \(k=X\). The minimal number of instances allowed in a leaf node was set to 1 for both variants. The code for both variants of random projections is available at https://github.com/arjoly/randomoutputtrees. RLC was parametrized to use gradient boosting with a 4-terminal-node regression tree as the base regressor, with a learning rate of 0.1 and 100 boosting iterations. The number of targets that participate in the random linear combinations was set to \(k=2\). RLC is implemented as part of the MULAN library, available at http://mulan.sourceforge.net (Table 11).
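The Rademacher output space transformation used by the random projections baselines can be sketched as follows. This is an illustrative sketch under our own assumptions (function name, scaling by \(1/\sqrt{m}\), and a fixed seed); it is not the code from the cited repository:

```python
import numpy as np

def rademacher_projection(Y, m, seed=0):
    """Project an (n_samples, n_targets) output matrix Y onto m random
    components using a Rademacher matrix, whose entries are +1 or -1
    with equal probability. Base models are then trained to predict
    the projected outputs instead of the original targets.
    """
    rng = np.random.default_rng(seed)
    n_targets = Y.shape[1]
    # Rademacher projection matrix, scaled to roughly preserve norms
    R = rng.choice([-1.0, 1.0], size=(n_targets, m))
    return Y @ R / np.sqrt(m), R
```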
Cite this article
Breskvar, M., Kocev, D., & Džeroski, S. Ensembles for multi-target regression with random output selections. Mach Learn 107, 1673–1709 (2018). https://doi.org/10.1007/s109940185744y