Multi-target regression via input space expansion: treating targets as inputs
Abstract
In many practical applications of supervised learning the task involves the prediction of multiple target variables from a common set of input variables. When the prediction targets are binary the task is called multi-label classification, while when the targets are continuous the task is called multi-target regression. In both tasks, target variables often exhibit statistical dependencies and exploiting them in order to improve predictive accuracy is a core challenge. A family of multi-label classification methods address this challenge by building a separate model for each target on an expanded input space where other targets are treated as additional input variables. Despite the success of these methods in the multi-label classification domain, their applicability and effectiveness in multi-target regression has not been studied until now. In this paper, we introduce two new methods for multi-target regression, called Stacked Single-Target and Ensemble of Regressor Chains, by adapting two popular multi-label classification methods of this family. Furthermore, we highlight an inherent problem of these methods, namely a discrepancy of the values of the additional input variables between training and prediction, and develop extensions that use out-of-sample estimates of the target variables during training in order to tackle this problem. The results of an extensive experimental evaluation carried out on a large and diverse collection of datasets show that, when the discrepancy is appropriately mitigated, the proposed methods attain consistent improvements over the independent regressions baseline. Moreover, two versions of Ensemble of Regressor Chains perform significantly better than four state-of-the-art methods, including regularization-based multi-task learning methods and a multi-objective random forest approach.
Keywords
Multi-target regression · Multi-label classification · Stacking · Chaining

1 Introduction
Multi-target regression (MTR), also known as multivariate or multi-output regression, refers to the task of predicting multiple continuous variables using a common set of input variables. Such problems arise in various fields including ecological modeling (Kocev et al. 2009; Dzeroski et al. 2000) (e.g. predicting the abundance of plant species using water quality measurements), economics (Ghosn and Bengio 1996) (e.g. predicting stock prices from econometric variables) and energy (e.g. predicting energy production in solar/wind farms using historical measurements and weather forecast information). Given the importance and diversity of its applications, it is not surprising that research on this topic started as early as 40 years ago in Statistics (Izenman 1975).
Recently, a closely related task called multi-label classification (MLC) (Tsoumakas et al. 2010; Zhang and Zhou 2014) has received increased attention by Machine Learning researchers. Similarly to MTR, MLC deals with the prediction of multiple variables using a common set of input variables. However, prediction targets in MLC are binary. In fact, the two tasks can be thought of as instances of the more general learning task of multi-target prediction where targets can be continuous, binary, ordinal, categorical or even of mixed type. The baseline approach of learning a separate model for each target applies to both MTR and MLC. Moreover, they share the same core challenge of exploiting dependencies between targets (in addition to dependencies between targets and inputs) in order to improve prediction accuracy, as acknowledged by researchers working in both tasks (e.g. Izenman 2008; Dembczynski et al. 2012). Despite their commonalities, MTR and MLC have typically been treated in isolation and only a few works (Blockeel et al. 1998; Weston et al. 2002; Teh et al. 2005; Balasubramanian and Lebanon 2012) have given a general formulation of their key ideas, recognizing the dual applicability of their approaches.
Motivated by the tight connection between the two tasks, this paper looks at a family of MLC methods that, despite being almost directly applicable to MTR problems, have not been applied so far in this domain. In particular, we consider methods that decompose the MLC task into a series of binary classification tasks, one for each label. This category includes the typical one-versus-all or Binary Relevance approach that assumes label independence, but also approaches that model label dependencies by building models that treat other labels as additional input variables (meta-inputs). In this work we adapt two popular methods of this kind (Godbole and Sarawagi 2004; Read et al. 2011) for MTR, contributing two new MTR methods: Stacked Single-Target (SST) and Ensemble of Regressor Chains (ERC). Both methods have been very successful in the MLC domain and provided inspiration for many subsequent works (Cheng and Hüllermeier 2009; Dembczynski et al. 2010; Kumar et al. 2012; Read et al. 2014).
Although the adaptation is trivial (as it basically consists of employing a regression instead of a binary classification algorithm to solve each single-target prediction task), it widens the applicability of existing approaches and increases our understanding of challenges shared by both learning tasks, such as the modeling of target dependencies. This kind of abstraction of key ideas from solutions tailored to related problems can sometimes offer additional advantages, such as improving the modularity and conceptual simplicity of learning techniques and avoiding reinvention of the same solutions.^{1}
In addition to evaluating the direct adaptations of the corresponding MLC methods in the MTR domain, we also take a careful look at the treatment of targets as additional input variables and spot a shortcoming that was overlooked in the original MLC formulations of both methods. Specifically, we notice that in both methods the values of the meta-inputs are generated differently between training and prediction, causing a discrepancy that is shown to drastically downgrade their performance. To tackle this problem, we develop extended versions of the two methods that manage to decrease the discrepancy by using out-of-sample estimates of the targets during training. These estimates are obtained via an internal cross-validation methodology.
The performance of the proposed methods is comprehensively analyzed based on a large experimental study that includes 18 diverse real-world datasets, 14 of which are used for the first time in this paper and are made publicly available for future benchmarks. The experimental results reveal that, affected by the discrepancy problem, the direct adaptations of the corresponding MLC methods fail to obtain better accuracy than the baseline approach that performs independent regressions. On the other hand, the extended versions obtain consistent improvements against the baseline, confirming the effectiveness of the proposed solution. Furthermore, extended versions of ERC obtain significantly better accuracy than state-of-the-art methods, including a method based on ensembles of multi-objective decision trees (Kocev et al. 2007) and a recent regularization-based multi-task learning method (Jalali et al. 2010, 2013). Moreover, it is shown that, compared to the rest of the methods, the extended versions of ERC are associated with the smallest risk of decreasing the accuracy of the baseline, an appealing property.
The rest of the paper is organized as follows: Sect. 2 presents the SST and ERC methods and describes the discrepancy problem and the proposed solution. Section 3 discusses related work from the MTR field, including well-known statistical procedures and multi-task learning methods, and points out differences with previous work on the discrepancy problem. The details of the experimental setup (method configuration, evaluation methodology, datasets) are given in Sect. 4, while Sect. 5 presents and discusses the experimental results. Finally, Sect. 6 offers our conclusion and outlines future work directions.
2 Methods
We first formally describe the MTR task and provide the notation that will be used subsequently for the description of the methods. Let \(\mathbf {X}\) and \(\mathbf {Y}\) be two random vectors where \(\mathbf {X}\) consists of d input variables \(X_1,\ldots ,X_d\) and \(\mathbf {Y}\) consists of m target variables \(Y_1,\ldots ,Y_m\). We assume that samples of the form \((\mathbf {x,y})\) are generated i.i.d. by some source according to a joint probability distribution \(\mathbf {P}(\mathbf {X,Y})\) on \(\mathscr {X} \times \mathscr {Y}\) where \(\mathscr {X}=R^d\) ^{2} and \(\mathscr {Y}=R^m\) are the domains of \(\mathbf {X}\) and \(\mathbf {Y}\) and are often referred to as the input and the output space. In a sample \((\mathbf {x,y})\), \(\mathbf {x}=[x_1,\ldots ,x_d]\) is the input vector and \(\mathbf {y}=[y_1,\ldots ,y_m]\) is the output vector which are realizations of \(\mathbf {X}\) and \(\mathbf {Y}\) respectively. Given a set \(D=\{(\mathbf {x}^1,\mathbf {y}^1),\ldots ,(\mathbf {x}^n,\mathbf {y}^n)\}\) of n training examples, the goal in MTR is to learn a model \(\mathbf {h}:\mathscr {X}\rightarrow \mathscr {Y}\) that given an input vector \(\mathbf {x}\), is able to predict an output vector \(\hat{\mathbf {y}} = \mathbf {h(x)}\) that best approximates the true output vector \(\mathbf {y}\).
In the baseline Single-Target (ST) method, a multi-target model \(\mathbf {h}\) is comprised of m single-target models \(h_j:\mathscr {X} \rightarrow R\) where each model \(h_j\) is trained on a transformed training set \(D_j = \{(\mathbf {x}^{1},y_{j}^{1}),\ldots ,(\mathbf {x}^{n},y_{j}^{n})\}\) to predict the value of a single target variable \(Y_j\). This way, target variables are modeled independently and no attempt is made to exploit potential dependencies between them. Despite the simplicity of the ST approach, several empirical studies (e.g. Luaces et al. 2012) have shown that Binary Relevance, its MLC counterpart, often obtains comparable performance with more sophisticated MLC methods that model label dependencies, especially in cases where the underlying single-target prediction model is well fitted to the data (Dembczynski et al. 2012; Read and Hollmén 2014, 2015). A theoretical explanation of these results was offered by Dembczynski et al. (2012) who showed that modeling the marginal conditional distributions \(P(Y_i \mid \mathbf {x})\) of the labels (as done by Binary Relevance) can be sufficient for getting good results in multi-label losses whose risk minimizers can be expressed in terms of marginal distributions (e.g. Hamming loss).
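The ST baseline amounts to a simple loop over targets. The sketch below assumes a plain least-squares base regressor (any single-target regression algorithm could be substituted); the function names are chosen here purely for illustration:

```python
import numpy as np

def fit_st(X, Y):
    # One least-squares model per target variable; a bias term is handled
    # via an appended column of ones. Fitting all m targets at once with
    # lstsq is equivalent to m independent single-target fits.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)  # shape: (d+1, m)
    return W

def predict_st(W, X):
    # Each target is predicted independently from the shared inputs.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W  # shape: (n, m)
```

Because the targets are modeled independently, any statistical dependency between them is simply ignored at this stage.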
2.1 Stacked single-target
Stacked single-target (SST) is inspired by the Stacked Binary Relevance method (Godbole and Sarawagi 2004), where the idea of stacked generalization (Wolpert 1992) was applied in an MLC context. The training of SST consists of two stages. In the first stage, m independent single-target models \(h_j:\mathscr {X} \rightarrow R\) are learned as in ST. However, instead of directly using these models for prediction, SST involves an additional training stage where a second set of m meta models \(h'_j: \mathscr {X} \times R^{m} \rightarrow R\) are learned, one for each target \(Y_j\). Each meta model \(h'_j\) is learned on a transformed training set \(D'_j=\{(\mathbf {x}'^{1},y^1_j),\dots ,(\mathbf {x}'^{n},y^n_j)\}\), where the original input vectors of the training examples (\(\mathbf {x}^{i}\)) have been augmented by estimates of the values of their target variables (\(\hat{y}^i_1,\ldots ,\hat{y}^i_m\)) to form expanded input vectors \(\mathbf {x}'^{i}=[\mathbf {x}^{i},\hat{y}^i_1,\ldots ,\hat{y}^i_m]\). These estimates are obtained by applying the first stage models to the examples of the training set.
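The two training stages can be sketched as follows, again assuming a least-squares base regressor and using in-sample first-stage estimates as described above (helper names are illustrative):

```python
import numpy as np

def _lls(X, Y):
    # Least-squares fit with a bias column; stands in for any base regressor.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def _apply(W, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

def fit_sst(X, Y):
    # Stage 1: m independent single-target models, exactly as in ST.
    W1 = _lls(X, Y)
    # Stage 2: meta models trained on the input vectors augmented with the
    # stage-1 estimates of all m targets (the expanded input space).
    Y_hat = _apply(W1, X)
    W2 = _lls(np.hstack([X, Y_hat]), Y)
    return W1, W2

def predict_sst(models, X):
    W1, W2 = models
    Y_hat = _apply(W1, X)                      # first-stage estimates
    return _apply(W2, np.hstack([X, Y_hat]))   # corrected second-stage predictions
```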
2.2 Ensemble of regressor chains
Regressor Chains (RC) is derived from Classifier Chains (Read et al. 2011), a recently proposed MLC method based on the idea of chaining binary models. The training of RC consists of selecting a random chain (permutation) of the set of target variables and then building a separate regression model for each target. Assuming that the chain \(C = \{Y_1 , Y_2 ,\ldots , Y_m\}\) (C represents an ordered set) is selected, the first model concerns the prediction of \(Y_1\), has the form \(h_1: \mathscr {X} \rightarrow R\) and is the same as the model built by the ST method for this target. The difference in RC is that subsequent models \(h_{j}, j>1\) are learned on transformed training sets \(D'_j = \{(\mathbf {x}'^{1}_j,y^{1}_{j}),\ldots ,(\mathbf {x}'^{n}_j,y^{n}_{j})\}\), where the original input vectors of the training examples have been augmented by the actual values of all previous targets of the chain to form expanded input vectors \(\mathbf {x}'^{i}_j = [x^i_1,\ldots ,x^i_d,y^{i}_{1},\ldots ,y^{i}_{j-1}]\). Thus, the models built for targets \(Y_{j}\) have the form \(h_j: \mathscr {X} \times R^{j-1} \rightarrow R\).
One notable property of RC is that it is sensitive to the selected chain ordering. To alleviate this issue, Read et al. (2011) proposed an ensemble scheme called Ensemble of Classifier Chains where a set of k Classifier Chains models with different random chains are built on bootstrap samples of the training set and the final predictions come from majority voting. This scheme has been shown to consistently improve the accuracy of a single Classifier Chain in the classification domain. We apply the same idea on RC and compute the final predictions by taking the mean of the k estimates for each target. The resulting method is called Ensemble of Regressor Chains (ERC).
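The chaining and ensemble logic can be sketched as follows, assuming a least-squares base regressor and using the actual target values during training, as described above (helper names are illustrative):

```python
import numpy as np

def _fit(X, y):
    # Least-squares single-target regressor with a bias column
    # (stands in for any base regression algorithm).
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def _pred(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def fit_rc(X, Y, chain):
    # One Regressor Chain: the model for each target is trained on the inputs
    # augmented with the actual values of all preceding targets in the chain.
    models, Z = [], X
    for j in chain:
        models.append(_fit(Z, Y[:, j]))
        Z = np.hstack([Z, Y[:, j:j + 1]])
    return models

def predict_rc(models, chain, X, m):
    # At prediction time the chain must rely on its own estimates instead.
    Z, out = X, np.empty((len(X), m))
    for w, j in zip(models, chain):
        y = _pred(w, Z)
        out[:, j] = y
        Z = np.hstack([Z, y[:, None]])
    return out

def fit_predict_erc(X, Y, X_test, k=10, seed=0):
    # Ensemble of Regressor Chains: k random chains built on bootstrap
    # samples; final prediction is the per-target mean of the k estimates.
    rng = np.random.default_rng(seed)
    m, preds = Y.shape[1], []
    for _ in range(k):
        chain = rng.permutation(m)
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        models = fit_rc(X[idx], Y[idx], chain)
        preds.append(predict_rc(models, chain, X_test, m))
    return np.mean(preds, axis=0)
```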
2.3 Theoretical insights into stacking and chaining

Following Dembczynski et al. (2012), two types of statistical dependencies can be distinguished among the targets:

- unconditional, where \(P(\mathbf {Y}) \ne \prod _{i=1}^{m} P(Y_i)\); and
- conditional, where \(P(\mathbf {Y} \mid \mathbf {x}) \ne \prod _{i=1}^{m} P(Y_i \mid \mathbf {x})\).
Another interesting interpretation is offered by Read and Hollmén (2015) who show that Binary Relevance can (under certain conditions) achieve optimal performance in any dataset, and that improvements over the independent approach are often the result of using an inadequate base learner. Under this view, stacking and chaining can be considered as 'deep' independent learners that owe their improved performance over Binary Relevance (when the same base learner is used) to the use of labels as nodes in the inner layers of a deep neural network. These nodes represent readily available^{4} (in the training phase), high-level transformations of the original inputs. This interpretation of stacking and chaining applies directly to the MTR versions of these methods that we present here.
From a bias-variance perspective, we observe that by introducing additional features to single-target models, SST and ERC have the effect of decreasing their bias at the expense of an increased variance. This suggests that whenever the increase in variance is outweighed by the decrease in bias, one should expect gains in generalization performance over ST. This also hints that both methods will probably benefit from being combined with a base regressor that includes a variance reduction mechanism like bagged (Breiman 1996) regression trees.^{5} As shown in Munson and Caruana (2009), bagged trees not only ignore irrelevant features but can also exploit features that contain useful but noisy information. Both properties are very important in the context of SST and ERC because some of the extra features that they introduce might be irrelevant (e.g. whenever two target variables are statistically independent) and/or noisy (as discussed in the following subsection).
2.4 Generation of meta-inputs
Both SST and ERC are based on the same core idea of treating other prediction targets as additional input variables. These meta-inputs differ from ordinary inputs in the sense that while their actual values are available at training time, they are missing during prediction. Thus, during prediction both methods have to rely on estimates of these values which come either from ST (in the case of SST) or from RC (in the case of ERC) models built on the training set. An important question that is answered differently by each method is the following: What type of values should be used at training time for the meta-inputs? SST uses estimates of the variables obtained by applying the first stage models on the training examples, while ERC uses their actual values. We observe that in both cases a core assumption of supervised learning is violated: that the training and testing data should be independently and identically distributed. In the SST case, the in-sample estimates that are used to form the training examples of the second stage models will typically be more accurate than the out-of-sample estimates used at prediction time. The situation is even more problematic in the case of ERC since the actual target values are used during training. In both cases, some of the input variables that are used by the underlying regression algorithm during model induction become noisy (or noisier in the case of SST) at prediction time and, as a result, the induced model might wrongly estimate (overestimate) their usefulness.
To mitigate this problem, we propose the use of out-of-sample estimates of the targets during training in order to increase the compatibility between the training values of the target variables and the values used during prediction. One way to obtain such estimates is to use a subset of the training set for building the first stage ST models (in the case of SST) or the RC models (in the case of ERC) and apply them to the held-out part. However, this approach would lead to reduced second stage training sets for SST as only the examples of the held-out set would be available for training the second stage models. The same holds for ERC where the chained RC models would be trained on training sets of decreasing size. The solution that we propose to this problem is the use of an internal f-fold cross-validation approach that allows obtaining out-of-sample estimates of the target variables for all the training examples. Compared to the actual target values or the in-sample estimates of the targets, the cross-validation estimates are expected to better resemble the values that are used during prediction. As a result, we expect that the contribution of the meta-inputs to the prediction of each target will be better estimated by the underlying regression algorithm.
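The internal f-fold cross-validation procedure for generating out-of-sample estimates can be sketched as follows (a least-squares base regressor is assumed; `cv_estimates` is an illustrative name):

```python
import numpy as np

def cv_estimates(X, Y, f=5, seed=0):
    # Out-of-sample target estimates via internal f-fold cross-validation:
    # each training example is predicted by models that did not see it,
    # so every example still receives a meta-input value.
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % f     # random assignment to f folds
    Y_hat = np.empty_like(Y, dtype=float)
    for k in range(f):
        tr, te = folds != k, folds == k
        # Train on the (f-1)/f portion of the data...
        Xb = np.hstack([X[tr], np.ones((tr.sum(), 1))])
        W, *_ = np.linalg.lstsq(Xb, Y[tr], rcond=None)
        # ...and predict the held-out fold.
        Y_hat[te] = np.hstack([X[te], np.ones((te.sum(), 1))]) @ W
    return Y_hat
```

These estimates would then replace the in-sample estimates (SST) or the actual target values (ERC) when forming the expanded training sets.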
2.5 Discussion
Besides the type of values that each method uses for the meta-inputs at training time, SST and ERC have additional conceptual differences. A notable one is that the model built for each target \(Y_j\) by SST uses all other targets as inputs, while in RC each model involves only targets that precede \(Y_j\) in a random chain. As a result, the model built for \(Y_j\) by RC cannot benefit from statistical relationships with targets that appear later than \(Y_j\) in the chain. This potential disadvantage of RC is partially overcome by ERC since each target is included in multiple random chains and, therefore, the probability that other targets will precede it is increased. At first glance, SST seems to represent a more straightforward way of including all the available information about other targets. However, we should take into account that, since both methods rely on estimates of the meta-inputs at prediction time (as discussed in the previous subsection), the more meta-inputs that are included in the input space, the greater the risk of error accumulation at prediction time. From this perspective, ERC seems to adopt a more cautious approach than SST. On the other hand, the estimates of the meta-inputs that are used by the second stage models in SST come from independent models, while the estimates of the meta-inputs used by each model in RC (and ERC) come from models that include information about other targets and thus involve a higher risk of becoming noisy. Overall, there seems to be a trade-off between using the additional information available in the targets and the noise that this information comes with. Which of the two methods (and which variant) achieves a better balance in this trade-off is revealed by the experimental analysis in Sect. 5.
2.6 Complexity analysis
In this section we discuss the time complexity of all variants of the proposed methods at training and test time, given a single-target regression algorithm with training complexity \(O(g_{tr}(n,d))\) and test complexity \(O(g_{te}(n,d))\) for a dataset with n examples and d input variables. The training and test complexities of the ST method are \(O(m {\cdot } g_{tr}(n,d))\) and \(O(m {\cdot } g_{te}(n,d))\) respectively, as it involves training and querying m independent single-target models.
With respect to SST, the method builds \(2 {\cdot } m\) models at training time, all of which are queried at prediction time. In all variants of the method, half of the models are built on the original input space and half on an input space augmented by m meta-inputs. Thus, in the case of SST\(_{true}\), where the meta-inputs are readily available, the training and test complexities are \(O(m {\cdot } (g_{tr}(n,d) {+} g_{tr}(n,d{+}m)))\) and \(O(m {\cdot } (g_{te}(n,d) {+} g_{te}(n,d{+}m)))\) respectively. Given that in most cases (see Table 3) the number of targets is much smaller than the number of inputs, i.e. \(m \ll d\), the effective training and test complexities of SST\(_{true}\) become \(O(m {\cdot } g_{tr}(n,d))\) and \(O(m {\cdot } g_{te}(n,d))\) respectively, i.e. the same as ST's complexities. SST\(_{train}\) and SST\(_{cv}\) have the same test complexity as SST\(_{true}\) but a larger training complexity because of the process of generating estimates for the meta-inputs. In the SST\(_{train}\) case, the training complexity is \(O(m {\cdot } g_{tr}(n,d) {+} m {\cdot } g_{te}(n,d))\) because the m first-stage models are applied to obtain estimates for all the training examples. For most regression algorithms (e.g. regression trees), the computational cost of making predictions for n instances is much smaller than the cost of training on n examples. For instance, the training complexity of a typical binary regression tree learner is \(O(n {\cdot } d^2)\) (Su and Zhang 2006) while the test complexity is \(O(n {\cdot } \log _2 d)\). Thus, practically, the training complexity of SST\(_{train}\) is similar to that of SST\(_{true}\). When it comes to SST\(_{cv}\), in addition to the m first-stage models, f additional models are built on \(\frac{f-1}{f} {\cdot } n\) examples each. Therefore, the training complexity of SST\(_{cv}\) is \(O(m {\cdot } g_{tr}(n,d) + m {\cdot } f {\cdot } g_{tr}(\frac{f-1}{f} {\cdot } n,d) + m {\cdot } g_{te}(n,d)) \approx O(f {\cdot } m {\cdot } g_{tr}(n,d) {+} m {\cdot } g_{te}(n,d))\). Given that \(g_{te}(n,d) \ll g_{tr}(n,d)\), we conclude that the training complexity of SST\(_{cv}\) is roughly f times ST's training complexity. Also, note that SST\(_{train}\) and SST\(_{cv}\) can be parallelized stage-wise both at training and at prediction time, i.e. all single-target models within the same level can be trained and queried independently, while SST\(_{true}\) is fully parallelizable at training time (all single-target models can be trained independently) and stage-wise parallelizable at test time.
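As a concrete sanity check, instantiating the generic expression for SST\(_{cv}\) with the regression tree costs cited above (an assumed choice of base learner) gives:

```latex
O\big(f \cdot m \cdot g_{tr}(n,d) + m \cdot g_{te}(n,d)\big)
  = O\big(f \cdot m \cdot n \cdot d^2 + m \cdot n \cdot \log_2 d\big)
  = O\big(f \cdot m \cdot n \cdot d^2\big),
```

i.e. exactly f times the \(O(m \cdot n \cdot d^2)\) training cost of ST with the same base learner, since the prediction term is asymptotically dominated.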
Table 1 Training and test complexities of the proposed methods with single-core and multi-core implementations

| Method | Variant | Training (single-core) | Training (multi-core) | Test (single-core) | Test (multi-core) |
|--------|---------|------------------------|-----------------------|--------------------|-------------------|
| SST | true | \(O(m {\cdot } g_{tr}(n,d))\) | \(O(g_{tr}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) | \(O(g_{te}(n,d))\) |
| SST | train | \(O(m {\cdot } g_{tr}(n,d))\) | \(O(g_{tr}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) | \(O(g_{te}(n,d))\) |
| SST | cv | \(O(f {\cdot } m {\cdot } g_{tr}(n,d))\) | \(O(g_{tr}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) | \(O(g_{te}(n,d))\) |
| ERC | true | \(O(k {\cdot } m {\cdot } g_{tr}(n,d))\) | \(O(g_{tr}(n,d))\) | \(O(k {\cdot } m {\cdot } g_{te}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) |
| ERC | train | \(O(k {\cdot } m {\cdot } g_{tr}(n,d))\) | \(O(m {\cdot } g_{tr}(n,d))\) | \(O(k {\cdot } m {\cdot } g_{te}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) |
| ERC | cv | \(O(k {\cdot } f {\cdot } m {\cdot } g_{tr}(n,d))\) | \(O(m {\cdot } g_{tr}(n,d))\) | \(O(k {\cdot } m {\cdot } g_{te}(n,d))\) | \(O(m {\cdot } g_{te}(n,d))\) |
Table 1 summarizes the training and test complexities of each method assuming a single-core implementation, as well as the minimum possible complexity when a multi-core implementation is used. Note that, as shown in the table, SST\(_{cv}\) and ERC\(_{cv}\) have the same multi-core complexity as SST\(_{train}\) and ERC\(_{train}\) respectively, because their internal cross-validation procedure can also be parallelized.
3 Related work
3.1 Multitarget regression
MTR was first studied in Statistics under the term multivariate regression with Reduced Rank Regression (RRR) (Izenman 1975), FICYREG (Merwe and Zidek 1980) and two-block PLS (Wold 1985) (the multiple response version of PLS) being three of the earliest methods. Among these methods, two-block PLS has been used more widely, especially in Chemometrics. More recently, the Curds and Whey (C&W) method was proposed (Breiman and Friedman 1997) and was found to outperform RRR, FICYREG and two-block PLS. As noted by Breiman and Friedman (1997), C&W, RRR and FICYREG can all be expressed using the same generic form \({\tilde{\mathbf{y}}= \mathbf{B} \hat{\mathbf{y}}}\), where \( {\hat{\mathbf{y}}}\) are estimates obtained by applying ordinary least squares regression on the target variables and \(\mathbf {B}\) is a matrix that modifies these estimates in order to obtain a more accurate prediction \({\tilde{\mathbf{y}}}\), under the assumption that the targets are correlated.
In all methods, \(\mathbf {B}\) can be expressed as \(\mathbf {B} = {\hat{\mathbf{T}}}^{-1} \mathbf {D} {\hat{\mathbf{T}}}\), where \({\hat{\mathbf{T}}}\) is the matrix of sample canonical coordinates and \(\mathbf {D}\) is a diagonal "shrinking" matrix that is obtained differently in each method. SST is highly similar to these methods but allows a more general formulation of the MTR problem. Firstly, SST does not impose any restriction on the family of models that generate the uncorrected (first stage) estimates, in contrast to these approaches that use estimates obtained from least squares regression. Secondly, the correction of the estimates applied by SST comes from a learning procedure that jointly considers target and input variables rather than target variables alone.
As shown by Breiman and Friedman (1997), the above methods can be described by an alternative but equivalent scheme. According to this, \(\mathbf {y}\) is first transformed to the canonical coordinate system \(\mathbf {y}' = {\hat{\mathbf{T}}} \mathbf {y}\), then separate least squares regression is performed on each \(\mathbf {y}'\) to obtain \({\hat{\mathbf{y}}'}\), these estimates are scaled by \(\mathbf {D}\) to obtain \({\tilde{\mathbf{y}}'} = \mathbf {D} {\hat{\mathbf{y}}'}\) and finally transformed back to the original output space \({\tilde{\mathbf{y}}} = {\hat{\mathbf{T}}^{-1}} {\tilde{\mathbf{y}}'}\). As discussed by Dembczynski et al. (2012), from this perspective, these methods fall under a more general scheme where the output space is first transformed, single-target regressors are then trained on the transformed output space and an inverse transformation is performed (possibly along with shrinkage/regularization) to obtain predictions for the original targets. Due to its generality, this scheme has been adopted by a number of recent methods in both MLC (Hsu 2009; Zhang and Schneider 2011, 2012; Tai and Lin 2012) and MTR (Balasubramanian and Lebanon 2012; Tsoumakas et al. 2014).
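In symbols, and assuming that the least-squares estimates commute with the linear transformation so that \(\hat{\mathbf{y}}' = \hat{\mathbf{T}} \hat{\mathbf{y}}\), the alternative scheme composes back to the generic form:

```latex
\tilde{\mathbf{y}}
  = \hat{\mathbf{T}}^{-1} \tilde{\mathbf{y}}'
  = \hat{\mathbf{T}}^{-1} \mathbf{D} \hat{\mathbf{y}}'
  = \hat{\mathbf{T}}^{-1} \mathbf{D} \hat{\mathbf{T}} \hat{\mathbf{y}}
  = \mathbf{B} \hat{\mathbf{y}}
```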
A large number of MTR methods are derived from the predictive clustering tree (PCT) framework (Blockeel et al. 1998). The main difference between the PCT algorithm and a standard decision tree is that the variance and the prototype functions are treated as parameters that can be instantiated to fit the given learning task. One such instantiation for MTR tasks is multi-objective decision trees (MODTs), where the variance function is computed as the sum of the variances of the targets, and the prototype function is the vector mean of the target vectors of the training examples falling in each leaf (Blockeel et al. 1998, 1999). Bagging and random forest ensembles of MODTs were developed by Kocev et al. (2007) and were found significantly more accurate than MODTs and equally good or better than ensembles of single-objective decision trees for both regression and classification tasks. In particular, multi-objective random forests yielded better performance than multi-objective bagging.
Methods that deal with the prediction of multiple target variables can be found in the literature of the related learning task of multi-task learning. According to Caruana (1997), multi-task learning is a form of inductive transfer (Pratt 1992) where the aim is to improve generalization accuracy on a set of related tasks by using a shared representation that exploits commonalities between them. This definition implies that a multi-task method should be able to deal with problems where different prediction tasks do not necessarily share the same set of training examples or descriptive features and, moreover, each task can have a different data type. Thus, multi-task learning is actually a generalization of MTR.
Artificial neural networks (ANNs) are very well suited for multi-task problems because they can be naturally extended to support multiple outputs and offer flexibility in defining how inputs are shared between tasks. Thus, it is not surprising that most of the earliest multi-task methods were based on ANNs. Caruana (1994), for example, proposed a method where backpropagation is used to train a single ANN with multiple outputs (connected to the same hidden layers), and showed that it has better generalization performance compared to multiple single-task ANNs. A different architecture was used by Baxter (1995) where only the first hidden layers are shared and subsequent layers are specific to each task. The question of how much sharing is better when multi-task ANNs are applied for stock return prediction was explored by Ghosn and Bengio (1996) who concluded that a partial sharing of network parameters is preferable compared to full or no sharing. More recently, Collobert and Weston (2008) applied a deep multi-task neural network architecture for natural language processing.
A large number of multi-task learning methods stem from a regularization perspective.^{6} Regularization-based multi-task methods minimize a penalized empirical loss of the form \(\displaystyle \min _W \mathscr {L}(W)+\varOmega (W)\), where W is a parameter matrix that has to be estimated, \(\mathscr {L}(W)\) is an empirical loss calculated on the training data and \(\varOmega (W)\) is a regularization term that takes a different form in each method depending on the underlying task relatedness assumption. Most methods assume that all tasks are related to each other (Evgeniou and Pontil 2004; Ando and Zhang 2005; Argyriou et al. 2006, 2008; Chen et al. 2009, 2010a; Obozinski et al. 2010), while there are methods assuming that tasks are organized in structures such as clusters (Jacob et al. 2008; Zhou et al. 2011a), trees (Kim and Xing 2010) and graphs (Chen et al. 2010b). A well-studied category of methods, which are particularly useful when dealing with high-dimensional input spaces, assume that models for different tasks share a common low-rank subspace and impose a trace-norm constraint on the parameter matrix (Argyriou et al. 2006, 2008; Ji and Ye 2009). A similar category of methods constrain all models to share a common set of features (thus performing a joint feature selection), typically by applying \(L_1/L_q\)-norm (\(q>1\)) regularization (Obozinski et al. 2010). An approach that relaxes the above restrictive constraint, allowing models to leverage different extents of feature sharing, is proposed in Jalali et al. (2010, 2013).
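As illustrative instantiations of \(\varOmega(W)\) (sketched here in simplified notation that only loosely follows the cited works), the joint feature selection methods penalize the rows \(\mathbf{w}^i\) of W (one row per input feature, collecting that feature's coefficients across all tasks), while the dirty model of Jalali et al. decomposes W into a block-shared part B and a sparse task-private part S:

```latex
\varOmega_{\text{joint}}(W) = \lambda \sum_{i=1}^{d} \Vert \mathbf{w}^i \Vert_q , \qquad
\varOmega_{\text{dirty}}(W) = \lambda_B \Vert B \Vert_{1,\infty} + \lambda_S \Vert S \Vert_{1},
\quad W = B + S
```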
Finally, we would like to mention that a number of MTR methods are based on the Gaussian Processes framework (e.g., Bonilla et al. 2007; Álvarez and Lawrence 2011). These methods capture correlations between tasks through appropriate choices of covariance functions. A nice review of such methods, as well as of their relations to regularization-based multi-task approaches, can be found in Alvarez et al. (2011).
3.2 Discrepancy in meta-inputs
In the MLC domain, Senge et al. (2013a) studied how the discrepancy issue affects the performance of Classifier Chains and showed that longer chains (i.e. multi-label problems with more labels to be predicted) lead to a higher performance deterioration. In an extension of that work (Senge et al. 2013b), a “rectified” version of Classifier Chains (called Nested Stacking) was presented that uses in-sample estimates of the label variables for training, as in Stacked Binary Relevance. It was shown that this method performs better than the original Classifier Chains, especially when the label dependencies are strong. Following the opposite direction, Montañés et al. (2011) proposed AID, a method similar to Stacked Binary Relevance, and found that using the actual label values instead of (in-sample) estimates leads to better results for most multi-label evaluation measures in both AID and Stacked Binary Relevance.
Our work is the first to study this issue in the MTR domain.^{7} The issue is studied jointly for SST and ERC, thus allowing general conclusions to be drawn for this type of method. Furthermore, Montañés et al. (2011) and Senge et al. (2013b) compared only the use of actual target values with the use of in-sample estimates, while our comparison also includes the use of out-of-sample estimates obtained by a cross-validation procedure. Finally, Senge et al. (2013b) evaluated the use of estimates in Classifier Chains, whereas we focus on the ensemble version of the corresponding MTR method (ERC), which is expected to offer more resilience to error propagation, as discussed in Sect. 2.5.
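For illustration, out-of-sample estimates of a target can be produced with an internal cross-validation loop such as the following sketch (the function name is ours; `base_factory` is assumed to return any regressor with scikit-learn-style fit/predict):

```python
import numpy as np

def cv_estimates(base_factory, X, y, f=10, seed=0):
    """Out-of-sample estimates of y: every training example is predicted
    by a model that did not see it during training (f-fold cross-validation)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, f)
    y_hat = np.empty(n)
    for k in range(f):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(f) if j != k])
        model = base_factory()
        model.fit(X[train], y[train])
        y_hat[test] = model.predict(X[test])
    return y_hat
```

Each training example thus receives an estimate from a model that never saw it, which matches the kind of value the second-stage models will actually be fed at prediction time.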
4 Experimental setup
This section describes our experimental setup. We first present the participating methods and their parameters and provide details about their implementation in order to facilitate reproducibility of the experiments. Next, we describe the evaluation measure and explain the process that was followed for the statistical comparison of the methods. Finally, we present the datasets that we used and their main statistics.
4.1 Methods, parameters and implementation
Methods used in experiments with abbreviations and citations
Abbr.  Method  Citation 

ST  Single target  
SST\(_{true}\)  Stacked ST, true values  This paper 
SST\(_{train}\)  Stacked ST, in-sample estimates  This paper 
SST\(_{cv}\)  Stacked ST, cv estimates  This paper 
ERC\(_{true}\)  Ensemble of Regressor Chains, true values  This paper 
ERC\(_{train}\)  Ensemble of Regressor Chains, in-sample estimates  This paper 
ERC\(_{cv}\)  Ensemble of Regressor Chains, cv estimates  This paper 
MORF  Multi-Objective Random Forest  Kocev et al. (2007) 
TNR  Trace Norm Regularization multi-task learning  Argyriou et al. (2008) 
Dirty  A Dirty model for multi-task learning  Jalali et al. (2010) 
RLC  Random Linear target Combinations  Tsoumakas et al. (2014) 
The proposed methods, as well as ST and RLC, transform the multi-target regression task into a series of single-target regression tasks which can be dealt with using any standard regression algorithm. For most of the experiments, we use bagged regression trees as the base regressor. This choice was motivated in Sect. 2.3 and is further discussed in Sect. 5.1, where we present results using a variety of well-known linear and nonlinear regression algorithms. The ensemble size of all ERC variants is set to \(k=10\) RC models, each one trained using a different random chain. In datasets with fewer than 10 distinct chains, we create exactly as many RC models as the number of distinct chains. Furthermore, since the base regressor involves bootstrap sampling, we do not perform sampling in ERC, i.e. each RC model is trained using all training examples. In SST, we exclude the target being predicted by each second-stage model from the input space of that model, as we found that this choice slightly improves the performance of all variants of this method. \(f=10\) internal cross-validation folds are used in both SST\(_{cv}\) and ERC\(_{cv}\).
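As a condensed illustration of the transformation, the sketch below implements the SST\(_{train}\) variant (in-sample meta-inputs), including the exclusion of the predicted target from each second-stage model's input space. The class and factory names are ours; any fit/predict regressor (e.g. bagged trees) can serve as the base learner:

```python
import numpy as np

class SSTTrain:
    """Sketch of Stacked Single-Target with in-sample meta-inputs."""

    def __init__(self, base_factory):
        self.base_factory = base_factory

    def fit(self, X, Y):
        self.m = Y.shape[1]
        # First stage: one independent model per target (as in ST).
        self.stage1 = [self.base_factory().fit(X, Y[:, j])
                       for j in range(self.m)]
        meta = np.column_stack([h.predict(X) for h in self.stage1])
        # Second stage: input space expanded with estimates of the *other*
        # targets (the predicted target itself is excluded).
        self.stage2 = [
            self.base_factory().fit(
                np.hstack([X, np.delete(meta, j, axis=1)]), Y[:, j])
            for j in range(self.m)
        ]
        return self

    def predict(self, X):
        meta = np.column_stack([h.predict(X) for h in self.stage1])
        return np.column_stack([
            self.stage2[j].predict(np.hstack([X, np.delete(meta, j, axis=1)]))
            for j in range(self.m)
        ])
```

The cv variant differs only in that the meta-input matrix is built from cross-validation estimates rather than from the first-stage models' predictions on their own training data.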
Concerning the parameter settings of the competing methods, in MORF we use an ensemble size of 100 trees and the values suggested by Kocev et al. (2007) for the rest of its parameters. In RLC, we generate \(r=100\) new target variables by combining \(k=2\) of the original target variables (after bringing them to the [0, 1] interval). As shown in Tsoumakas et al. (2014), these values lead to near optimal results. In TNR, we minimize the squared loss function using the accelerated gradient method for trace norm minimization (Ji and Ye 2009). The regularization parameter is tuned by selecting among the values \(\{10^r : r \in \{-3,\ldots ,3\}\}\) with internal 5-fold cross-validation. Before applying TNR, we apply z-score normalization and add a bias column as suggested in Zhou et al. (2011b). Finally, Dirty is set up as suggested in Jalali et al. (2013): Input variables are scaled to the \([-1,1]\) range by dividing them by their maximum values. The regularization parameters \(\lambda _b\) and \(\lambda _s\) are tuned via internal 5-fold cross-validation (as in TNR). As suggested in Jalali et al. (2013), we set \(\lambda _b = c \sqrt{\frac{m \log d}{n}}\), where \(c \in \{10^r : r \in \{-2,\ldots ,2\}\} \) is a constant. Each distinct value of \(\lambda _b\) is paired with five values of \(\lambda _s = \frac{\lambda _b}{1+ \frac{m-1}{4} i}, i \in \{0,1,2,3,4\}\), thus respecting the \(\frac{\lambda _s}{\lambda _b} \in [\frac{1}{m},1]\) relationship dictated by the optimality conditions. In total, 25 different combinations of \(\lambda _b\) and \(\lambda _s\) are evaluated.
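The 25 \((\lambda _b, \lambda _s)\) combinations described above can be enumerated as in the following sketch (the function name is ours):

```python
import numpy as np

def dirty_lambda_grid(m, d, n):
    """Enumerate the 5x5 grid of (lambda_b, lambda_s) pairs: lambda_b is
    c * sqrt(m * log(d) / n) for c in {10^-2, ..., 10^2}, and each lambda_b
    is paired with five lambda_s values keeping lambda_s/lambda_b in [1/m, 1]."""
    base = np.sqrt(m * np.log(d) / n)
    grid = []
    for c in [10.0 ** r for r in range(-2, 3)]:
        lam_b = c * base
        for i in range(5):
            lam_s = lam_b / (1 + (m - 1) / 4 * i)
            grid.append((lam_b, lam_s))
    return grid
```

Note that \(i=0\) gives \(\lambda _s = \lambda _b\) (ratio 1) and \(i=4\) gives \(\lambda _s = \lambda _b / m\) (ratio \(1/m\)), so the whole admissible range is covered.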
All the proposed methods and the evaluation framework were implemented in Java and integrated in Mulan^{8} (Tsoumakas et al. 2011) by expanding its functionality to multi-target regression. The implementations of all single-target regression algorithms that were used to instantiate the problem transformation methods are taken from Weka.^{9} With respect to the competing methods, RLC was already integrated in Mulan, while for the purposes of this study we also integrated MORF (via a wrapper of the implementation offered in CLUS^{10}) as well as TNR and Dirty (via wrappers of the implementations offered in MALSAR (Zhou et al. 2011b)). Thus, all methods were evaluated under a common framework. In support of open science, we created a GitHub project^{11} that contains all our implementations, including code that facilitates easy replication of our experimental results.
4.2 Evaluation
To test the statistical significance of the observed differences between the methods, we follow the methodology suggested by Demsar (2006). To compare multiple methods on multiple datasets we use the Friedman test, the non-parametric alternative of the repeated-measures ANOVA. The Friedman test operates on the average ranks of the methods and checks the validity of the null hypothesis that all methods are equivalent. Here, we use an improved (less conservative) version of the test that uses the \(F_f\) instead of the \(\chi ^2_F\) statistic (Iman and Davenport 1980). When the null hypothesis of the Friedman test is rejected (\(p < 0.01\)), we proceed with the Nemenyi post-hoc test, which compares all methods to each other in order to find which methods in particular differ. Instead of reporting the outcomes of all pairwise comparisons, we employ the simple graphical presentation of the test’s results introduced by Demsar (2006): all methods being compared are placed on a horizontal axis according to their average ranks, and groups of methods that are not significantly different (at a certain significance level) are connected (see Fig. 4 for an example). To generate such a diagram, a critical difference (CD) is calculated that corresponds to the minimum difference in average ranks required for two methods to be considered significantly different. The CD depends on the number of methods, the number of datasets, and the desired significance level. Due to the known conservativeness of the Nemenyi test (Demsar 2006), we use a 0.05 significance level for computing the CD throughout the paper.
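For reference, the two test statistics and the critical difference can be computed as in the sketch below (the \(q_{0.05}\) values follow the critical-value table in Demsar 2006; function names are ours):

```python
import math

# Nemenyi critical values q_0.05 (Demsar 2006); keys are numbers of methods.
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
         7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def friedman_stats(avg_ranks, n):
    """Friedman chi^2 and the Iman-Davenport F_f statistic for k methods
    compared on n datasets, given the methods' average ranks."""
    k = len(avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, ff

def nemenyi_cd(k, n, q_table=Q_005):
    """Minimum difference in average ranks for significance at the 0.05 level."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n))
```

For instance, comparing 7 methods on the 18 datasets of Table 3 gives a CD of roughly 2.12 average-rank units.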
As the above methodology requires a single performance measurement for each method on each dataset, it is not directly applicable to multi-target evaluation, where we have multiple performance measurements (one for each target) for each method on each dataset. One option is to take the average RRMSE (aRRMSE) across all target variables within a dataset as a single performance measurement. This choice, however, has the disadvantage that a very small or large error on a single target might dominate the average, thus obscuring performance differences at the target level. Another option is to treat the RRMSE of each method on each target as a different performance measurement. In this case, the Friedman test’s assumption of independence between performance measurements might be violated. In the absence of a better solution, we perform a two-dimensional analysis (as done e.g. by Aho et al. 2012) where statistical tests are conducted both using aRRMSE (per dataset analysis) and considering the RRMSE per target as an independent performance measurement (per target analysis).
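Assuming the standard definition, in which a model's squared error on a target is normalized by the squared error of predicting the training-set mean of that target, RRMSE and aRRMSE can be computed as in the following sketch (function names are ours):

```python
import numpy as np

def rrmse(y_true, y_pred, y_train_mean):
    """Relative RMSE: model error divided by the error of always
    predicting the training-set mean of the target."""
    num = np.sum((y_pred - y_true) ** 2)
    den = np.sum((y_train_mean - y_true) ** 2)
    return np.sqrt(num / den)

def arrmse(Y_true, Y_pred, train_means):
    """Average RRMSE over the m targets of a dataset."""
    m = Y_true.shape[1]
    return np.mean([rrmse(Y_true[:, j], Y_pred[:, j], train_means[j])
                    for j in range(m)])
```

A value of 0 corresponds to perfect prediction and a value of 1 to the accuracy of the trivial mean predictor, so values below 1 indicate that a model learned something useful.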
4.3 Datasets
Name, source, number of examples, number of input variables (d) and number of target variables (m) of the datasets used in the evaluation
Dataset  Source  Examples  d  m 

edm  Karalic and Bratko (1997)  154  16  2 
sf1  Lichman (2013)  323  10  3 
sf2  Lichman (2013)  1066  10  3 
jura  Goovaerts (1997)  359  15  3 
wq  Dzeroski et al. (2000)  1060  16  14 
enb  Tsanas and Xifara (2012)  768  8  2 
slump  Yeh (2007)  103  7  3 
andro  Hatzikos et al. (2008)  49  30  6 
osales  Kaggle (2012)  639  413  12 
scpf  Kaggle (2013)  1137  23  3 
atp1d  This paper  337  411  6 
atp7d  This paper  296  411  6 
oes97  This paper  334  263  16 
oes10  This paper  403  298  16 
rf1  This paper  9125  64  8 
rf2  This paper  9125  576  8 
scm1d  This paper  9803  280  16 
scm20d  This paper  8966  61  16 
Table 3 reports the name (1st column), source (2nd column), number of examples (3rd column), number of input variables (4th column) and number of target variables (5th column) of each dataset. Detailed descriptions of all datasets are provided in “Appendix”.
5 Experimental analysis
In this section we present an extensive experimental analysis of the performance of the proposed methods. Sect. 5.1 is devoted to an exploration of the performance of ST using various well-known regression algorithms. The purpose of this investigation is to help us select an algorithm that works well on the studied datasets and use it as the base regressor in all problem transformation methods (ST, SST, ERC and RLC) in subsequent experiments. At the same time, a challenging baseline performance level is set for all multi-target methods. In Sect. 5.2 we evaluate SST\(_{train}\) and ERC\(_{true}\), the direct adaptations of the corresponding MLC methods, in order to see whether these variants obtain a competitive performance compared to ST and state-of-the-art multi-target methods. Next, in Sect. 5.3, all three meta-input generation variants (true, train, cv) of SST and ERC are evaluated and compared to ST, shedding light on the impact of the discrepancy problem on each method. After the best performing variants of each method have been identified, Sect. 5.4 compares them with the state-of-the-art. The running times of all methods are reported and compared in Sect. 5.5, and finally, the section ends with a discussion of the main outcomes of the experimental results (Sect. 5.6).
5.1 Base regressor exploration
In this subsection we explore the performance of ST on the studied domains using a variety of regression algorithms. The goal of this exploration is to help us identify a regression algorithm that performs well across many domains, thus setting a challenging baseline performance level for the multi-target methods that we study next. The algorithm that emerges as the best performer will be used to instantiate all problem transformation methods (ST, SST, ERC and RLC) in the rest of the experiments, facilitating a fair comparison between these methods.
We selected five well-known linear and nonlinear regression algorithms to couple ST with. In particular, we use: ridge regression (Hoerl and Kennard 1970) (ridge), regression trees (Breiman et al. 1984) (tree), L2-regularized support vector regression (Drucker et al. 1996) (svr), bagged (Breiman 1996) regression trees (bag) and stochastic gradient boosting (Friedman 2002) (sgb). In ridge and svr, the regularization parameter was tuned (separately for each target) by applying internal 5-fold cross-validation and choosing the value that leads to the lowest root mean squared error among \(\{10^r : r \in \{-4,\ldots ,2\}\}\). In bag we combine the predictions of 100 trees, while in sgb we boost trees with four terminal nodes using a small shrinkage rate (0.1) and a large number of iterations (100), as suggested by Friedman et al. (2001).
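As an aside, the bagging wrapper behind bag admits a very compact generic sketch (the class name is ours; the configuration above uses 100 regression trees as the base learner, but any fit/predict regressor works):

```python
import numpy as np

class BaggedRegressor:
    """Bootstrap aggregation of an arbitrary base regressor."""

    def __init__(self, base_factory, n_estimators=100, seed=0):
        self.base_factory = base_factory
        self.n_estimators = n_estimators
        self.seed = seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n = len(y)
        self.models_ = []
        for _ in range(self.n_estimators):
            # Bootstrap sample: draw n indices with replacement
            idx = rng.integers(0, n, size=n)
            self.models_.append(self.base_factory().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Average the predictions of the individual models
        return np.mean([m.predict(X) for m in self.models_], axis=0)
```

The averaging step is what reduces the variance of the high-variance tree base learner, which is why bag is competitive as a base regressor here.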
5.2 Evaluation of direct adaptations
In this subsection we focus on SST\(_{train}\) and ERC\(_{true}\), the versions of SST and ERC that use the same type of values for the meta-inputs as their MLC counterparts, and compare their performance to that of ST, MORF, RLC, TNR and Dirty, to see where these methods stand with respect to the state-of-the-art.
Interestingly, however, we see that according to both the per dataset and the per target analysis, SST\(_{train}\) and ERC\(_{true}\) are not significantly better than ST. This is an indication that the use of targets as meta-inputs, as implemented by these variants of SST and ERC, does not bring significant improvements. In fact, as can be seen from the detailed results, both SST\(_{train}\) and ERC\(_{true}\) perform worse than ST in several cases. This issue is studied in more detail in the following subsection.
Perhaps even more interestingly, none of the state-of-the-art multi-target methods participating in this comparison manages to significantly improve over ST. In fact, ST is ranked second after SST\(_{train}\) in the per dataset analysis and third after SST\(_{train}\) and RLC in the per target analysis, and is found significantly better than TNR and Dirty in both types of analyses. This exceptionally good performance of ST might seem a bit surprising given the results of previous studies (e.g. Kocev et al. 2007; Tsoumakas et al. 2014), but is in accordance with empirical and theoretical results for Binary Relevance (as discussed in Sect. 2) and is attributed to the use of a very strong base regressor.
5.3 Evaluation of meta-input generation variants
Figures 6 and 7 show the average ranks and the results of the Friedman and Nemenyi tests for the three variants of SST and ERC, respectively, according to the per dataset (left) and the per target (right) analysis. First, we see that for both SST and ERC, and in both types of analyses, the variants that use the actual values of the targets (true) obtain the worst average ranks and are found significantly worse than the variants that use estimates (train and cv). Since the variants of each method differ only with respect to the type of values that they use for the meta-inputs, it is clear that the discrepancy problem has a significant impact on the performance of both SST and ERC and that the use of estimates can ameliorate this problem.
5.3.1 Cautiousness analysis
So far, our analysis has focused on the average performance of the proposed methods (as quantified by their average ranks over datasets and targets), and we found that ERC\(_{train}\) and ERC\(_{cv}\) outperform the independent regressions baseline significantly. However, it is also important to examine the consistency of these improvements across different datasets and targets. In particular, we would like to study the degree of cautiousness that each method exhibits, i.e. how frequently, and to what extent, the predictions produced by each method are less accurate than the predictions of ST.
We see that in both the per dataset and the per target analysis, the true variants are the ones exhibiting the most dispersed distributions, with several cases of significant degradation compared to ST. The train and cv variants are clearly more cautious, with far fewer cases of degradation and even fewer cases of significant degradation. Looking at the distributions of \(R_t\), we could say that the cv variants appear a bit more cautious than the train variants, especially in the case of SST. We also see that the ERC variants are always more cautious than the corresponding SST variants. Clearly, ERC\(_{train}\) and ERC\(_{cv}\) are the two most cautious methods, since they obtain very similar or better performance than ST on all datasets and on about 75 % of the targets. Even on targets where the two methods obtain a lower performance than ST, the reduction is less than about 5 %. This characteristic, along with the fact that they obtain the largest average improvements over ST, makes ERC\(_{train}\) and ERC\(_{cv}\) highly appealing.
5.4 Comparison with the state-of-the-art
5.5 Running times
In this subsection we compare the running times of the studied methods. Experiments were run on a 64-bit CentOS Linux machine with 80 Intel Xeon E7-4860 processors running at 2.27 GHz and 1 TB of main memory. The detailed results per method and dataset are shown in Table 4. For ST, RLC, SST and ERC we report times with bag as the base regressor. The number shown in parentheses next to the name of each dataset corresponds to the maximum number of processor threads that were available during the experiment. ST, SST, ERC and RLC made use of multiple threads through Weka's multithreaded implementation of bagging; thus, running times are directly comparable for these methods. Multithreading was also partly used in TNR for the computation of the gradients. Dirty and MORF, on the other hand, always used a single processor thread.
Looking at the aggregated running times, we see that MORF is by far the most efficient method, followed by ST, SST\(_{true}\) and SST\(_{train}\), which have similar running times. On the other hand, Dirty is the least efficient method, followed by ERC\(_{cv}\). The running times of the rest of the methods lie in between. With respect to the SST and ERC variants, we see that their running times agree with the complexity analysis of Sect. 2.6. The total running time of SST\(_{true}\) is roughly twice the total running time of ST and similar to the total running time of SST\(_{train}\). SST\(_{cv}\) is the least efficient among the SST variants, with a total running time that is about 5 times larger than that of SST\(_{true}\) and SST\(_{train}\). With respect to the ERC variants, we see that ERC\(_{true}\) and ERC\(_{train}\) have similar total running times (which are also roughly similar to the total running time of SST\(_{cv}\)), while ERC\(_{cv}\) is about 7.5 times slower.
Running times (in s) using bag as base regressor
Dataset  ST  SST\(_{true}\)  SST\(_{train}\)  SST\(_{cv}\)  ERC\(_{true}\)  ERC\(_{train}\)  ERC\(_{cv}\)  MORF  RLC  TNR  Dirty 

edm (2)  4.3  3.3  3.2  14.5  2.8  2.8  13.7  6.3  78.3  16.6  281.4 
sf1 (3)  4.1  4.1  4.3  16.7  11.2  11.8  60.8  8.1  79.2  11.8  60.5 
sf2 (3)  7.5  11.5  11.4  47.6  33.2  36.3  262.0  15.4  238.7  17.3  56.9 
jura (3)  10.9  15.3  14.9  72.6  46.4  46.9  264.2  14.0  199.1  17.0  49.2 
wq (6)  43.3  104.0  141.7  540.0  561.3  607.0  4648.4  106.3  295.0  143.7  531.2 
enb (2)  10.4  15.4  15.9  76.0  16.7  16.4  74.9  15.7  313.9  35.8  223.0 
slump (3)  4.7  4.2  4.0  16.5  11.4  11.1  67.8  4.2  48.5  10.7  55.6 
andro (2)  8.2  10.6  9.0  47.9  49.5  47.2  360.3  3.1  72.5  445.3  1670.8 
osales (8)  317.6  612.2  560.8  2803.0  2835.0  2801.6  21568.6  43.2  1628.0  2769.1  80616.2 
scpf (3)  22.9  36.2  36.6  196.5  106.2  106.4  651.2  13.4  491.3  84.5  215.4 
atp1d (4)  155.6  356.1  345.7  1614.7  1496.0  1555.6  12169.2  23.3  2408.0  2212.8  57276.5 
atp7d (4)  145.8  318.6  313.9  1517.2  1336.9  1385.0  11158.2  18.4  2179.2  1536.0  53974.0 
oes97 (6)  199.5  454.8  422.2  1998.4  1977.3  1965.5  16609.1  32.7  1057.1  4581.3  124231.5 
oes10 (6)  286.4  613.1  535.7  2725.9  2691.0  2575.1  21446.6  39.4  1182.6  6093.8  157399.6 
rf1 (8)  379.6  868.2  890.8  4180.7  3885.7  4018.8  30050.1  351.1  4184.7  2952.1  34927.2 
rf2 (10)  3505.5  6539.0  6065.6  31562.8  28347.1  28104.0  234088.5  325.2  35429.1  18130.9  197482.1 
scm1d (10)  1398.5  2499.5  2398.3  11111.9  10465.6  11253.2  113500.0  199.7  6018.6  5647.1  105215.8 
scm20d (8)  302.6  613.9  621.8  2615.8  2733.1  2646.9  18549.2  140.8  1627.4  1120.2  4779.6 
Total  6807.6  13079.8  12395.8  61158.7  56606.3  57191.6  485542.9  1360.2  57531.3  45826.0  819046.5 
5.6 Discussion
Several interesting conclusions can be drawn from our experimental results. The experiments of Sects. 5.2 and 5.3 showed that, while the directly adapted versions of SST and ERC have comparable or better performance than state-of-the-art methods, a careful handling of the discrepancy problem is crucial for obtaining consistent improvements over the independent regressions baseline and the state-of-the-art. In particular, as the experiments of Sect. 5.3 revealed, the use of estimates for the meta-inputs during training should clearly be preferred over using the actual target values. With regard to using in-sample versus out-of-sample estimates, the results indicate that while out-of-sample estimates are preferable in SST, ERC performs almost equally well using either type of estimates for the meta-inputs. As discussed in Sect. 2.5, ERC's models are built on input spaces which are expanded with fewer meta-inputs compared to SST's models and, as a result, a smaller amount of error accumulation is risked at prediction time.
Another interesting conclusion is that when a strong base regressor is employed, the task of improving over the performance of ST becomes very difficult. As a result, multi-target methods which are considered state-of-the-art fail to improve on ST's performance and in some cases even perform significantly worse. This was particularly the case for the two multi-task methods, TNR and Dirty, which were consistently found to be the worst performers. One explanation for their poor performance is the fact that both methods are based on a linear formulation of the problem which, as revealed by the base regressor exploration experiments, is not the most suitable hypothesis representation for the studied datasets (ridge and svr performed worse than sgb and bag, which are based on a nonlinear hypothesis representation). Moreover, multi-task methods are expected to work better than single-task methods in cases where there is a lack of training data for some of the tasks (Alvarez et al. 2011). This is not the case for most of the datasets that we used in this study, as well as many recent multi-target prediction problems. In fact, the two datasets where TNR and Dirty perform better than ST (sf1 and slump) are among those with the fewest training examples.
With respect to MORF, although it was found significantly more competitive than TNR and Dirty, it also performed worse than ST on average. Nevertheless, we should point out that MORF achieved the best accuracy on three datasets (edm, wq, andro) and is the most computationally efficient of the compared methods. Similarly to TNR and Dirty, MORF has the disadvantage of having a fixed hypothesis representation (trees), as opposed to the proposed methods, which have the ability to adapt better to a specific domain by being instantiated with a more suitable base regressor. This advantage of the proposed methods is shared with RLC which, however, was not found to be as accurate.
Overall, our experimental results demonstrate that, of the methods proposed in this paper, ERC\(_{train}\) and ERC\(_{cv}\) and, to a lesser extent, SST\(_{train}\) and SST\(_{cv}\) provide increased accuracy over performing a separate regression per target. In addition, ERC\(_{train}\) and ERC\(_{cv}\) are significantly more accurate than TNR, Dirty, MORF and RLC (in the per target analysis). If caution is a further concern, then again ERC\(_{train}\) and ERC\(_{cv}\) compare favorably to the rest of the methods. With respect to the true variants of SST and ERC, we should stress that, despite having a worse average performance, they are worth considering by a practitioner, as they obtain the highest performance on datasets (e.g., sf1 and scpf) where the discrepancy problem is not predominant.
6 Conclusion
Motivated by the similarity between the tasks of multi-label classification and multi-target regression, this paper introduced SST and ERC, two new multi-target regression techniques derived through a simple adaptation of two well-known multi-label classification methods. Both methods are based on the idea of treating other prediction targets as additional input variables, and represent a conceptually simple way of exploiting target dependencies in order to improve prediction accuracy.
A comprehensive experimental analysis that includes a multitude of real-world datasets and four existing state-of-the-art methods reveals that, despite being competitive with the state-of-the-art, the directly adapted versions of SST and ERC do not manage to obtain significant improvements over the independent regressions baseline, and in some cases even degrade its performance. This degradation is attributed to an underestimation (in the original formulations of the methods) of the impact of the discrepancy of the values used for the additional input variables between training and prediction. Confirming our hypothesis, extended versions of the methods that mitigate the discrepancy by using out-of-sample estimates of the targets during training manage to obtain consistent and significant improvements over the baseline approach and are found significantly better than four state-of-the-art methods. The fact that these impressive results were obtained by applying relatively simple adaptations of existing multi-label classification methods highlights the importance of exploiting relationships between similar machine learning tasks.
Concluding, let us point to some directions for future work. Although mitigating the discrepancy problem leads to significant performance improvements, a different amount of mitigation is ideal for each target. As a result, the use of in-sample estimates (or even the actual target values) gives better results for some targets. Thus, a promising direction for future work would be a deeper theoretical analysis of the different variants and the identification of problem characteristics that favor the use of one variant over another. Finally, we should point out that SST and ERC can be viewed as strategies for leveraging variables that are available in the training phase but not in the prediction phase. This type of scenario is very common, for instance, in time series prediction. We believe that adapting SST and ERC to this type of problem is another valuable opportunity for future work.
Footnotes
 1.
See NIPS’11 workshop on relations among machine learning problems at http://rml.anu.edu.au/.
 2.
\(\mathscr {X}=R^d\) is used only for the sake of brevity. The domain of the input variables can also be discrete.
 3.
Note, however, that this analysis concerns a version of stacking that does not include the original input variables in the input space of the second stage models.
 4.
This is in contrast with traditional deep learning, where high-level feature representations are typically learned from the data in an unsupervised way.
 5.
An explicit feature selection could alternatively be applied as a means of variance reduction.
 6.
A nice categorization of regularization-based multi-task methods can be found in Zhou et al. (2012).
 7.
Actually, an early version of this work (Spyromitros-Xioufis et al. 2012) is the first to consider the discrepancy problem in the context of input space expansion methods.
 8.
 9.
 10.
 11.
 12.
The reliability of the estimates obtained using \(k=2\) and \(k=5\) has been validated by checking the stability of the rankings of the methods when repeating the cross-validation experiment with different random seeds.
 13.
 14.
The detailed results per dataset and target can be found in Appendix “Multi-target regression results”.
References
 Aho, T., Zenko, B., Dzeroski, S., & Elomaa, T. (2012). Multitarget regression with rule ensembles. Journal of Machine Learning Research, 13, 2367–2407.MathSciNetzbMATHGoogle Scholar
 Álvarez, M. A., & Lawrence, N. D. (2011). Computationally efficient convolved multiple output gaussian processes. Journal of Machine Learning Research, 12, 1459–1500.MathSciNetzbMATHGoogle Scholar
 Alvarez, M. A., Rosasco, L., & Lawrence, N. D. (2011). Kernels for vectorvalued functions: A review. arXiv preprint arXiv:1106.6251.
 Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.MathSciNetzbMATHGoogle Scholar
 Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multitask feature learning. In Advances in neural information processing systems 19, Proceedings of the twentieth annual conference on neural information processing systems, Vancouver, British Columbia, Canada, December 4–7, 2006, pp. 41–48.Google Scholar
 Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multitask feature learning. Machine Learning, 73(3), 243–272.CrossRefGoogle Scholar
 Balasubramanian, K., & Lebanon, G. (2012). The landmark selection method for multiple output prediction. In Proceedings of the 29th international conference on machine learning, ICML 2012, Edinburgh, Scotland, UK, June 26–July 1, 2012.Google Scholar
 Baxter, J. (1995). Learning internal representations. In Proceedings of the eigth annual conference on computational learning theory, COLT 1995, Santa Cruz, California, USA, July 5–8, 1995, pp. 311–320.Google Scholar
 Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Topdown induction of clustering trees. In Proceedings of the fifteenth international conference on machine learning (ICML 1998), Madison, Wisconsin, USA, July 24–27, 1998, pp. 55–63.Google Scholar
 Blockeel, H., Dzeroski, S., & Grbovic, J. (1999). Simultaneous prediction of mulriple chemical parameters of river water quality with TILDE. In Proceedings of third European conference in principles of data mining and knowledge discovery PKDD’99, Prague, Czech Republic, September 15–18, 1999, pp. 32–40.Google Scholar
 Bonilla, E. V., Chai, K. M. A., & Williams, C. K. I. (2007). Multitask gaussian process prediction. In Advances in neural information processing systems 20, proceedings of the twentyfirst annual conference on neural information processing systems, Vancouver, British Columbia, Canada, December 3–6, 2007, pp. 153–160.Google Scholar
 Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.MathSciNetzbMATHGoogle Scholar
 Breiman, L., & Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1), 3–54.MathSciNetCrossRefzbMATHGoogle Scholar
 Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.zbMATHGoogle Scholar
Caruana, R. (1994). Learning many related tasks at the same time with backpropagation. In Advances in neural information processing systems 7, [NIPS Conference, Denver, Colorado, USA, 1994], pp. 657–664.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. In Proceedings of the 26th annual international conference on machine learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, pp. 137–144.
Chen, J., Liu, J., & Ye, J. (2010a). Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, July 25–28, 2010, pp. 1179–1188.
Chen, X., Kim, S., Lin, Q., Carbonell, J. G., & Xing, E. P. (2010b). Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:1005.3579.
Cheng, W., & Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2–3), 211–225.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the twenty-fifth international conference on machine learning (ICML 2008), Helsinki, Finland, June 5–9, 2008, pp. 160–167.
Dembczynski, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th international conference on machine learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 279–286.
Dembczynski, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1–2), 5–45.
Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. J., & Vapnik, V. (1996). Support vector regression machines. In Advances in neural information processing systems 9, NIPS, Denver, CO, USA, December 2–5, 1996, pp. 155–161.
Dzeroski, S., Demsar, D., & Grbovic, J. (2000). Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13(1), 7–17.
Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, Washington, USA, August 22–25, 2004, pp. 109–117.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. Springer series in statistics. Berlin: Springer.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.
Ghosn, J., & Bengio, Y. (1996). Multi-task learning for stock selection. In Advances in neural information processing systems 9, NIPS, Denver, CO, USA, December 2–5, 1996, pp. 946–952.
Godbole, S., & Sarawagi, S. (2004). Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD 2004, Sydney, Australia, May 26–28, 2004, pp. 22–30.
Goovaerts, P. (1997). Geostatistics for natural resources evaluation. Oxford: Oxford University Press.
Groves, W., & Gini, M. L. (2011). Improving prediction in TAC SCM by integrating multivariate and temporal aspects via PLS regression. In Agent-mediated electronic commerce. Designing trading strategies and mechanisms for electronic markets, AMEC 2011, Taipei, Taiwan, May 2, 2011, and TADA 2011, Barcelona, Spain, July 17, 2011, Revised Selected Papers, pp. 28–43.
Groves, W., & Gini, M. L. (2015). On optimizing airline ticket purchase timing. ACM Transactions on Intelligent Systems and Technology (TIST), 7(1), 3.
Hatzikos, E. V., Tsoumakas, G., Tzanis, G., Bassiliades, N., & Vlahavas, I. P. (2008). An empirical study on sea water quality prediction. Knowledge-Based Systems, 21(6), 471–478.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In Advances in neural information processing systems 22: 23rd annual conference on neural information processing systems 2009. Proceedings of a meeting held 7–10 December 2009, Vancouver, British Columbia, Canada, pp. 772–780.
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods, 9(6), 571–595.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2), 248–264.
Izenman, A. J. (2008). Modern multivariate statistical techniques: Regression, classification, and manifold learning. New York: Springer.
Jacob, L., Bach, F. R., & Vert, J. (2008). Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems 21. Proceedings of the twenty-second annual conference on neural information processing systems, Vancouver, British Columbia, Canada, December 8–11, 2008, pp. 745–752.
Jalali, A., Ravikumar, P. D., Sanghavi, S., & Ruan, C. (2010). A dirty model for multi-task learning. In Advances in neural information processing systems 23: 24th annual conference on neural information processing systems 2010. Proceedings of a meeting held 6–9 December 2010, Vancouver, British Columbia, Canada, pp. 964–972.
Jalali, A., Ravikumar, P. D., & Sanghavi, S. (2013). A dirty model for multiple sparse regression. IEEE Transactions on Information Theory, 59(12), 7947–7968.
Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. In Proceedings of the 26th annual international conference on machine learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, pp. 457–464.
 Kaggle. (2012). Kaggle competition: Online product sales. https://www.kaggle.com/c/onlinesales
 Kaggle. (2013). Kaggle competition: See click predict fix. https://www.kaggle.com/c/seeclickpredictfix
Karalic, A., & Bratko, I. (1997). First order regression. Machine Learning, 26(2–3), 147–176.
Kim, S., & Xing, E. P. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th international conference on machine learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 543–550.
Kocev, D., Vens, C., Struyf, J., & Dzeroski, S. (2007). Ensembles of multi-objective decision trees. In Proceedings of the 18th European conference on machine learning, ECML 2007, Warsaw, Poland, September 17–21, 2007, pp. 624–631.
Kocev, D., Džeroski, S., White, M. D., Newell, G. R., & Griffioen, P. (2009). Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling, 220(8), 1159–1168.
Kumar, A., Vembu, S., Menon, A. K., & Elkan, C. (2012). Learning and inference in probabilistic classifier chains with beam search. In Proceedings of the European conference on machine learning and knowledge discovery in databases, Part I, ECML PKDD 2012, Bristol, UK, September 24–28, 2012, pp. 665–680.
 Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., & Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence, 1(4), 303–313.
Montañés, E., Quevedo, J. R., & del Coz, J. J. (2011). Aggregating independent and dependent models to learn multi-label classifiers. In Proceedings of the European conference on machine learning and knowledge discovery in databases, Part II, ECML PKDD 2011, Athens, Greece, September 5–9, 2011, pp. 484–500.
Munson, M. A., & Caruana, R. (2009). On feature selection, bias-variance, and bagging. In Proceedings of the European conference on machine learning and knowledge discovery in databases, Part II, ECML PKDD 2009, Bled, Slovenia, September 7–11, 2009, pp. 144–159.
Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.
Pardoe, D., & Stone, P. (2008). The 2007 TAC SCM prediction challenge. In AAAI 2008 workshop on trading agent design and analysis.
Pratt, L. Y. (1992). Discriminability-based transfer between neural networks. In Advances in neural information processing systems 5, [NIPS Conference, Denver, Colorado, USA, November 30–December 3, 1992], pp. 204–211.
Read, J., & Hollmén, J. (2014). A deep interpretation of classifier chains. In Proceedings of the 13th international symposium on advances in intelligent data analysis XIII, IDA 2014, Leuven, Belgium, October 30–November 1, 2014, pp. 251–262.
Read, J., & Hollmén, J. (2015). Multi-label classification using labels as hidden nodes. arXiv preprint arXiv:1503.09022.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.
Read, J., Martino, L., & Luengo, D. (2014). Efficient Monte Carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3), 1535–1546.
Senge, R., del Coz, J. J., & Hüllermeier, E. (2013a). On the problem of error propagation in classifier chains for multi-label classification. In Proceedings of the 36th annual conference of the German Classification Society.
Senge, R., del Coz, J. J., & Hüllermeier, E. (2013b). Rectifying classifier chains for multi-label classification. In LWA 2013: Lernen, Wissen & Adaptivität, workshop proceedings, Bamberg, 7–9 October 2013, pp. 151–158.
Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., & Vlahavas, I. (2012). Multi-label classification methods for multi-target regression. ArXiv e-prints arXiv:1211.6581v1.
Su, J., & Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the twenty-first national conference on artificial intelligence and the eighteenth innovative applications of artificial intelligence conference, July 16–20, 2006, Boston, Massachusetts, USA, pp. 500–505.
Tai, F., & Lin, H. (2012). Multilabel classification with principal label space transformation. Neural Computation, 24(9), 2508–2542.
Teh, Y. W., Seeger, M., & Jordan, M. I. (2005). Semiparametric latent factor models. In Proceedings of the tenth international workshop on artificial intelligence and statistics, AISTATS 2005, Bridgetown, Barbados, January 6–8, 2005.
Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560–567.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In O. Maimon & L. Rokach (Eds.), Data mining and knowledge discovery handbook (2nd ed., pp. 667–685). Boston, MA: Springer.
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., & Vlahavas, I. (2011). Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 12, 2411–2414.
Tsoumakas, G., Spyromitros-Xioufis, E., Vrekou, A., & Vlahavas, I. P. (2014). Multi-target regression via random linear target combinations. In Proceedings of the European conference on machine learning and knowledge discovery in databases, Part III, ECML PKDD 2014, Nancy, France, September 15–19, 2014, pp. 225–240.
Van Der Merwe, A., & Zidek, J. (1980). Multivariate regression analysis and canonical variates. Canadian Journal of Statistics, 8(1), 27–39.
Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., & Vapnik, V. (2002). Kernel dependency estimation. In Advances in neural information processing systems 15 [Neural Information Processing Systems, NIPS 2002, December 9–14, 2002, Vancouver, British Columbia, Canada], pp. 873–880.
Wold, H. (1985). Partial least squares. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 6, pp. 581–591). New York: Wiley.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wolpert, D. H. (2002). The supervised learning no-free-lunch theorems. In R. Roy, M. Köppen, S. Ovaska, T. Furuhashi, & F. Hoffmann (Eds.), Soft computing and industry: Recent applications (pp. 25–42). London: Springer.
Yeh, I. C. (2007). Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites, 29(6), 474–480.
Zhang, M., & Zhou, Z. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.
Zhang, Y., & Schneider, J. G. (2011). Multi-label output codes using canonical correlation analysis. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, AISTATS 2011, Fort Lauderdale, USA, April 11–13, 2011, pp. 873–882.
Zhang, Y., & Schneider, J. G. (2012). Maximum margin output coding. In Proceedings of the 29th international conference on machine learning, ICML 2012, Edinburgh, Scotland, UK, June 26–July 1, 2012.
Zhou, J., Chen, J., & Ye, J. (2011a). Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems 24: 25th annual conference on neural information processing systems 2011. Proceedings of a meeting held 12–14 December 2011, Granada, Spain, pp. 702–710.
Zhou, J., Chen, J., & Ye, J. (2011b). Malsar: Multi-task learning via structural regularization. Tempe: Arizona State University.
Zhou, J., Chen, J., & Ye, J. (2012). Multi-task learning: Theory, algorithms, and applications. https://www.siam.org/meetings/sdm12/zhou_chen_ye.pdf