Imbalanced regression using regressor-classifier ensembles

We present an extension to the federated ensemble regression using classification algorithm, an ensemble learning algorithm for regression problems which leverages the distribution of the samples in a learning set to achieve improved performance. We evaluated the extension using four classifiers, four regressors, two discretizers, and 119 responses from a wide variety of datasets in different domains. Additionally, we compared our algorithm to two resampling methods aimed at addressing imbalanced datasets. Our results show that the proposed extension is highly unlikely to perform worse than the base case, and on average it outperforms the two resampling methods with statistically significant differences in performance.

Introduction

In our previous work (Orhobor et al., 2020), we argued that, just as the class distribution matters in classification, the distribution of a response one is interested in modelling matters for regression problems. The idea is that for a given response, certain parts of the distribution might be underrepresented, making it difficult for a model built using this data to make accurate predictions for unseen samples falling in the underrepresented regions. To this end, we proposed an ensemble learning approach which takes the underlying distribution into account.
Assume we have some learning data, a discretizer of choice, and a bin size of n. We build a regressor for the base case, where n = 1. For all other values of n where n > 1, and up to some maximum bin size, we split the learning data into n bins using the discretizer. We then build a regressor for each bin in each of the n independent subsets, and a classifier which predicts the probability that a new sample belongs in each of the bins. This classifier is used to combine the predictions from the independent regressors given an unseen sample. Given some unseen data, for each bin size n the regression models produce n predictions, one for each bin. These are combined using the bin probabilities provided by the corresponding classifier. Finally, these combined predictions are aggregated using weights that are determined on the learning set. The intuition is that this process should leverage the predictive accuracy of the base and binned models, significantly reducing the possibility of performing worse than the base case, irrespective of the choice of discretizer, which improves upon our previous work (Orhobor et al., 2020). The proposed algorithm is described in Fig. 1 and is formalised in Sect. 3. We demonstrate empirically that it is highly unlikely that the proposed algorithm will perform worse than the base case.
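As a concrete illustration of this combination step for a single bin size, the per-bin regressor predictions can be averaged using the classifier's normalised bin probabilities. This is a minimal Python sketch, not the authors' implementation (which is in R; see Sect. 4):

```python
import numpy as np

def combine_bin_predictions(reg_preds, bin_probs):
    """Combine per-bin regressor predictions using classifier bin probabilities.

    reg_preds: (n_samples, n_bins) matrix; column j holds predictions from
               the regressor trained on bin j.
    bin_probs: (n_samples, n_bins) matrix; row i holds the classifier's
               (possibly unnormalised) probability that sample i belongs
               to each bin.
    Returns a vector of n_samples combined predictions.
    """
    reg_preds = np.asarray(reg_preds, dtype=float)
    bin_probs = np.asarray(bin_probs, dtype=float)
    # Normalise each row of probabilities so it sums to 1, then take the
    # probability-weighted average of the per-bin predictions.
    weights = bin_probs / bin_probs.sum(axis=1, keepdims=True)
    return (reg_preds * weights).sum(axis=1)
```

For a sample whose two per-bin regressors predict 10 and 20, equal bin probabilities yield 15, while probabilities favouring the first bin pull the combined prediction towards 10.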
In the next section, we discuss the different techniques comprising the different parts of the proposed extension and the algorithm in general. In Sect. 3 we describe the new algorithm in detail. The evaluation setup is described in Sect. 4 and the results are presented in Sect. 5. We discuss possible improvements to the proposed algorithm and conclude in Sect. 6.

Background
Ensemble learning, that is, the aggregation of the predictions made by multiple learning algorithms, plays a crucial role in both the original algorithm and the extension we propose. Specifically, we use ensembles in (i) aggregating the predictions made within a given bin size that is not the base case (bin size = 1), and (ii) learning weights to combine the predictions made for each bin size. In the former, there is one classifier and i regressors when the bin size i is greater than 1. For a new sample, the regressors are used to predict the response while the classifier is used to determine how much we can trust those predictions and to weight them accordingly (see Fig. 1). This differs from a more traditional ensemble learning procedure like stacking in that we do not generate meta-features on which a meta-learner is built, which is then used to aggregate the predictions. This scheme is more closely related to the concept of local classifiers in hierarchical classification (Silla & Freitas, 2011). Furthermore, it can be thought of as a form of dynamic weighting, where instead of having a separate meta-feature generation and aggregation step (Mendes-Moreira et al., 2012), dynamic weighting is implicit in the classifier-regressor aggregation paradigm.

Fig. 1 Representation of the proposed approach when the maximum bin size is 3. Note that the weights (w_1, w_2, w_3) are learned using cross-validation on the learning set using stacking. R_i^j represents the regressor for the jth bin when the bin size is i. p_i^j is the probability predicted by the classifier C_i that a new sample belongs to the jth bin when the bin size is i. In the diagram, R_1^1 is the base case regressor and does not have a classifier.
We use a traditional stacking procedure to learn weights to combine the predictions for each bin size (from the base case, where the bin size = 1, to the determined maximum bin size). We generate a meta-feature for each bin size using cross-validation and determine the aggregating weights using a meta-learner. During the testing phase, these weights are applied uniformly to all samples in a test set, making it a static weighting procedure, in contrast to dynamic weighting, where a unique weight is learned for each individual test sample. This approach was chosen to reduce the overall computational complexity of the extended algorithm. However, it is possible that a dynamic weighting paradigm might improve predictive accuracy. We do not prune the meta-feature space, which is one of the main processes in a stacking procedure (Mendes-Moreira et al., 2012). This is because the meta-feature space is not expected to be large enough to require pruning.
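One way to realise such a static convex weighting (the convex linear regression of Sect. 4, which constrains the weights to be non-negative and sum to 1) is projected gradient descent onto the probability simplex. This is a sketch under those assumptions, not the authors' R code; the projection routine follows the standard simplex-projection algorithm:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (non-negative entries summing to 1)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)

def convex_stacking_weights(M, y, lr=0.01, iters=5000):
    """Least-squares stacking weights constrained to the simplex, fitted by
    projected gradient descent. M holds one meta-feature per column (the
    cross-validated predictions for one bin size); y is the true response."""
    M, y = np.asarray(M, float), np.asarray(y, float)
    w = np.full(M.shape[1], 1.0 / M.shape[1])  # start from uniform weights
    for _ in range(iters):
        grad = 2 * M.T @ (M @ w - y) / len(y)  # gradient of mean squared error
        w = project_to_simplex(w - lr * grad)  # step, then re-enter the simplex
    return w
```

If one meta-feature matches the response exactly and another is uninformative, the learned weight vector concentrates on the informative column while remaining a valid convex combination.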
Response discretization is a fundamental part of our algorithm, for which supervised and unsupervised methods have previously been proposed (Dash et al., 2011; Dougherty et al., 1995). Although both supervised and unsupervised approaches are compatible with what we propose, we considered only unsupervised techniques in our evaluation to reduce the computational cost of our experiments. However, one can conjecture that if supervised methods produce cleaner delineations between the different bins for a given response, they could improve overall predictive accuracy, as the class data on which the classifiers are built would be less noisy. There are other methods which also perform regression by means of classification in an ensemble setting; the work by Ahmad et al. (2018), Halawani et al. (2012) and Gonzalez et al. (2012), for example, is most closely related to what we propose. Ahmad et al. focus on extreme randomized discretization, and use the minimum or maximum of the training data points and the bin boundary to estimate the prediction for a new sample, in contrast to our multi-bin and classifier-regressor paradigm. The approach by Gonzalez et al. focuses on multi-variate spatio-temporal data and differs from what we propose in two important ways: (a) the authors are interested in classifying bands of attributes prior to performing regression, and (b) the final prediction for an unseen sample is obtained by taking the median of the predicted values made by the best models, which are selected using cross-validation.
An alternative approach to treating imbalanced data is to redress the ratio of "rare" and "common" values in a dataset by strategic resampling. Inspired by the similar problem of imbalanced classes in classification, Torgo et al. (2015) proposed two resampling strategies for continuous outcomes in regression analysis. Both methods rely on the notion of relevance (Torgo & Ribeiro, 2007), a function mapping the output variable domain to [0, 1]. Relevance is inversely proportional to the probability density function of the output variable, with rarer observations having higher relevance and more common observations having lower relevance. Choosing a relevance threshold allows one to divide the dataset into two subsets: "common" observations and "rare" observations. Torgo et al. propose two schemes: (i) undersampling of common observations (undersampler) and (ii) undersampling of common observations combined with oversampling of rare observations (SmoteR). Both schemes attempt to create a more balanced dataset, allowing for a better representation of values in low probability density areas. Undersampling consists of randomly selecting a subset of common observations of a pre-specified size. In oversampling, additional "synthetic" rare observations are created by interpolation between existing rare cases and one of their k nearest neighbours. Any regressor can then be applied to the new, re-balanced dataset.
Both undersampling and SmoteR are honed towards datasets with rare values concentrated in the tail(s) of the distribution. For such distributions, the relevance function can be easily calculated and has one or two "relevance bumps" (areas of high relevance, local maxima). However, for more sophisticated distributions (e.g. bimodal), the user has to provide bespoke parameters for the construction of the relevance function. This is in contrast to our method, where the binning of data is automated via a chosen discretization method. We note that these resampling methods can also be used as part of our proposed framework (see Sect. 6).
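The SmoteR oversampling step can be sketched as follows. This is a simplified Python illustration under stated assumptions: it interpolates both features and target with a single random factor, whereas Torgo et al.'s SmoteR weights the interpolated target by distance, and the real implementation lives in the uba R package:

```python
import numpy as np

def smoter_oversample(X, y, k=5, n_synthetic=100, seed=0):
    """Generate synthetic 'rare' observations by interpolating between an
    existing rare case and one of its k nearest rare neighbours
    (a simplified sketch of the SmoteR oversampling step).

    X, y are assumed to contain only the rare cases identified by the
    relevance threshold."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, float), np.asarray(y, float)
    synth_X, synth_y = [], []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # k nearest neighbours of case i among the rare cases (excluding itself)
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        t = rng.random()  # interpolation factor in [0, 1]
        synth_X.append(X[i] + t * (X[j] - X[i]))
        synth_y.append(y[i] + t * (y[j] - y[i]))
    return np.array(synth_X), np.array(synth_y)
```

Each synthetic case lies on the segment between a rare case and one of its neighbours, in both feature and target space.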

Algorithm
The original algorithm is described formally in Algorithm 1 and the extension in Algorithm 2. However, we also provide an informal description of the extended algorithm below and a descriptive figure (see Fig. 2). Like the original algorithm, the proposed extension is split into a build and a prediction phase. In the build phase, the required inputs are a learning set with a continuous response one is interested in modelling, a specified maximum bin size, an unsupervised discretizer, and classification and regression algorithms of one's choosing. The following steps are performed:

1. Discretize the learning set into n bins, where n runs from 1 to the maximum bin size. Note that n = 1 is the base case, so no discretization is performed. For example, for a maximum bin size of 3, we end up with three response sets: set 1, the base case; set 2, the learning set discretized into 2 bins; and set 3, the learning set discretized into 3 bins (see Fig. 1).
2. Build a regressor for the base case. For response sets 2 up to the specified maximum bin size, build a single classifier which predicts whether a new sample belongs to a certain bin, and build a regressor for each bin. Therefore, a single classifier and two regressors are built for response set 2, a single classifier and three regressors are built for response set 3, and so on.
3. Learn weights that will be used for combining the predictions from the different response sets using the standard stacking procedure. That is, for each response set, generate a meta-feature using the previous two steps and learn combining weights that will be applied to the predictions made for unseen samples. Using these meta-features, also identify the single best performing bin size, which may be the base case.
It is worth noting that, depending on the choice of discretizer and response, one can specify a maximum bin size that is too large, such that the number of discretized bins fails to reach the specified size. In this case, the algorithm can be implemented so that it automatically detects the maximum possible bin size before any models are built. This can be achieved by splitting the learning set response into random cross-validation folds, then holding out each fold and attempting to discretize the remaining values at the specified maximum bin size. The minimum discretization size achieved during this cross-validation procedure is chosen as the new maximum. This is how we implemented the algorithm in our evaluation. Given a new sample, the following is performed in the prediction phase:

1. Make predictions for the base case using the base case regressor. Then for response sets 2 up to the maximum bin size, make predictions using the regressors and aggregate them using the paired classifier.
2. After the previous step, one should be left with a number of predictions equal to the maximum bin size. These predictions are then aggregated using the weights from step (3) of the build phase. The single best performing bin size can also be identified using the results from step (3) of the build phase.
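The bin-size validation just described can be sketched as follows. This Python illustration assumes an equal-frequency discretizer for concreteness (the authors' implementation is in R, and any discretizer could be substituted):

```python
import numpy as np

def validate_max_bin_size(y, s, k=5, seed=0):
    """Check the requested maximum bin size s against the response: hold out
    each of k random folds, discretize the remaining values (equal-frequency
    here), and return the minimum bin count achieved across folds."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    achieved = []
    for fold in folds:
        rest = np.delete(y, fold)
        # Equal-frequency bin edges; ties can collapse edges, yielding < s bins
        edges = np.unique(np.quantile(rest, np.linspace(0, 1, s + 1)))
        achieved.append(len(edges) - 1)
    return min(achieved)
```

A response with many distinct values supports the requested size, whereas a heavily tied response collapses to fewer bins, and the minimum across folds becomes the new maximum.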

Considerations
A major consideration is that of practicality from a model building computational expense perspective. Imagine that one has a response they are interested in predicting, and let s be the determined maximum bin size. This means that s − 1 classifiers and ∑_{l=1}^{s} l regressors will be built, making a total of (s − 1) + ∑_{l=1}^{s} l ∼ s² models, excluding any additional models built to determine the best hyperparameters for the chosen learning algorithms. Given the k cross-validation folds required to learn the combining weights, the total number of models built during the learning phase is k((s − 1) + ∑_{l=1}^{s} l) + (s − 1) + ∑_{l=1}^{s} l ∼ ks², or O(s²) models, which is more expensive than the 2s, or O(s), models built by the original algorithm. The obvious benefits of this additional computational cost are that (i) one need not know the best performing bin size a priori, and can specify an arbitrarily large value for the maximum bin size, which the algorithm will automatically scale down to the most viable, and (ii) between the weighted prediction and the identified best bin, one is guaranteed to never do significantly worse than the base case.
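These model counts are easy to tabulate with a small (hypothetical) helper, where s and k are as in the text:

```python
def model_counts(s, k):
    """Models built by the extension: (s - 1) classifiers plus
    sum_{l=1}^{s} l regressors, once for the final fit and once
    per cross-validation fold used to learn the combining weights."""
    per_fit = (s - 1) + s * (s + 1) // 2  # classifiers + regressors per fit
    total = k * per_fit + per_fit         # k CV fits plus the final fit
    return per_fit, total
```

With the evaluation settings of Sect. 4 (s = 10, k = 5), `model_counts(10, 5)` gives 64 models per fit and 384 in total, versus 2s = 20 for the original algorithm.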
Implicit in this implementation is the assumption that one will use upsampling to resolve class imbalance issues. Empirical evidence from our previous work indicates that this is necessary: we found that upsampling outperformed all other methods for handling class imbalance in this setting (Orhobor et al., 2020). However, replacing it with one's preferred method is trivial. The only other major decision left to the discretion of the user is the choice of discretizer, which can be either supervised or unsupervised.
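The simple upsampling assumed here can be sketched as follows (an illustrative Python version; any class-balancing method could be swapped in, as noted above):

```python
import numpy as np

def upsample_minority(X, y, seed=0):
    """Randomly resample each minority class (with replacement) up to the
    size of the largest class, so all bin classes are equally represented."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        # Draw extra copies of minority-class samples until the class
        # reaches the majority-class size.
        extra = rng.choice(members, size=target - len(members), replace=True)
        idx.extend(members)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

After upsampling, every bin class contributes the same number of samples to the classifier's training set.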

Algorithm 1 Federated Ensemble Learning using Classification. Adapted from Orhobor et al. (2020).

Input: Training set matrix L ∈ ℝ^(m×b), response vector y, bin size c, and test set matrix T ∈ ℝ^(n×b)
Output: Test set predictions

Training:
1: Split y into c bins using a discretization technique of choice, producing L = (L_1, ..., L_c) and Y = (y_1, ..., y_c)
2: for each bin c in L and Y do
3:   Build a regressor R_c using L_c and y_c
4:   Build a classifier C_c using L_c as the positive samples and L − L_c as the negative samples. Note: class balancing may be required
5: end for

Prediction:
6: for each regressor-classifier pair R_c and C_c do
7:   Predict the response for T using R_c
8:   Predict the probability that the samples in T belong in bin c using C_c
9: end for
The process above generates predicted response and probability matrices R, P ∈ ℝ^(n×c)
10: v_i = ∑_{j=1}^{c} p_{i,j} for each test sample i
11: Create weight matrix W ∈ ℝ^(n×c) by dividing all elements in each row i of P by v_i
12: Create weighted response matrix R^w ∈ ℝ^(n×c) by performing the element-wise multiplication of R and W
13: The final prediction for test sample i is t_i = ∑_{j=1}^{c} r^w_{i,j}
14: return the test set predictions t
Algorithm 2 The proposed extension to Federated Ensemble Learning using Classification

Input: Learning set matrix L ∈ ℝ^(m×d), response vector y, maximum bin size s, k cross-validation folds, and test set matrix T ∈ ℝ^(n×d)
Output: Weighted test set predictions and best performing individual bin size predictions

Validate maximum bin size:
1: Randomly split y into k folds and discretize each y_k with the bin size set to s. For some discretizers and y_k, the maximum number of possible bins is less than s; in that case, set s to the minimum number of bins obtained across all y_k

Learn combining weights:
2: for each of the 1, ..., k folds do
3:   Split L into a train (L_T) and validation (L_V) set, where V represents the samples in fold k, and T represents the remaining samples in L
4:   Split y_T into i response sets for all i = 1, ..., s, where i = 1 is the base case and no discretization is performed, but for 2 ≤ i ≤ s discretize y_T into i bins
5:   for i in 1, ..., s response sets do
6:     if i = 1 (base case) then
7:       Build a regressor R_i for the base case using L_T
8:       Make predictions for L_V using R_i, producing a single prediction vector r_i
9:     else
10:      Build a classifier C_i using L_T, where each bin b is a class, upsampling the minority classes when necessary
11:      Build a regressor R_i^b for each bin b
12-16:   Make predictions for L_V using each R_i^b and combine them using the bin probabilities from C_i (as in steps 10-13 of Algorithm 1), producing the prediction vector r_i for bin size i
17: Assemble the prediction vectors r_i from all folds into the meta-feature matrix M and learn the combining weights w

Make final predictions:
18-24: Close the loops above; then, for each response set i = 1, ..., s on the test set:
25: if i = 1 (base case), predict the response r_i for T using R_{i=1}
26: else
27:   Predict the response for T using all R_i^b, where b = 1, ..., i. This produces a response matrix R ∈ ℝ^(n×i)
28:   Predict the probability that the samples in T belong to the b bins. This produces a classification matrix C ∈ ℝ^(n×i)
29-31: Combine R and C into the final response for bin size i, as in steps 10-13 of Algorithm 1. The process above generates a predicted response matrix P ∈ ℝ^(n×s)
32: The final weighted prediction W is obtained by applying the learned weights w from step 17 to P. The prediction S, which should be treated as the best performing individual response set, is trivially identified by comparing each meta-feature in M (step 17) that maps to a particular bin size against the true values in the learning set
33: return W and S

Evaluation setup

Datasets
We evaluated the proposed algorithm and the two resampling methods using 119 responses from a variety of datasets. Specifically, we considered the 20 genes from the LINCS Phase II dataset (Koleti & Terryn, 2017) which we used in the evaluation of the original algorithm (Algorithm 1), 46 traits from the yeast dataset used by Grinberg et al. (2020), 24 drug targets from quantitative structure-activity relationship problems (Olier et al., 2018), 14 responses from the OpenML AutoML Regression Benchmark (Bischl et al., 2017), and 15 responses from work by Branco et al. (2019). We refer to these dataset collections as gene expression, yeast, QSAR, OpenML, and Branco, respectively, during the discussion of our evaluation in Sect. 5. The exact versions of these datasets used in our experiments are available for download from http://dx.doi.org/10.17632/mpvwnhv4vb.2. We used a 70-30% learning and test set split in our evaluation. See the Tables in "Appendix A" for the distribution of imbalance in the considered responses, as described by Branco et al. (2019).

Learning setup
We considered the unsupervised frequency interval and clustering (k-means) discretizers (Dougherty et al., 1995). The specified maximum bin size was 10, and 5-fold cross-validation was used in the internal stacking procedure to learn weights and identify the best performing individual bin size on the learning set. We used convex linear regression as our weight learning algorithm, which is multiple linear regression with the constraint that its coefficients are non-negative and must sum to 1 (Breiman, 1996). For the classifiers, we used random forests (RF) (Breiman, 2001), naive Bayes (NB) (Rish, 2001), decision trees (DT) (Quinlan, 1986), and k-nearest neighbours (KNN) (Altman, 1992). For the regressors, we used the least absolute shrinkage and selection operator (LS) (Tibshirani, 1996), ridge regression (RR) (Hoerl & Kennard, 1970), RF, and eXtreme gradient boosting (XGB) (Chen & Guestrin, 2016). The RF models were built with 1000 trees, a third of the number of attributes was used as the number of splitting variables considered at each node, and the defaults were used for everything else. The KNN models were tuned for the k parameter (the number of neighbours) over the values (1, 3, 5, 7, 11, 13, 15). The regularization parameter for both LS and RR was tuned using 10-fold internal cross-validation. Lastly, the learning rate of the XGB models was tuned over (0.001, 0.01, 0.1, 0.2, 0.3), with the number of rounds set to 1500. Default values were used for all other parameters. Note that, of the four regression models, only RF is capable of handling non-binary categorical input variables. Hence, LS, RR and XGB were not applied to datasets which included such inputs. Predictive performance was measured as the coefficient of determination (R²). All experiments were performed in R (Ihaka & Gentleman, 1996), and the code used in our experiments is available at https://github.com/oghenejokpeme/EFERUC.
All our results are available for independent analysis at http://dx.doi.org/10.17632/mpvwnhv4vb.2.
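For reference, the two unsupervised discretizers can be sketched as follows. This is a Python illustration (our experiments used R implementations): equal-frequency binning via quantile edges, and a minimal 1-D k-means (Lloyd's algorithm) for the clustering discretizer:

```python
import numpy as np

def freq_discretize(y, n_bins):
    """Equal-frequency discretization: bin labels from quantile edges."""
    y = np.asarray(y, float)
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))[1:-1]
    return np.searchsorted(edges, y, side="right")

def kmeans_discretize(y, n_bins, iters=100, seed=0):
    """Clustering discretization: 1-D k-means (Lloyd's algorithm), with
    labels renumbered so bin order follows the order of the centres."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    centres = rng.choice(y, size=n_bins, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centre, then recompute centres.
        labels = np.argmin(np.abs(y[:, None] - centres[None, :]), axis=1)
        new = np.array([y[labels == j].mean() if np.any(labels == j)
                        else centres[j] for j in range(n_bins)])
        if np.allclose(new, centres):
            break
        centres = new
    order = np.argsort(centres)
    remap = np.empty(n_bins, dtype=int)
    remap[order] = np.arange(n_bins)
    return remap[labels]
```

On a well-separated response the two discretizers agree; they diverge when the response density is skewed, which is exactly where the choice matters (see Sect. 5).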

SmoteR and undersampler
We compared our proposed method with two resampling approaches from Torgo et al. (2015): SmoteR and undersampler. For SmoteR, we used a relevance threshold of 0.7, k = 5, and all combinations of undersampling rates of (50, 100, 200, 300)% and oversampling rates of (200, 500)%. For undersampler, we used a relevance threshold of 0.7 and resampling rates of (50, 100, 200, 300)%. These parameter combinations were used in the original paper (Torgo et al., 2015). The results from the best performing parameters are reported. We applied the four regression methods (LS, RR, RF and XGB), tuned as detailed above, to each resampled dataset and report the R² on a test set for the best combination of resampling parameters. For the majority of outputs, the relevance function was generated automatically (the phi function in the uba package in R) with either one or two relevance bumps corresponding to the tail(s) of the distribution. However, some outputs required bespoke relevance functions, with areas of importance identified through inspection of the output histograms. We note that the undersampling algorithm and SmoteR failed in instances where no extreme values were identified by the relevance function, and in some rare cases where a dataset contained discrete or categorical input variables with very low-frequency values.

Base regressor performance
We found that the overall best performing regressor in the base case for both the frequency and clustering based discretization approaches was XGB, with RF a close second (see Tables 1 and 2). One might ask why we have separated the base case results by discretizer, given that the results should be the same either way. This is because we only included base case results where the proposed approach could be successfully executed for at least one regressor-classifier pair given a particular discretizer. This is important because the learning process can fail depending on how a response is discretized. Failure in this sense is caused either by the inability to discretize a response beyond a bin size of 1, or because the discretization could be performed but all the values in one bin were the same, causing some learners to fail. It is also worth noting that we did not perform experiments for base regressors which only support numeric input variables when the input dataset under consideration has nominal variables with more than two classes. For example, we did not build a LS regressor for the fuelCons dataset in the Branco collection (see Table 11 in "Appendix A"). Paired t-tests showed that there is a significant difference in performance between the regressor pairs in the base case for some dataset collections but not for others (see Tables 3 and 4).

Table 1 Base case mean predictive performance (R²) for the regressors we considered in our evaluation, where the discretization approach is frequency, for the data collections we considered. Note that for each data collection, we only included results from responses where (1) performance values are available for all learners, and (2) each regressor could be successfully paired with at least one other classifier without failure. The number of experiments for each data subset is given and the best performing regressor is in boldface.

Classifier performance
We observed that the best performing classifier, when paired with a particular regressor, is largely dependent on the discretization approach and varies by dataset collection. For example, RF is the best performing classifier on the Branco dataset collection for all regressors when frequency is the discretization approach. This is in contrast to when clustering is used for discretization. In this case, KNN is the best classifier for LS and RR, RF is the best classifier for the RF regressor, and NB and DT perform equally well when paired with XGB (see Tables 5 and 6). Given these results, we argue that for a new unseen problem, one would also need to select the best performing classifier to pair with a given regressor. This can be done using any standard model selection approach.

Discretizer performance
Our results suggest that the effect of the chosen discretizer is dependent on the response and the regressor-classifier pair. We only see a statistically significant difference in performance between the frequency and clustering discretizers in 13 of the 80 regressor-classifier pairs across the dataset collections we considered. However, the choice of discretizer might be worth optimising for a new problem, as the analysis for the gene expression dataset collection indicates significant effects on performance (see Table 7).

Table 6 Mean predictive performance (R²) of the better of the best individual bin size and the weighted case for each regressor-classifier pair we considered when the discretization approach is clustering. Performance values less than zero and missing values (cases where the classifier building process failed) were replaced with 0s. For each data subset, the number of experiments n for each regressor is given, and the best performing classifier is in boldface.

Algorithm performance
We compared the base regressor performance to the best performing regressor-classifier pair for the frequency and clustering discretizers. Note that the values for the regressor-classifier pairs here are the better of the best performing single bin and the weighted case. For example, assume a predicted response and LS as the base regressor. We compared the base LS performance value to the best performing pair out of LS-NB, LS-RF, LS-DT, and LS-KNN, where the value for each of these pairs is the better of the best performing single bin and the weighted case. There were a total of 853 performance data points. Our approach performed as well as the base case on 134 and strictly outperformed it on 709. The base case never outperformed our proposed approach. For the cases in which our proposed approach outperformed the base case, the average increase in predictive performance (R²) is 0.04 ± 0.06 (an average increase in performance of ∼35%). We performed significance testing using paired t-tests for each individual regressor given the two discretizers we considered. This analysis showed that the difference in performance is statistically significant for all four regressors at a significance level of 0.01 (see Table 8 and Fig. 3). The most marked improvement in performance is observed for the two linear methods, LS and RR, for both discretization approaches, whilst the performance of RF and XGB is very similar between the base case and the best regressor-classifier pair.

Table 7 The percentage of regressor-classifier pairs for which each of the considered discretization approaches (frequency/clustering) strictly outperforms the other. Note that the value for each regressor-classifier pair in this comparison is the better of the best bin and the weighted case. The number of experiments (n) for each pair is given, and an asterisk (*) indicates a statistically significant difference in performance between the discretization approaches on a paired t-test.
These results suggest that the proposed approach (a) is capable of dealing with the problem of imbalanced distributions in a target response, and (b) is highly unlikely to perform worse than the base case.

Comparison to resampling approaches
We compared the best performing resamplers (SmoteR and undersampler) to the best performing regressor-classifier pair for the frequency and clustering discretizers. Also note that, as in the previous section, the values for the regressor-classifier pairs here are the best performing of either the best performing single bin or the weighted case. We observed that for all combinations of the resamplers, discretizers, and dataset collections, our approach outperforms the resampling methods (see Table 9 and Fig. 4). Paired t-tests show that these differences in performance are statistically significant with a significance level of 0.01 (see Table 10).

Discussion and conclusion
Although we have demonstrated the utility of the proposed algorithm, we conjecture that modifications to three main components could further improve the predictive accuracy of the algorithm's output at some additional computational cost. These components are the method by which class imbalance is handled, the type of discretizer, and how weighting is performed in the internal stacking procedure. In our evaluation, we used simple upsampling to deal with class imbalance; however, we expect that a more sophisticated but more computationally expensive approach like SMOTE (Chawla et al., 2002) would perform better. The reasoning behind this is that building the classifiers with higher quality samples will lead to improved classification accuracy of the models, which will in turn improve the classification accuracy of the overall system. For the discretizers, we only considered unsupervised discretizers because of their negligible computational cost; however, we also expect overall performance improvement of the proposed algorithm with supervised discretizers. We think that supervised discretizers would produce cleaner delineations between the different bins, thus reducing the amount of noise between bin regions and improving the predictive accuracy of the classifiers.

Table 9 Comparison of the proposed approach with the resampling methods. n is the number of samples that were compared, eq is the number of times a resampling and a discretization approach perform equally well, lt is the number of times a resampling approach outperforms a discretization approach, lt_avg is the average performance increase (R²) and standard deviation of the resampling approach over a discretization approach, and gt_avg is the average performance increase (R²) and standard deviation of a discretization approach over a resampling approach. The overall best performing approach for each comparison is in boldface.
One might also choose to use a bespoke discretization approach that is tailored to a given problem using domain knowledge, which may outperform the mainstream supervised and unsupervised discretization techniques. Lastly, rather than the static weighting scheme we used in aggregating the predictions from the different bin sizes, it is possible that a dynamic weighting approach might perform better. One might argue that, depending on the choice of internal components (discretizer, class balancing, regressor, classifier, etc.), the computational cost (in money and time) of the proposed approach might be limiting. However, we argue that this is not the case, given that compute is relatively cheap and steadily getting cheaper, and that certain aspects of the algorithm can be parallelised, which should dramatically reduce computation time.
In summary, we have presented an extension to the federated ensemble regression using classification algorithm. Our evaluation shows that it is highly unlikely to perform worse than the base case, and it outperforms state-of-the-art resampling approaches intended to address distribution imbalance in target responses for regression problems.