1 Introduction

Data mining and machine learning are the key technologies for molecular pheno- and genotyping as required for personalized medicine (Kraus et al. 2018). As those techniques are designed for high-dimensional data, they extend the human capability of extracting and aggregating (molecular) patterns from these profiles of tens of thousands of measurements. Especially in medical applications, the resulting diagnostic models are required to be both accurate and interpretable. Human experts should be able to comprehend and intervene in automated decisions and their consequences. Furthermore, interpretable decision models also aid in generating hypotheses on the molecular causes or mechanisms of disease.

One of the most prominent techniques for improving the interpretability of high-dimensional data is feature selection (Guyon and Elisseeff 2003). It constructs low-dimensional signatures of primary measurements selected from the original high-dimensional profiles. Often these signatures are the basis for all subsequent processing steps and therefore the only interpretation of the final model. The chosen features describe the discriminative outline of the underlying molecular network.

Although they have great potential, the use of data mining and machine learning techniques can have limitations for molecular data. Due to ethical, economic or technical reasons, sample collections are typically limited in size leading to a high contrast of feature and sample numbers (\(n \gg m\)). This imbalance causes various effects summarized under the issue of the curse of dimensionality (Bellman 1957). For example, the linear separability of m data point dichotomies increases with the dimensionality n (Cover 1965). Simultaneously, the Euclidean distances among data points become less distinguishable, which can affect the reliability of neighbourhood networks (Hinneburg et al. 2000). The imbalance mainly influences the possibility of precise and unique parameter estimation due to the high dimensionality of the search space (Bühlmann and van de Geer 2011). For classification models, the complexities of model classes increase with the dimensionality n (Kearns and Vazirani 1994), leading to overfitting and decreased generalization performance in high-dimensional settings.

Here, learning tasks can be improved by two primary strategies. The first one is to focus on classification models that are designed for operating on a relatively small set of samples. It comprises the selection of fast converging learning algorithms and the regulation of model complexity (Vapnik 1998). The second one is the acquisition of additional information and data sources for guiding a training process. This strategy of integrating domain knowledge comprises a broad spectrum of options that can be used for navigating through the search space. Knowledge of the relationships of diagnostic classes can outline their neighborhood (Lattke et al. 2015). Known interactions of univariate features can highlight multivariate processes (Taudien et al. 2016; Lausser et al. 2016a). Identified sources of noise might be counteracted a priori (Lausser et al. 2016b).

Additional data sources can be used to extract domain knowledge within the training process of a classifier. They might comprise additional samples of the original diagnostic classes as well as unlabeled samples or samples from different categories. While samples from the original classes might be seen as a traditional extension of the training data, the other two options lie beyond the scope of supervised learning. Unlabeled samples can be incorporated via partially supervised learning techniques, such as transductive or semi-supervised learning (Vapnik 1998; Chapelle et al. 2010; Lausser et al. 2014). Samples of different classes are utilized, for example, in transfer or multi-task learning approaches (Pan and Yang 2010; Caruana 1997).

In our previous work, we have systematically investigated the potential of transferring specific feature signatures from one molecular classification task to another (Lausser et al. 2018a). We have shown that multi-class classifier systems can utilize this strategy for achieving highly accurate multi-categorical predictions (Lausser et al. 2018b). In this work, we propose a correlation-based feature selection criterion for binary classification tasks that utilize foreign classes for extending their training sets.

2 Methods

In the following, we will view classification as the task of assigning an object to a class selected from a predefined set of distinct classes \(y \in \mathcal {Y}\) according to a set of measurements \(\mathbf {x}\in \mathcal {X} \subseteq \mathbb {R}^{n}\)

$$\begin{aligned} c:\mathcal {X} \longrightarrow \mathcal {Y}. \end{aligned}$$

We restrict ourselves to binary classification tasks (\(|\mathcal {Y}|=2\)). The original classification function c is typically unknown. It has to be reconstructed in a data-driven learning procedure

$$\begin{aligned} l: \mathcal {C} \times \mathcal {T} \rightarrow c_{\mathcal {T}} \in \mathcal {C}. \end{aligned}$$

Here, \(\mathcal {C}\) denotes an a priori chosen concept class and \(\mathcal {T}=\{(\mathbf {x}_{i}, y_{i})\}_{i=1}^{|\mathcal {T}|}\) a set of labeled training examples. Subscript \(_{\mathcal {T}}\) will be omitted for simplicity. In the classical supervised scheme, it is assumed that the training set \(\mathcal {T}\) comprises only samples related to the current classification task \(\forall i : y_{i} \in \mathcal {Y}\). This assumption can be restrictive as data collections might also consist of additional samples of other related classes. In the following, we assume our original classification task to be embedded in a larger context comprising additional and distinct classes \(\mathcal {Y}' \supset \mathcal {Y}\) as they are collected in multi-class classification tasks. Samples \((\mathbf {x},y)\in \mathcal {T}\) can be representative for each of these classes \(y \in \mathcal {Y}'\). The training algorithm will be allowed to utilize all samples in \(\mathcal {T}\). If a subprocess requires only a subset of two classes a and b this will be denoted as

$$\begin{aligned} \mathcal {T}_{ab} = \bigl \{(\mathbf {x},\mathbb {I}_{\left[ y=b\right] }) \mid (\mathbf {x},y)\in \mathcal {T}, y\in \{a,b\} \bigr \}. \end{aligned}$$

In this case, the original class labels \(\left\{ a,b\right\} \) are replaced by labels \(\left\{ 0,1\right\} \) for simplicity.
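The restriction of \(\mathcal {T}\) to a pair of classes can be illustrated with a minimal Python sketch. The function name `restrict_to_pair` and the toy samples are hypothetical and only serve as an illustration; they are not part of the original experimental code.

```python
def restrict_to_pair(samples, a, b):
    """Build T_ab: keep only samples of classes a and b.

    The label of class a is replaced by 0 and the label of class b by 1,
    mirroring the indicator function in the definition of T_ab.
    """
    return [(x, 1 if y == b else 0) for x, y in samples if y in (a, b)]


# Hypothetical toy training set T with the original classes "a", "b"
# and one foreign class "o".
T = [([0.1, 2.0], "a"), ([0.4, 1.5], "o"), ([0.9, 1.0], "b"), ([0.2, 1.9], "a")]
T_ab = restrict_to_pair(T, "a", "b")
```

The sample of the foreign class is dropped from `T_ab`, while it remains available in `T` for the feature selection step.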

In this work, we propose to utilize the full training set \(\mathcal {T}\) for the internal feature selection process of the classifier training. The restricted set \(\mathcal {T}_{ab}\) will be used for the final adaptation of the classification model. The trained classifier will later on be tested on an analogously restricted validation set \(\mathcal {V}_{ab}\).

2.1 Feature selection

Especially in high-dimensional settings, the training procedure of a classification rule can incorporate a feature selection process excluding features that are believed to be noisy, uninformative or even misguiding for the adaptation of the classifier. This selection process is typically implemented as a data-driven procedure aiming at the selection of \(\hat{n} \le n\) feature indices

$$\begin{aligned} \mathbf {i} \in \mathcal {I} = \{\mathbf {i} \in \mathbb {N}^{\hat{n} \le n} \, | \, i_{k-1} < i_{k}, \, 1 \le i_{k} \le n\}. \end{aligned}$$

Subsequent training steps and the final classification model will operate on the reduced feature representation

$$\begin{aligned} \mathbf {x}^{(\mathbf {i})}=(x^{(i_{1})}, \ldots , x^{(i_{\hat{n}})})^{T}. \end{aligned}$$

In the following, we concentrate on univariate feature selection strategies. That is, each feature is assessed via a quality score s(i) that does not take into account interactions with other candidate features. The selection is based on a vector of quality scores

$$\begin{aligned} \mathbf {s} = \bigl (s(1),\ldots ,s(n)\bigr )^{T}. \end{aligned}$$

The top \(\hat{n}\) features with the best scores are selected

$$\begin{aligned} \mathbf {i} = \mathrm {top}_{\hat{n}}(\mathbf {s}) = \bigl \{ (\ldots ,i,\ldots )^{T} \mid \mathrm {rk}_{\mathbf {s}}(s(i)) \le \hat{n} \bigr \}, \end{aligned}$$

where \(\mathrm {rk}_{\mathbf {s}}\) denotes the ranking function of the elements in \(\mathbf {s}\).
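The univariate top-\(\hat{n}\) selection can be sketched in a few lines of Python. The sketch assumes 0-based indices (the text uses 1-based indices) and breaks ties in favour of the lower index; both choices, as well as the function names, are assumptions made for illustration.

```python
def top_n_indices(scores, n_hat):
    """Return the indices of the n_hat highest scores in increasing order.

    Ties are broken in favour of the lower index (an assumption; the text
    does not specify a tie-breaking rule).
    """
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n_hat])


def reduce_features(x, idx):
    """Project a sample x onto the selected feature indices."""
    return [x[i] for i in idx]


s = [0.10, 0.85, 0.40, 0.85, 0.05]   # hypothetical score vector
idx = top_n_indices(s, 3)
x_reduced = reduce_features([10, 11, 12, 13, 14], idx)
```

Returning the indices in increasing order matches the constraint \(i_{k-1} < i_{k}\) of the index set \(\mathcal {I}\).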

2.2 Foreign classes in feature selection

As we want to analyze the possibility of utilizing foreign classes for feature selection, the chosen score will not only be evaluated for the original pair of classes a and b but also for other classes \(o \in \mathcal {Y}\setminus \{a,b\}\). The corresponding set of scores (obtained for the ith feature) will be denoted by \(\mathcal {S}(i)\). The cardinality of \(\mathcal {S}(i)\) depends on the chosen strategy for selecting pairs of classes. For aggregating the scores in \(\mathcal {S}(i)\), we will utilize the following three strategies:

$$\begin{aligned} s_{\mathrm {max}}(i)&= \mathrm {max} \bigl (\mathcal {S}(i) \bigr ), \\ s_{\mathrm {mean}}(i)&= \mathrm {mean} \bigl (\mathcal {S}(i) \bigr ), \\ s_{\mathrm {min}}(i)&= \mathrm {min} \bigl (\mathcal {S}(i) \bigr ). \end{aligned}$$

We utilize \(s_{\mathrm {for}}\in \{s_{\mathrm {min}},s_{\mathrm {mean}},s_{\mathrm {max}}\}\) to denote a general foreign feature selection strategy. A classical feature selection strategy is denoted as \(s_{\mathrm {orig}}\).
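The three aggregation strategies can be sketched as follows. The dictionary `AGGREGATORS`, the function `aggregate` and the example scores are hypothetical names and values chosen for illustration.

```python
from statistics import mean

# Map the strategy name to the function applied to the score set S(i).
AGGREGATORS = {"max": max, "mean": mean, "min": min}


def aggregate(score_set, strategy):
    """Aggregate the foreign scores of one feature into a single score."""
    return AGGREGATORS[strategy](score_set)


# Hypothetical scores of one feature, one entry per foreign class.
S_i = [0.2, 0.5, 0.8]
s_for = {name: aggregate(S_i, name) for name in AGGREGATORS}
```

For any score set, the aggregated scores are ordered as \(s_{\mathrm {min}}(i) \le s_{\mathrm {mean}}(i) \le s_{\mathrm {max}}(i)\).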

Here, we construct \(\mathcal {S}(i)\) from various scores based on the empirical Pearson correlation between an individual feature and class label. For the ith feature and a fixed pair of classes \(a,b \in \mathcal {Y}\), it is given by

$$\begin{aligned} \mathrm {cor}_{\mathcal {T}_{ab}}(i) = \frac{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (x^{(i)}-\bar{x}^{(i)}) (y-\bar{y})}{\sqrt{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (x^{(i)}-\bar{x}^{(i)})^2}\sqrt{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (y-\bar{y})^2}}, \end{aligned}$$

where \(\bar{x}^{(i)}\) denotes the observed mean value of the ith feature and \(\bar{y}\) denotes the average class label.
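A minimal Python sketch of this correlation between one feature and the binary class label is given below. The normalization factors \(\frac{1}{|\mathcal {T}_{ab}|-1}\) cancel and are therefore omitted. Returning 0 for a constant feature or a single-class sample set is an assumption of this sketch, as the formula is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Empirical Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlations are scored as 0
    return cov / math.sqrt(var_x * var_y)


def cor(samples_ab, i):
    """Correlation of the ith feature (0-based) with the labels in T_ab."""
    xs = [x[i] for x, _ in samples_ab]
    ys = [y for _, y in samples_ab]
    return pearson(xs, ys)


# Hypothetical T_ab with a single feature; class 0 has the lower values.
T_ab = [([1.0], 0), ([2.0], 0), ([3.0], 1), ([4.0], 1)]
r = cor(T_ab, 0)
```

Swapping the labels 0 and 1 flips the sign of the correlation but leaves its absolute value unchanged.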

More precisely, we investigate a score based on the Pearson correlations of foreign classes \(o \in \mathcal {Y}\setminus \{a,b\}\) to both original classes a and b,

$$\begin{aligned} \mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i) = \frac{\mathrm {cor}_{\mathcal {T}_{ao}}(i) + \mathrm {cor}_{\mathcal {T}_{ob}}(i)}{2} \in \left[ -1;+1 \right] . \end{aligned}$$

In the following we call this score chained correlation. The corresponding sets of scores are given by

$$\begin{aligned} \mathcal {S}_{\mathrm {ccor}}(i) = \Bigl \{ |\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob}}(i) | \, \Big | \, o \in \mathcal {Y}\setminus \{a,b\} \Bigr \}. \end{aligned}$$

The score \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) is high if the foreign class o fulfils two conditions simultaneously: high correlations must be obtained both when o is related to class a and when it is related to class b. Note that o is the second class in \(\mathrm {cor}_{\mathcal {T}_{ao}}(i)\) and the first one in \(\mathrm {cor}_{\mathcal {T}_{ob}}(i)\). As

$$\begin{aligned} \forall \, c,d \in \mathcal {Y}, c\ne d: \mathrm {cor}_{\mathcal {T}_{cd}}(i) = -\mathrm {cor}_{\mathcal {T}_{dc}}(i), \end{aligned}$$

class o will lead to high correlations under opposite conditions.

For \(\mathrm {cor}_{\mathcal {T}_{ao}}(i)\), high positive correlations are achieved if the values of class a are lower than those of class o. For \(\mathrm {cor}_{\mathcal {T}_{ob}}(i)\), the values of class b are required to be higher.

Top scores of \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) are achieved if the samples of class o (projected on the ith feature) lie in between the samples of classes a and b. A high value of \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) implies that the values of class a are lower than those of class b and therefore indicates high values of \(\mathrm {cor}_{\mathcal {T}_{ab}}(i)\). The score \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) can therefore be seen as a surrogate for \(\mathrm {cor}_{\mathcal {T}_{ab}}(i)\). As an analogous argumentation can be given for high negative correlations, we have chosen to consider the absolute value \(|\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)|\) in our experiments.
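The behaviour of the chained correlation can be illustrated with a small Python sketch for a single feature. The helper names and the toy values are hypothetical; the class "far" is an invented foreign class that lies outside the range spanned by a and b.

```python
import math


def pearson(xs, ys):
    """Empirical Pearson correlation (normalization factors cancel)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cor_pair(values, first, second):
    """cor for the pair (first, second): first is labeled 0, second 1."""
    xs = values[first] + values[second]
    ys = [0] * len(values[first]) + [1] * len(values[second])
    return pearson(xs, ys)


def ccor(values, a, b, o):
    """Chained correlation linking a to o and o to b."""
    return (cor_pair(values, a, o) + cor_pair(values, o, b)) / 2


# Hypothetical values of one feature, grouped by class.
feature = {
    "a": [1.0, 1.2, 0.9],
    "o": [2.0, 2.1, 1.9],    # lies in between a and b
    "b": [3.0, 3.2, 2.9],
    "far": [6.0, 6.1, 5.9],  # lies above both a and b
}
central = ccor(feature, "a", "b", "o")
outlying = ccor(feature, "a", "b", "far")
```

For the central class, both correlations are strongly positive and the chained correlation is close to 1. For the outlying class, the two correlations have opposite signs and nearly cancel.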

3 Experiments

We evaluate \(\mathcal {S}_{\mathrm {ccor}}\) using the aggregation schemes \(s_{\mathrm {for}} \in \{s_{\mathrm {max}},s_{\mathrm {mean}},s_{\mathrm {min}}\}\) in experiments with 9 multi-class datasets comprising multiple instances (\(m \ge 59\), \(|\mathcal {Y}|\ge 4\)). Each dataset was collected for a specific research question and is therefore analysed independently. This research question can be regarded as the common semantic (and biological) context of the classes \(\mathcal {Y}\). A summary of these datasets can be found in Table 1. All datasets consist of gene expression profiles (\(n\ge 8740\)). Each feature corresponds to the expression level of an mRNA molecule of a biological sample. Within each dataset, the biological samples were prepared according to identical laboratory protocols.

We perform experiments utilizing all pairs of classes \(a,b \in \mathcal {Y}\) as original classes. For an individual dataset, the experimental setup therefore consists of \(\frac{|\mathcal {Y}|(|\mathcal {Y}|-1)}{2}\) settings. As a reference score, the absolute Pearson correlation is chosen, \(s_{\mathrm {orig}}=|\mathrm {cor}_{\mathcal {T}_{ab}}|\).
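The enumeration of the binary settings of one dataset can be sketched as follows. The class names \(y_{1}, \ldots , y_{5}\) follow the example of dataset \(d_{8}\) in Fig. 1; the function names are hypothetical.

```python
from itertools import combinations


def binary_tasks(classes):
    """All unordered pairs (a, b) of original classes."""
    return list(combinations(sorted(classes), 2))


def foreign_classes(classes, a, b):
    """All classes that can serve as foreign class o for the pair (a, b)."""
    return [o for o in sorted(classes) if o not in (a, b)]


classes = ["y1", "y2", "y3", "y4", "y5"]
tasks = binary_tasks(classes)
foreign = foreign_classes(classes, "y1", "y5")
```

With five classes, this yields \(5 \cdot 4 / 2 = 10\) settings, each with three foreign classes.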

Table 1 Table of datasets

3.1 Empirical characterization of chained correlations

In order to characterize the relations of the foreign aggregation schemes \(s_{\mathrm {for}}\) and the original score \(s_{\mathrm {orig}}\) we provide their empirical joint distributions over all n features gained on datasets \(d_{1}-d_{9}\). A detailed example for all individual foreign scores \(\mathcal {S}_{\mathrm {ccor}}\) and the foreign aggregation schemes \(s_{\mathrm {for}}\) is shown for dataset \(d_{8}\). Additionally we show examples for high scoring features of \(s_{\mathrm {for}}\) gained on datasets \(d_1-d_9\).

3.2 Classification experiments

We also compare the classification accuracies \(a_{\mathrm {for}}\) gained by the use of foreign aggregation strategies \(s_{\mathrm {for}}\) to the classification accuracy \(a_{\mathrm {orig}}\) gained by the original feature selection score \(s_{\mathrm {orig}}\). As classification algorithms, linear support vector machines (Vapnik 1998) (SVM, \(cost = 1\)), random forests (Breiman 2001) (RF, \(ntree = 500\)) and k-nearest neighbor classifiers (Fix and Hodges 1951) (k-NN, \(k = 3\)) were chosen. The accuracies are estimated in stratified \(10 \times 10\) cross-validation (\(10 \times 10\) CV) experiments (Japkowicz and Shah 2011). All experiments are performed in the TunePareto-Framework (Müssel et al. 2012).

For each multi-class dataset with \(| \mathcal {Y} |\) classes we analyze \(\frac{| \mathcal {Y} | (| \mathcal {Y} | -1 )}{2}\) two-class classification tasks. For each aggregation strategy \(s_{\mathrm {for}}\), feature signatures of \(\hat{n} \in \{ 25,50 \}\) features are generated based on the samples of the remaining \(|\mathcal {Y}|-2\) classes.
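The index bookkeeping of a stratified \(10 \times 10\) cross-validation can be sketched in plain Python. This is a simplified illustration under assumed conventions (round-robin fold assignment, a fixed random seed) and not the implementation of the TunePareto framework; all names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Distribute sample indices over k folds, class by class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # Round-robin assignment keeps the class ratio similar in all folds.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


def repeated_stratified_cv(labels, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train indices, test indices) for r x k CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        for f, test_idx in enumerate(stratified_folds(labels, k, rng)):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            # Feature selection and classifier training would use train_idx
            # only; the accuracy would be estimated on test_idx.
            yield r, f, train_idx, sorted(test_idx)


labels = [0] * 30 + [1] * 20   # hypothetical binary task
splits = list(repeated_stratified_cv(labels, k=10, repeats=10, seed=1))
```

Because feature selection is part of the training procedure, it has to be repeated within each training fold rather than performed once on the complete dataset.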

Fig. 1 Comparison of the original score \(s_{\mathrm {orig}}\) and the foreign scores \(s_{\mathrm {for}}\) for dataset \(d_{8}\). Each scatterplot shows the profile of all \(n=8740\) features of \(d_{8}\). \(a=y_{1}\) and \(b=y_{5}\) were chosen as original classes. The remaining classes were used as foreign classes \(o_{1}=y_{2}\), \(o_{2}=y_{3}\), \(o_{3}=y_{4}\). The first row shows the individual foreign scores \(s_{\mathrm {o}}\) for \(o_{1} - o_{3}\). The second row gives the aggregated scores \(s_{\mathrm {min}}\), \(s_{\mathrm {mean}}\), \(s_{\mathrm {max}}\)

Fig. 2 Differences of the original score and the foreign scores \(s_{\mathrm {orig}}-s_{\mathrm {for}}\). The figure gives the empirical differences between the original score \(s_{\mathrm {orig}}\) and the aggregated foreign scores \(s_{\mathrm {min}}\), \(s_{\mathrm {mean}}\), \(s_{\mathrm {max}}\) observed for each dataset \(d_{1}-d_{9}\). The results for the whole feature profiles and for all pairs of original classes \(a,b \in \mathcal {Y}\) are aggregated for each dataset. The foreign scores \(s_{\mathrm {for}}\) were discretized in intervals of width 0.1. Panel (a) gives the ranges of differences in dependency of \(s_{\mathrm {for}}\). Panel (b) can be seen as a histogram showing the percentage of features in a specific interval of \(s_{\mathrm {for}}\). Each bar is split into the percentages of features that over- or underestimated the original score \(s_{\mathrm {orig}}\)

Fig. 3 Visualization of expression values of high scoring features for \(d_1, \ldots , d_9\) using the aggregation strategy \(s_{\mathrm {max}}\). For every dataset, the original classes \(y_a, y_b \in \mathcal {Y}\) are shown at the top, the value of \(s_{\mathrm {orig}}\) is shown to the right. For every foreign class \(y_c \in \mathcal {Y} \setminus \{ y_a, y_b \}\), the expression values of this single class are shown on an axis and the value of \(s_{\mathrm {for}}\) to the right. The values of \(s_{\mathrm {for}}\) are sorted in descending order. For each dataset, the limits (minimum and maximum) of the expression values are shown at the bottom

4 Results

4.1 Empirical joint distributions of chained correlations

Figure 1 shows an example of the individual and aggregated chained correlations for \(d_{8}\). An overview for all datasets is given in Fig. 2. The differences \(s_{\mathrm {orig}} - s_{\mathrm {for}}\) are shown. The highest scores for \(s_{\mathrm {for}}\) occur with \(s_{\mathrm {max}}\). Over all datasets, the average ranges are \(s_{\mathrm {max}} \in [0.00,0.84]\), \(s_{\mathrm {mean}} \in [0.00,0.74]\), \(s_{\mathrm {min}}\in [0.00,0.67]\).

Panel (a) shows boxplots of these differences in fixed intervals of \(s_{\mathrm {for}}\) and panel (b) the corresponding fraction of features. It can be observed that for the majority of features \(s_{\mathrm {orig}}\) is underestimated by the use of \(s_{\mathrm {for}}\). We quantify this observation by counting all features that over- and underestimate \(s_{\mathrm {orig}}\). Over all datasets, \(s_{\mathrm {orig}}\) is underestimated by \(s_{\mathrm {min}}\) in 98.52% of all cases and overestimated in 1.47%. For \(s_{\mathrm {mean}}\), an underestimation was observed for 92.24% of all experiments. \(s_{\mathrm {orig}}\) was overestimated in 7.76%. The aggregation scheme \(s_{\mathrm {max}}\) leads to an underestimation of \(s_{\mathrm {orig}}\) in 81.46% of all cases and to an overestimation in 18.54%.
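The counting of over- and underestimations can be sketched as follows. A feature counts as underestimated if its foreign score is lower than its original score. The score vectors are invented toy values and the function name is hypothetical.

```python
def estimation_rates(s_orig, s_for):
    """Percentages of features whose s_for under- or overestimates s_orig."""
    n = len(s_orig)
    under = sum(f < o for o, f in zip(s_orig, s_for))
    over = sum(f > o for o, f in zip(s_orig, s_for))
    return 100 * under / n, 100 * over / n


# Hypothetical scores of four features.
s_orig = [0.5, 0.3, 0.9, 0.2]
s_for = [0.4, 0.35, 0.6, 0.2]
under_pct, over_pct = estimation_rates(s_orig, s_for)
```

Features with identical scores are counted in neither group, so the two percentages do not necessarily sum to 100.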

Figure 3 shows the expression values of high scoring features for \(s_{\mathrm {max}}\) for each dataset \(d_1-d_9\). It can be seen that for all these features an underestimation of \(s_{\mathrm {orig}}\) holds true. Differences from \(s_{\mathrm {max}}\) to \(s_{\mathrm {orig}}\) up to 0.21 (\(d_7\)) can be observed. On average over all datasets, these differences are 0.11. Comparable figures for \(s_{\mathrm {min}}\) and \(s_{\mathrm {mean}}\) can be found in the supplement.

Fig. 4 Panels showing how classifiers using \(s_{\mathrm {for}}\) compare to the same classifiers using \(s_{\mathrm {orig}}\) in \(10 \times 10\) cross-validation experiments. For \(\hat{n} = 25\) (panel A) and \(\hat{n} = 50\) (panel B), stacked barplots are depicted according to the aggregation strategies \(s_{\mathrm {for}} \in \{s_{\mathrm {min}}, s_{\mathrm {mean}}, s_{\mathrm {max}} \}\). Rows indicate the datasets \(d_1, \ldots , d_{9}\) and columns the classification algorithms SVM, 3NN and RF. Each individual barplot shows how often a classifier utilizing \(s_{\mathrm {for}}\) gains a better, equal or lower accuracy \(a_{\mathrm {for}}\) compared to its counterpart \(a_{\mathrm {orig}}\) utilizing \(s_{\mathrm {orig}}\). Under each classification algorithm, the mean values over all comparisons are presented

4.2 Evaluation of \(10\times 10\) cross-validation experiments

A comparison of the accuracies achieved in the \(10 \times 10\) CV experiments is given in Fig. 4. All results are shown in triplets indicating the number of wins, ties and losses (w/t/l) for a specific dataset. Panel A shows the result for \(\hat{n}=25\) features. Over all datasets, best results were achieved for \(s_{\mathrm {max}}\). For the SVM, \(s_{\mathrm {max}}\) achieved better results for 37.71% of all experiments (t = 24.74%/l = 37.55%). For the 3NN it gained better results for 47.37% (t = 14.53%/l = 38.10%). For RF, it was better in 40.07% of all cases (t = 26.14%/l = 33.79%).

Comparable results were observed for \(s_{\mathrm {mean}}\). Here the SVM based on \(s_{\mathrm {mean}}\) achieved higher accuracies in 38.43% of all experiments (t = 23.55%/l = 38.02%). The 3NN gained better results in 36.93% of all cases (t = 11.35%/l = 51.72%). The RF was better in 32.01% of all settings (t = 26.05%/l = 41.94%).

The lowest performance was achieved by \(s_{\mathrm {min}}\). For the SVM, \(s_{\mathrm {min}}\) won 31.12% of all comparisons (t = 23.82%/l = 45.05%). For 3NN, it achieved better results for 35.72% of all cases (t = 8.90%/l = 55.39%). For RF, 31.07% wins (t = 22.78%/l = 46.14%) were observed.

In Panel B the results for \(\hat{n}=50\) features are given. Here, the ranking of the foreign scores is similar to the ranking observed for \(\hat{n}=25\) features. Best results were gained for \(s_{\mathrm {max}}\). The corresponding SVM was better for 40.92% of all experiments (t = 26.10%/l = 32.98%), 3NN gained better results in 48.42% (t = 16.53%/l = 35.05%) and the RF outperformed its counterpart in 40.48% (t = 25.69%/l = 33.83%).

Coupled to \(s_{\mathrm {mean}}\), the SVM achieved better performance in 38.38% of all experiments (t = 26.39%/l = 35.22%). The 3NN shows higher accuracies in 38.62% of all cases (t = 13.61%/l = 47.77%). The RF had higher accuracies in 34.49% of all settings (t = 24.35%/l = 41.17%).

For \(s_{\mathrm {min}}\), the lowest performances were again achieved. For SVM, it achieved better results for 35.50% of all cases (t = 26.16%/l = 38.34%). For 3NN, 28.22% wins (t = 15.15%/l = 56.63%) were observed. For the RF, \(s_{\mathrm {min}}\) won 32.06% of all comparisons (t = 23.80%/l = 44.14%).
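The win/tie/loss triplets reported above can be obtained from paired accuracies as sketched below. The accuracy values are invented for illustration, and the exact comparison of floating point accuracies is an assumption of this sketch.

```python
def win_tie_loss(acc_for, acc_orig):
    """Percentages of wins, ties and losses of a_for against a_orig."""
    n = len(acc_for)
    wins = sum(f > o for f, o in zip(acc_for, acc_orig))
    ties = sum(f == o for f, o in zip(acc_for, acc_orig))
    losses = n - wins - ties
    return 100 * wins / n, 100 * ties / n, 100 * losses / n


# Hypothetical accuracies of four binary classification tasks.
acc_for = [0.90, 0.80, 0.70, 0.85]
acc_orig = [0.85, 0.80, 0.75, 0.80]
w, t, l = win_tie_loss(acc_for, acc_orig)
```

Each pair of accuracies corresponds to one binary task evaluated with the same classification algorithm and the same signature size \(\hat{n}\).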

5 Discussion and conclusion

In this work, we analyzed the use of chained correlations for incorporating foreign classes in the feature selection processes of binary diagnostic tasks. Here, samples of the original diagnostic classes are related to those of a foreign class for each feature. The chained correlations link the first original class to the foreign class and the foreign class to the second original class. A high score indicates a central position of the foreign class and almost disjoint positions of the original ones. A chained correlation typically underestimates the correlation gained for the original classes. The chained correlation might therefore be used as a surrogate marker for the original correlation.

However, it might be unclear which foreign classes should be used. In our experiments, the original classes, as well as the foreign classes, were selected from the same multi-class data collection. The underlying biological samples were collected for a specific scientific purpose and preprocessed according to identical laboratory protocols. They are therefore comparable and share a common semantic (or biological) context. These constraints might be too strict as we mainly screen for (feature-wise) central classes that imply well separable categories. Only the original diagnostic classes require an identical preprocessing to guarantee the correctness of implications from high chained correlations. However, we assume that both constraints increase the probability of detecting foreign feature-wise central classes.

We proposed three scores for combining the chained correlations of multiple foreign classes: the minimum, the mean and the maximum score. They differ in the number of foreign classes that have to support a feature. While a high minimum score requires all foreign classes to lie in between the original classes, only one central foreign class is required for a high maximum score. The minimum criterion, therefore, models the original classes as outlying classes implying an extreme margin on the selected feature dimensions. The maximum criterion only enforces a large margin between the original classes without implications on the other foreign classes.

The design of the proposed combination schemes implies a natural order of the scores achievable for the individual features leading to more pronounced right-skewness of the minimum criterion than for the maximum criterion. In general, all combination schemes underestimate the real correlations between the feature values and the class labels of the original classes. Overestimation only occurs in rare cases. Higher biases and variances can be observed for the minimum strategy than for the maximum approach. Most top scores resulted in high original correlations for all combination schemes. Nevertheless, this definition might be dataset dependent.

Although individual top features for the minimum strategy look more promising than those of the maximum policy, better classification results were obtained for the latter one. A reason might be the rarity of features that achieve high scores for the minimum strategy. Such features were not observed for all class combinations. As we have chosen to construct signatures of the top k candidates, inferior features also entered the selection. A threshold on the scores or other cut-off strategies might be more suitable (Yu and Príncipe 2019; François et al. 2007). High scores for the maximum approach were available for almost all class combinations, leading to an overall better result. This reasoning could be a seeding point for the design of more sophisticated combination strategies implementing concepts from multi-objective optimization (Deb 2001), social choice theory (Chevaleyre et al. 2007) or rank aggregation (Burkovski et al. 2014).