Chained correlations for feature selection

Data-driven algorithms stand or fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings (n ≫ m). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as those that arise in differential diagnosis, are usually not utilized in this learning scheme. Nevertheless, they might provide domain knowledge on the background or context of the original diagnostic task. In this work, we discuss the possibility of incorporating samples of foreign classes in the training of diagnostic classification models, as can be related to the task of differential diagnosis. Especially in heterogeneous data collections comprising multiple diagnostic categories, the foreign ones can substantially increase the number of available samples. More precisely, we utilize this information for the internal feature selection process of diagnostic models. We propose the use of chained correlations of original and foreign diagnostic classes. This method allows the detection of intermediate foreign classes by evaluating the correlation between class labels and features for each pair of original and foreign categories. Interestingly, this criterion does not require direct comparisons of the initial diagnostic groups and might therefore be suitable for settings with restricted data access.


Introduction
Data mining and machine learning are the key technologies for molecular pheno- and genotyping as required for personalized medicine (Kraus et al. 2018). As those techniques are designed for high-dimensional data, they extend the human capability of extracting and aggregating (molecular) patterns from profiles of tens of thousands of measurements. Especially in medical applications, the resulting diagnostic models are required to be both accurate and interpretable. Human experts should be able to comprehend and intervene in automated decisions and their consequences. Furthermore, interpretable decision models also aid in generating hypotheses on the molecular causes or mechanisms of disease.
One of the most prominent techniques for improving the interpretability of high-dimensional data is feature selection (Guyon and Elisseeff 2003). It constructs low-dimensional signatures of primary measurements selected from the original high-dimensional profiles. Often these signatures are the basis for all subsequent processing steps and therefore the only interpretation of the final model. The chosen features describe the discriminative outline of the underlying molecular network.
Although they have great potential, the use of data mining and machine learning techniques can have limitations for molecular data. Due to ethical, economic or technical reasons, sample collections are typically limited in size, leading to a high contrast of feature and sample numbers (n ≫ m). This imbalance causes various effects summarized under the issue of the curse of dimensionality (Bellman 1957). For example, the linear separability of m data point dichotomies increases with the dimensionality n (Cover 1965). Simultaneously, the Euclidean distances among data points become less distinguishable, which can affect the reliability of neighbourhood networks (Hinneburg et al. 2000). The imbalance mainly influences the possibility of precise and unique parameter estimation due to the high dimensionality of the search space (Bühlmann and van de Geer 2011). For classification models, the complexities of model classes increase with the dimensionality n (Kearns and Vazirani 1994), leading to overfitting and decreased generalization performance in high-dimensional settings.
Here, learning tasks can be improved by two primary strategies. The first one is to focus on classification models that are designed for operating on a relatively small set of samples. It comprises the selection of fast converging learning algorithms and the regulation of model complexity (Vapnik 1998). The second one is the acquisition of additional information and data sources for guiding a training process. This strategy of integrating domain knowledge comprises a broad spectrum of options that can be used for navigating through the search space. Knowledge of the relationships of diagnostic classes can outline their neighborhood (Lattke et al. 2015). Known interactions of univariate features can highlight multivariate processes (Taudien et al. 2016; Lausser et al. 2016a). Identified sources of noise might be counteracted a priori (Lausser et al. 2016b).
Additional data sources can be used to extract domain knowledge within the training process of a classifier. They might comprise additional samples of the original diagnostic classes as well as unlabeled samples or samples from different categories. While samples from the original classes might be seen as a traditional extension of the training data, the other two options lie beyond the scope of supervised learning. Unlabeled samples can be incorporated via partially supervised learning techniques, such as transductive or semi-supervised learning (Vapnik 1998; Chapelle et al. 2010; Lausser et al. 2014). Samples of different classes are utilized, for example, in transfer or multi-task learning approaches (Pan and Yang 2010; Caruana 1997).
In our previous work, we have systematically investigated the potential of transferring specific feature signatures from one molecular classification task to another (Lausser et al. 2018a). We have shown that multi-class classifier systems can utilize this strategy for achieving highly accurate multi-categorical predictions (Lausser et al. 2018b). In this work, we propose a correlation-based feature selection criterion for binary classification tasks that utilizes foreign classes for extending their training sets.

Methods
In the following, we will view classification as the task of assigning an object to a class selected from a predefined set of distinct classes y ∈ Y according to a set of measurements x ∈ X ⊆ R^n,

c : X → Y, x ↦ y. (1)

We restrict ourselves to binary classification tasks (|Y| = 2). The original classification function c is typically unknown. It has to be reconstructed in a data-driven learning procedure

c_T = l(C, T). (2)

Here, C denotes an a priori chosen concept class and T = {(x_i, y_i)}_{i=1}^{m} a set of labeled training examples. Subscript T will be omitted for simplicity. In the classical supervised scheme, it is assumed that the training set T comprises only samples related to the current classification task (∀i : y_i ∈ Y). This assumption can be restrictive, as data collections might also consist of additional samples of other related classes. In the following, we assume our original classification task to be embedded in a larger context comprising additional and distinct classes Y′ ⊃ Y, as they are collected in multi-class classification tasks. Samples (x, y) ∈ T can be representative for each of these classes y ∈ Y′. The training algorithm will be allowed to utilize all samples in T. If a subprocess requires only a subset of two classes a and b, this will be denoted as

T_ab = {(x, y) ∈ T : y ∈ {a, b}}. (3)

In this case, the original class labels {a, b} are replaced by labels {0, 1} for simplicity.
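The restriction to a two-class subset with relabelled classes can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name `restrict` and the list-of-tuples representation of T are assumptions.

```python
# Sketch: restricting a multi-class training set T to the two original
# classes a and b, with labels {a, b} mapped to {0, 1} as described above.

def restrict(T, a, b):
    """Return T_ab: samples of classes a and b, relabelled as 0/1."""
    return [(x, 0 if y == a else 1) for (x, y) in T if y in (a, b)]

# toy multi-class collection with a foreign class "c"
T = [([1.0, 2.0], "a"), ([0.5, 1.5], "b"), ([2.0, 0.1], "c"), ([1.1, 2.2], "a")]
T_ab = restrict(T, "a", "b")  # keeps only the "a"/"b" samples, labelled 0 and 1
```

The full set T remains available to the feature selection step, while T_ab is used for the final model adaptation.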
In this work, we propose to utilize the full training set T for the internal feature selection process of the classifier training. The restricted set T_ab will be used for the final adaptation of the classification model. The trained classifier will later on be tested on an analogously restricted validation set V_ab.

Feature selection
Especially in high-dimensional settings, the training procedure of a classification rule can incorporate a feature selection process excluding features that are believed to be noisy, uninformative or even misguiding for the adaptation of the classifier. This selection process is typically implemented as a data-driven procedure yielding a selection of n′ ≤ n feature indices

I = {i_1, . . ., i_{n′}} ⊆ {1, . . ., n}. (4)

Subsequent training steps and the final classification model will operate on the reduced feature representation

x′ = (x(i_1), . . ., x(i_{n′}))^T. (5)

In the following, we concentrate on univariate feature selection strategies. That is, each feature is assessed via a quality score s(i) that does not take into account interactions with other candidate features. The selection is based on a vector of quality scores s = (s(1), . . ., s(n))^T. The top n′ features with the best scores are selected,

I = {i : rk_s(i) ≤ n′}, (6)

where rk_s denotes the ranking function of the elements in s.
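The top-n′ selection can be sketched in a few lines. This is a minimal illustration under the assumption that higher scores are better (as for the absolute correlations used later); the function name `select_top` is our own.

```python
import numpy as np

# Sketch of univariate top-n' feature selection: rank the score vector
# s = (s(1), ..., s(n))^T and keep the n' best-scoring feature indices.

def select_top(scores, n_sel):
    """Return the (0-based) indices of the n_sel best-scoring features."""
    order = np.argsort(scores)[::-1]  # ranking rk_s, best score first
    return np.sort(order[:n_sel])     # sorted for a stable signature layout

scores = np.array([0.1, 0.9, 0.4, 0.7])
idx = select_top(scores, 2)           # selects the features scoring 0.9 and 0.7
```

All subsequent training steps would then operate on the column subset `X[:, idx]` of the data matrix.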

Foreign classes in feature selection
As we want to analyze the possibility of utilizing foreign classes for feature selection, the chosen score will not only be evaluated for the original pair of classes a and b but also for other classes o ∈ Y′ \ {a, b}. The corresponding set of scores (obtained for the ith feature) will be denoted by S(i). The cardinality of S(i) depends on the chosen strategy for selecting pairs of classes. For aggregating the scores in S(i), we will utilize the following three strategies:

s_min(i) = min S(i),  s_mean(i) = (1/|S(i)|) Σ_{s ∈ S(i)} s,  s_max(i) = max S(i).

We utilize s_for ∈ {s_min, s_mean, s_max} to denote a general foreign feature selection strategy. A classical feature selection strategy is denoted as s_orig.
Here, we construct S(i) from various scores based on the empirical Pearson correlation between an individual feature and the class label. For the ith feature and a fixed pair of classes a, b ∈ Y, it is given by

cor_{T_ab}(i) = Σ_j (x_j(i) − x̄(i))(y_j − ȳ) / sqrt( Σ_j (x_j(i) − x̄(i))² · Σ_j (y_j − ȳ)² ),

where x̄(i) denotes the observed mean value of the ith feature and ȳ denotes the average class label.
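For a 0/1-coded class label, this feature-label Pearson correlation can be computed directly from the definition. A minimal sketch (our own helper name `cor`, assuming a sample-by-feature matrix X and a binary label vector y):

```python
import numpy as np

# Sketch: empirical Pearson correlation cor_{T_ab}(i) between the i-th
# feature and the 0/1 class label. X has shape (m, n), y is in {0, 1}^m.

def cor(X, y, i):
    xc = X[:, i] - X[:, i].mean()  # centered feature values x_j(i) - mean
    yc = y - y.mean()              # centered labels y_j - mean label
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
r = cor(X, y, 0)  # monotone ordering of the groups gives a high positive value
```

Note that with a binary label this coincides with the point-biserial correlation; swapping the 0/1 coding only flips the sign, which motivates the use of absolute values later on.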
More precisely, we investigate a score based on the Pearson correlations of a foreign class o ∈ Y′ \ {a, b} to both original classes a and b,

ccor_{T_ao, T_ob}(i) = cor_{T_ao}(i) · cor_{T_ob}(i).

In the following, we call this score chained correlation. The corresponding sets of scores are given by

S_ccor(i) = { ccor_{T_ao, T_ob}(i) : o ∈ Y′ \ {a, b} }.

The score ccor_{T_ao, T_ob}(i) is high if the foreign class o fulfils two conditions simultaneously and high correlations are obtained when related to both classes a and b. Note that o is the second class in cor_{T_ao}(i) and the first one in cor_{T_ob}(i). Class o will lead to high correlations under opposite conditions. For cor_{T_ao}(i), high positive correlations are achieved if the values of class a are lower than those of class o. For cor_{T_ob}(i), the values of class b are required to be higher.
Top scores of ccor_{T_ao, T_ob}(i) are achieved if the samples of class o (projected on the ith feature) lie in between the samples of classes a and b. A high value of ccor_{T_ao, T_ob}(i) implies that the values of class a are lower than those of class b and therefore indicates high values of cor_{T_ab}(i). The score ccor_{T_ao, T_ob}(i) can therefore be seen as a surrogate for cor_{T_ab}(i). As an analogous argumentation can be given for high negative correlations, we have chosen to consider the absolute value |ccor_{T_ao, T_ob}(i)| in our experiments.
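The chained score and its aggregation over foreign classes can be sketched as follows. This is an illustrative sketch, assuming the chained correlation is the product of the two pairwise feature-label correlations; the helper names `ccor` and `aggregate` are our own.

```python
import numpy as np

# Sketch: |ccor| for one foreign class o on a single feature dimension,
# plus the min/mean/max aggregation over the score set S(i).
# Assumption: ccor_{T_ao,T_ob} = cor_{T_ao} * cor_{T_ob} (product of the
# two pairwise correlations, with o second in a-o and first in o-b).

def cor(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def ccor(x_a, x_o, x_b):
    """Absolute chained correlation for feature values of classes a, o, b."""
    c_ao = cor(np.concatenate([x_a, x_o]),
               np.concatenate([np.zeros(len(x_a)), np.ones(len(x_o))]))
    c_ob = cor(np.concatenate([x_o, x_b]),
               np.concatenate([np.zeros(len(x_o)), np.ones(len(x_b))]))
    return abs(c_ao * c_ob)

def aggregate(scores, strategy):
    """s_min, s_mean or s_max over the foreign-class scores S(i)."""
    return {"min": min, "mean": np.mean, "max": max}[strategy](scores)

# foreign class o lies between a and b on this feature -> high |ccor|
x_a, x_o, x_b = np.array([1.0, 1.2]), np.array([2.0, 2.1]), np.array([3.0, 3.2])
s = ccor(x_a, x_o, x_b)
```

Because the score chains a to o and o to b, it never compares the samples of a and b directly, which is the property that makes it a candidate for settings with restricted data access.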

Experiments
We evaluate S_ccor using the aggregation schemes s_for ∈ {s_max, s_mean, s_min} in experiments with 9 multi-class datasets comprising multiple instances (m ≥ 59, |Y| ≥ 4). Each dataset was collected for a specific research question and is therefore analysed independently. This research question can be regarded as the common semantic (and biological) context of the classes Y. A summary of these datasets can be found in Table 1. All datasets consist of gene expression profiles (n ≥ 8740). Each feature corresponds to the expression level of an mRNA molecule of a biological sample. Within each dataset, the biological samples were prepared according to identical laboratory protocols.
We perform experiments utilizing all pairs of classes a, b ∈ Y as original classes. For an individual dataset, the experimental setup therefore consists of |Y|(|Y| − 1)/2 settings. As a reference score, the absolute Pearson correlation is chosen, s_orig = |cor_{T_ab}|.

Empirical characterization of chained correlations
In order to characterize the relations of the foreign aggregation schemes s_for and the original score s_orig, we provide their empirical joint distributions over all n features gained on datasets d_1 to d_9. A detailed example for all individual foreign scores S_ccor and the foreign aggregation schemes s_for is shown for dataset d_8. Additionally, we show examples for high-scoring features of s_for gained on datasets d_1 to d_9.

Classification experiments
We also compare the classification accuracies a_for gained by the use of the foreign aggregation strategies s_for to the accuracies a_orig of their counterparts utilizing the original score s_orig.

Empirical joint distributions of chained correlations
Fig. 3 Visualization of expression values of high-scoring features for d_1, . . ., d_9 using the aggregation strategy s_max. For every dataset, the original classes y_a, y_b ∈ Y are shown at the top and the value of s_orig to the right. For every foreign class y_c ∈ Y′ \ {y_a, y_b}, the expression values of this single class are shown on an axis and the value of s_for to the right. The values of s_for are sorted in descending order. For each dataset, the limits (minimum and maximum) of the expression values are shown at the bottom.

Panel (a) shows boxplots of these differences in fixed intervals of s_for and panel (b) the corresponding fraction of features. It can be observed that for the majority of features, s_orig is underestimated by the use of s_for. We quantify this observation by counting all features that over- and underestimate s_orig. Over all datasets, s_orig is underestimated by s_min in 98.52% of all cases and overestimated in 1.47%. For s_mean, an underestimation was observed in 92.24% of all experiments; s_orig was overestimated in 7.76%. The aggregation scheme s_max leads to an underestimation of s_orig in 81.46% of all cases and to an overestimation in 18.54%.
Figure 3 shows the expression values of high-scoring features for s_max for each dataset d_1 to d_9. It can be seen that for all these features an underestimation of s_orig holds true. Differences from s_max to s_orig of up to 0.21 (d_7) can be observed. In mean over all datasets, these differences are 0.11. Comparable figures for s_min and s_mean can be found in the supplement.

Evaluation of 10 × 10 cross-validation experiments
A comparison of the accuracies achieved in the 10 × 10 CV experiments is given in Fig. 4. All results are shown in triplets indicating the number of wins, ties and losses (w/t/l) for a specific dataset. Panel A shows the results for n = 25 features. Over all datasets, best results were achieved for s_max. For the SVM, s_max achieved better results in 37.71% of all experiments (t = 24.74%, l = 37.55%). For the 3NN, it gained better results in 47.37% (t = 14.53%, l = 38.10%). For the RF, it was better in 40.07% of all cases (t = 26.14%, l = 33.79%).
In Panel B, the results for n = 50 features are given. Here, the ranking of the foreign scores is similar to the ranking observed for n = 25 features. Best results were gained for s_max. The corresponding SVM was better in 40.92% of all experiments (t = 26.10%, l = 32.98%), the 3NN gained better results in 48.42% (t = 16.53%, l = 35.05%) and the RF outperformed its counterpart in 40.48% (t = 25.69%, l = 33.83%).
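The win/tie/loss triplets above can be tallied with a simple paired comparison over the repeated cross-validation runs. A minimal sketch with illustrative (made-up) accuracy values; the function name `wtl` is our own:

```python
# Sketch: counting wins/ties/losses (w/t/l) of a classifier using the
# foreign strategy (accuracies a_for) against the same classifier using
# the original score (accuracies a_orig), paired per CV run.

def wtl(acc_for, acc_orig):
    w = sum(f > o for f, o in zip(acc_for, acc_orig))
    t = sum(f == o for f, o in zip(acc_for, acc_orig))
    return w, t, len(acc_for) - w - t

a_for  = [0.82, 0.80, 0.79, 0.85]  # illustrative per-run accuracies
a_orig = [0.80, 0.80, 0.81, 0.83]
result = wtl(a_for, a_orig)        # -> 2 wins, 1 tie, 1 loss
```

In the actual experiments, each such triplet would aggregate the 100 paired runs of a 10 × 10 cross-validation for one dataset and class pair.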

Discussion and conclusion
In this work, we analyzed the use of chained correlations for incorporating foreign classes into the feature selection processes of binary diagnostic tasks. Here, samples of the original diagnostic classes are related to those of a foreign class for each feature. The chained correlations link the first original class to the foreign class and the foreign class to the second original class. A high score indicates a central position of the foreign class and almost disjoint positions of the original ones. A chained correlation typically underestimates the correlation gained for the original classes. The first one, therefore, might be used as a surrogate marker for the second one.
However, it might be unclear which foreign classes should be used. In our experiments, the original classes, as well as the foreign classes, were selected from the same multi-class data collection. The underlying biological samples were collected for a specific scientific purpose and preprocessed according to identical laboratory protocols. They are therefore comparable and share a common semantic (or biological) context. These constraints might be too strict, as we mainly screen for (feature-wise) central classes that imply well separable categories. Only the original diagnostic classes require an identical preprocessing to guarantee the correctness of implications from high chained correlations. However, we assume that both constraints increase the probability of detecting foreign feature-wise central classes.
We proposed three scores for combining chained correlations of multiple foreign classes. We discuss the minimum, mean and maximum score, which take into account different numbers of foreign classes. While a high minimum score requires all foreign classes to lie in between the original classes, only one central foreign class is required for a high maximum score. The minimum criterion, therefore, models the original classes as outlying classes, implying an extreme margin on the selected feature dimensions. The maximum criterion only enforces a large margin between the original classes without implications on the other foreign classes.
The design of the proposed combination schemes implies a natural order of the scores achievable for the individual features, leading to a more pronounced right-skewness of the minimum criterion than of the maximum criterion. In general, all combination schemes underestimate the real correlations between the feature values and the class labels of the original classes. Overestimation only occurs in rare cases. Higher biases and variances can be observed for the minimum strategy than for the maximum approach. Most top scores resulted in high original correlations for all combination schemes. Nevertheless, this definition might be dataset dependent.
Although individual top features for the minimum strategy look more promising than those of the maximum policy, better classification results were obtained for the latter. A reason might be the rarity of features that achieve high scores for the minimum strategy. For most class combinations, such features were not observed. As we have chosen to construct signatures of the top n′ candidates, inferior features also entered the selection. A threshold on the scores or other cut-off strategies might be more suitable (Yu and Príncipe 2019; François et al. 2007). High scores for the maximum approach were available for almost all class combinations, leading to an overall better result. This reasoning could be a seeding point for the design of more sophisticated combination strategies implementing concepts from multiobjective optimization (Deb 2001), social choice theory (Chevaleyre et al. 2007) or rank aggregation (Burkovski et al. 2014).

Figure 1
Figure 1 shows an example of the individual and aggregated chained correlations for d_8. An overview for all datasets is given in Fig. 2, where the differences s_orig − s_for are shown. The highest scores for s_for occur with s_max. Over all datasets, the average ranges are s_max ∈ [0.00, 0.84], s_mean ∈ [0.00, 0.74], s_min ∈ [0.00, 0.67].

Fig. 1
Fig. 1 Comparison of the original score s_orig and the foreign scores s_for for dataset d_8. Each scatterplot shows the profile of all n = 8740 features of d_8. a = y_1 and b = y_5 were chosen as original classes. The remaining classes were used as foreign classes o_1 = y_2, o_2 = y_3, o_3 = y_4. The first row shows the individual foreign scores s_o for o_1 to o_3. The second row gives the aggregated scores s_min, s_mean, s_max

Fig. 2
Fig. 2 Differences of the original score and the foreign scores, s_orig − s_for. The figure gives the empirical differences between the original score s_orig and the aggregated foreign scores s_min, s_mean and s_max observed for each dataset d_1 to d_9. The results for the whole feature profiles and for all pairs of original classes a, b ∈ Y are aggregated for each dataset. The foreign scores s_for were discretized in intervals of width 0.1. Panel (a) gives the ranges of differences in dependency of s_for. Panel (b) can be seen as a histogram showing the percentage of features in a specific interval of s_for. Each bar is split into the percentage of features that over- or underestimated the original score s_orig

Table 1 Summary of the datasets. The number of classes |Y|, features n, samples m and samples per class m_i are reported

Fig. 4 Panels showing how classifiers using s_for compare to the same classifiers using s_orig in 10 × 10 cross-validation experiments. For n = 25 (panel A) and n = 50 (panel B), stacked barplots are depicted according to the aggregation strategies s_for ∈ {s_min, s_mean, s_max}. Rows indicate the datasets d_1, . . ., d_9 and columns the classification algorithms SVM, 3NN and RF. Each individual barplot shows how often a classifier utilizing s_for gains a better, equal or lower accuracy a_for compared to its counterpart a_orig utilizing s_orig. Under each classification algorithm, the mean values over all comparisons are presented

gained better results in 36.93% of all cases (t = 11.35%, l = 51.72%). The RF was better in 32.01% of all settings (t = 26.05%, l = 41.94%).