In the following, we will view classification as the task of assigning an object to a class selected from a predefined set of distinct classes \(y \in \mathcal {Y}\) according to a set of measurements \(\mathbf {x}\in \mathcal {X} \subseteq \mathbb {R}^{n}\), i.e., as a mapping
$$\begin{aligned} c:\mathcal {X} \longrightarrow \mathcal {Y}. \end{aligned}$$
(1)
We restrict ourselves to binary classification tasks (\(|\mathcal {Y}|=2\)). The original classification function c is typically unknown. It has to be reconstructed in a data-driven learning procedure
$$\begin{aligned} l: \mathcal {C} \times \mathcal {T} \rightarrow c_{\mathcal {T}} \in \mathcal {C}. \end{aligned}$$
(2)
Here, \(\mathcal {C}\) denotes an a priori chosen concept class and \(\mathcal {T}=\{(\mathbf {x}_{i}, y_{i})\}_{i=1}^{|\mathcal {T}|}\) a set of labeled training examples. The subscript \(_{\mathcal {T}}\) will be omitted for simplicity. In the classical supervised scheme, it is assumed that the training set \(\mathcal {T}\) comprises only samples related to the current classification task, i.e., \(\forall i : y_{i} \in \mathcal {Y}\). This assumption can be restrictive, as data collections might also contain additional samples of other, related classes. In the following, we assume our original classification task to be embedded in a larger context comprising additional, distinct classes \(\mathcal {Y}' \supset \mathcal {Y}\), as they are collected in multi-class classification tasks. Samples \((\mathbf {x},y)\in \mathcal {T}\) can be representative of any of these classes \(y \in \mathcal {Y}'\). The training algorithm will be allowed to utilize all samples in \(\mathcal {T}\). If a subprocess requires only the subset restricted to two classes a and b, this will be denoted as
$$\begin{aligned} \mathcal {T}_{ab} = \bigl \{(\mathbf {x},\mathbb {I}_{\left[ y=b\right] }) \mid (\mathbf {x},y)\in \mathcal {T}, y\in \{a,b\} \bigr \}. \end{aligned}$$
(3)
In this case, the original class labels \(\left\{ a,b\right\} \) are replaced by labels \(\left\{ 0,1\right\} \) for simplicity.
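As a minimal sketch of this restriction (the function name restrict, the NumPy arrays X and y, and the row-wise sample layout are our own illustrative choices, not notation from the text), Eq. (3) could be realized as follows:

```python
import numpy as np

def restrict(X, y, a, b):
    """Restriction of Eq. (3): keep only samples of classes a and b and
    replace their labels by the indicator I[y = b] (0 for a, 1 for b)."""
    mask = np.isin(y, [a, b])            # rows belonging to class a or class b
    X_ab = X[mask]                       # measurements of the retained samples
    y_ab = (y[mask] == b).astype(int)    # relabel: class a -> 0, class b -> 1
    return X_ab, y_ab
```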
In this work, we propose to utilize the full training set \(\mathcal {T}\) for the internal feature selection process of the classifier training. The restricted set \(\mathcal {T}_{ab}\) will be used for the final adaptation of the classification model. The trained classifier will later be evaluated on an analogously restricted validation set \(\mathcal {V}_{ab}\).
Feature selection
Especially in high-dimensional settings, the training procedure of a classification rule can incorporate a feature selection process that excludes features believed to be noisy, uninformative or even misleading for the adaptation of the classifier. This selection process is typically implemented as a data-driven procedure yielding the selection of \(\hat{n} \le n\) feature indices
$$\begin{aligned} \mathbf {i} \in \mathcal {I} = \{\mathbf {i} \in \mathbb {N}^{\hat{n} \le n} \, | \, i_{k-1} < i_{k}, \, 1 \le i_{k} \le n\}. \end{aligned}$$
(4)
Subsequent training steps and the final classification model will operate on the reduced feature representation
$$\begin{aligned} \mathbf {x}^{(i)}=(x^{(i^{(1)})}, \ldots , x^{(i^{(\hat{n})})})^{T}. \end{aligned}$$
(5)
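A correspondingly small sketch of this projection, again assuming the row-wise NumPy layout from above (note the shift from the 1-based feature indices of the text to 0-based array indices):

```python
import numpy as np

def reduce_features(X, idx):
    """Projection of Eq. (5): keep only the selected feature columns.

    idx is assumed to be a sorted array of 0-based column indices,
    whereas the text uses 1-based feature indices.
    """
    return X[:, np.asarray(idx)]
```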
In the following, we concentrate on univariate feature selection strategies. That is, each feature is assessed via a quality score s(i) that does not take interactions with other candidate features into account. The selection is based on a vector of quality scores
$$\begin{aligned} \mathbf {s} = \bigl (s(1),\ldots ,s(n)\bigr )^{T}. \end{aligned}$$
(6)
The top \(\hat{n}\) features with the best scores are selected
$$\begin{aligned} \mathbf {i} = \mathrm {top}_{\hat{n}}(\mathbf {s}) = \bigl \{ (\ldots ,i,\ldots )^{T} \mid \mathrm {rk}_{\mathbf {s}}(s(i)) \le \hat{n} \bigr \}, \end{aligned}$$
(7)
where \(\mathrm {rk}_{\mathbf {s}}\) denotes the ranking function of the elements in \(\mathbf {s}\).
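A possible realization of Eq. (7), under the assumption that larger scores indicate better features (for scores where smaller values are better, the sign can simply be flipped); the ranking function \(\mathrm {rk}_{\mathbf {s}}\) is approximated here via argsort, with ties broken by position:

```python
import numpy as np

def top_n(scores, n_hat):
    """Top-n_hat selection of Eq. (7), assuming that larger scores are better."""
    order = np.argsort(scores)[::-1]   # feature indices from best to worst score
    return np.sort(order[:n_hat])      # keep the n_hat best, in increasing order
```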
Foreign classes in feature selection
As we want to analyze the possibility of utilizing foreign classes for feature selection, the chosen score will not only be evaluated for the original pair of classes a and b but also for other classes \(o \in \mathcal {Y}'\setminus \{a,b\}\). The corresponding set of scores (obtained for the ith feature) will be denoted by \(\mathcal {S}(i)\). The cardinality of \(\mathcal {S}(i)\) depends on the chosen strategy for selecting pairs of classes. For aggregating the scores in \(\mathcal {S}(i)\), we will utilize the following three strategies:
$$\begin{aligned} s_{\mathrm {max}}(i)= & {} \mathrm {max} \bigl (\mathcal {S}(i) \bigr ), \end{aligned}$$
(8)
$$\begin{aligned} s_{\mathrm {mean}}(i)= & {} \mathrm {mean} \bigl (\mathcal {S}(i) \bigr ), \end{aligned}$$
(9)
$$\begin{aligned} s_{\mathrm {min}}(i)= & {} \mathrm {min} \bigl (\mathcal {S}(i) \bigr ). \end{aligned}$$
(10)
We utilize \(s_{\mathrm {for}}\in \{s_{\mathrm {min}},s_{\mathrm {mean}},s_{\mathrm {max}}\}\) to denote a general foreign feature selection strategy. A classical feature selection strategy is denoted as \(s_{\mathrm {orig}}\).
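The three aggregation strategies of Eqs. (8)-(10) translate directly into code; a minimal sketch (the dictionary name AGGREGATORS and the function name s_for are our own, and the score set \(\mathcal {S}(i)\) is assumed to be given as a list or 1-D array):

```python
import numpy as np

# Aggregation strategies of Eqs. (8)-(10) over the score set S(i).
AGGREGATORS = {"max": np.max, "mean": np.mean, "min": np.min}

def s_for(S_i, strategy="mean"):
    """Collapse the score set S(i) into a single aggregated score."""
    return AGGREGATORS[strategy](np.asarray(S_i, dtype=float))
```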
Here, we construct \(\mathcal {S}(i)\) from various scores based on the empirical Pearson correlation between an individual feature and the class label. For the ith feature and a fixed pair of classes \(a,b \in \mathcal {Y}\), it is given by
$$\begin{aligned} \mathrm {cor}_{\mathcal {T}_{ab}}(i) = \frac{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (x^{(i)}-\bar{x}^{(i)}) (y-\bar{y})}{\sqrt{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (x^{(i)}-\bar{x}^{(i)})^2}\sqrt{\frac{1}{|\mathcal {T}_{ab}|-1}\sum _{(\mathbf {x},y)\in \mathcal {T}_{ab}} (y-\bar{y})^2}}, \end{aligned}$$
(11)
where \(\bar{x}^{(i)}\) denotes the observed mean value of the ith feature and \(\bar{y}\) denotes the average class label.
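One way to compute Eq. (11), assuming a restricted sample as produced by the restrict sketch above (the helper name cor is again our own, and the ith feature is assumed not to be constant on the restricted sample, as the coefficient would otherwise be undefined):

```python
import numpy as np

def cor(X_ab, y_ab, i):
    """Empirical Pearson correlation of Eq. (11) between the i-th feature
    and the 0/1 class label on a restricted sample (X_ab, y_ab)."""
    # np.corrcoef returns the 2x2 correlation matrix of (x, y);
    # its off-diagonal entry is the desired coefficient.
    return np.corrcoef(X_ab[:, i], y_ab)[0, 1]
```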
More precisely, we investigate a score based on the Pearson correlations of foreign classes \(o \in \mathcal {Y}'\setminus \{a,b\}\) to both original classes a and b,
$$\begin{aligned} \mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i) = \frac{\mathrm {cor}_{\mathcal {T}_{ao}}(i) + \mathrm {cor}_{\mathcal {T}_{ob}}(i)}{2} \in \left[ -1;+1 \right] . \end{aligned}$$
(12)
In the following, we call this score the chained correlation. The corresponding sets of scores are given by
$$\begin{aligned} \mathcal {S}_{\mathrm {ccor}}(i) = \Bigl \{ |\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob}}(i) | \, \Big | \, o \in \mathcal {Y}'\setminus \{a,b\} \Bigr \}. \end{aligned}$$
(13)
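Under the same assumptions as above, and reusing the hypothetical restrict and cor helpers sketched earlier, the chained correlation of Eq. (12) and the score set of Eq. (13) might be computed as follows:

```python
def ccor(X, y, i, a, b, o):
    """Chained correlation of Eq. (12): average of cor on T_ao and T_ob."""
    X_ao, y_ao = restrict(X, y, a, o)
    X_ob, y_ob = restrict(X, y, o, b)
    return 0.5 * (cor(X_ao, y_ao, i) + cor(X_ob, y_ob, i))

def score_set_ccor(X, y, i, a, b, classes):
    """Score set S_ccor(i) of Eq. (13): absolute chained correlations
    over all foreign classes o."""
    return [abs(ccor(X, y, i, a, b, o)) for o in classes if o not in (a, b)]
```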
The score \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) is high if the foreign class o fulfils two conditions simultaneously: high correlations have to be obtained in relation to class a and in relation to class b. Note that o is the second class in \(\mathrm {cor}_{\mathcal {T}_{ao}}(i)\) and the first one in \(\mathrm {cor}_{\mathcal {T}_{ob}}(i)\). As
$$\begin{aligned} \forall \, c,d \in \mathcal {Y}', c\ne d: \mathrm {cor}_{\mathcal {T}_{cd}}(i) = -\mathrm {cor}_{\mathcal {T}_{dc}}(i), \end{aligned}$$
(14)
class o will lead to high correlations under opposite conditions in the two terms.
For \(\mathrm {cor}_{\mathcal {T}_{ao}}(i)\), high positive correlations are achieved if the values of class a are lower than those of class o. For \(\mathrm {cor}_{\mathcal {T}_{ob}}(i)\), the values of class b are required to be higher than those of class o.
Top scores of \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) are achieved if the samples of class o (projected onto the ith feature) lie in between the samples of classes a and b. A high value of \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) implies that the values of class a are lower than those of class b and therefore indicates a high value of \(\mathrm {cor}_{\mathcal {T}_{ab}}(i)\). The score \(\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)\) can therefore be seen as a surrogate for \(\mathrm {cor}_{\mathcal {T}_{ab}}(i)\). As an analogous argument can be made for high negative correlations, we have chosen to consider the absolute value \(|\mathrm {ccor}_{\mathcal {T}_{ao},\mathcal {T}_{ob} }(i)|\) in our experiments.
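The surrogate behaviour can be illustrated with a small synthetic example (all numbers are fabricated purely for illustration), again reusing the helpers sketched above:

```python
import numpy as np

# Fabricated one-dimensional data: on the single feature, class o lies in
# between classes a and b, so the chained correlation should be strongly
# positive and close to the direct score on T_ab.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.5, 30),    # class a: low feature values
                    rng.normal(2.0, 0.5, 30),    # class o: intermediate values
                    rng.normal(4.0, 0.5, 30)])   # class b: high feature values
X = X.reshape(-1, 1)
y = np.array(["a"] * 30 + ["o"] * 30 + ["b"] * 30)

print(ccor(X, y, 0, "a", "b", "o"))               # chained correlation (high)
print(cor(*restrict(X, y, "a", "b"), 0))          # direct score on T_ab
```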