1 Introduction

Learning from data is one of the central goals of machine learning research. The statistical and data-mining communities have long focused on building simpler and more interpretable models for prediction and for understanding data. However, high-dimensional data present unique challenges such as model over-fitting, computational intractability, and poor predictive performance. Feature selection is a dimensionality-reduction technique that helps to simplify learning, reduce cost, and improve data interpretability.

Existing approaches Over the years, feature selection methods have evolved from simple univariate ranking algorithms to more sophisticated relevance-redundancy trade-offs and, more recently, to interaction-based approaches. Univariate feature ranking (Lewis 1992) is a feature selection approach that ranks features based on relevance and ignores redundancy. As a result, when features are interdependent, the ranking approach leads to sub-optimal results (Brown et al. 2012). The redundancy-based approach improves on the ranking approach by considering both relevance and redundancy in the feature selection process. A wide variety of feature selection methods are based on the relevance-redundancy trade-off (Battiti 1994; Hall 2000; Yu and Liu 2004; Peng et al. 2005; Ding and Peng 2005; Senawi et al. 2017). Their goal is to find an optimal subset of features that produces maximum relevance and minimum redundancy.

Complementarity-based feature selection methods emerged as an alternative approach that accounts for feature complementarity in the selection process. Complementarity can be described as a phenomenon in which two features together provide more information about the target variable than the sum of their individual information (information synergy). Several complementarity-based methods have been proposed in the literature (Yang and Moody 1999, 2000; Zhao and Liu 2007; Meyer et al. 2008; Bontempi and Meyer 2010; Zeng et al. 2015; Chernbumroong et al. 2015). Yang and Moody (1999) and later Meyer et al. (2008) propose an interactive sequential feature selection method, known as joint mutual information (JMI), which selects a candidate feature that maximizes relevance and complementarity simultaneously. They conclude that the JMI approach provides the best trade-off in terms of accuracy, stability, and flexibility with small data samples.

Limitations of the existing methods Clearly, the redundancy-based approach is less efficient than the complementarity-based approach because it does not account for feature complementarity. However, the main criticism of the redundancy-based approach concerns how redundancy is formalized and measured. Correlation is the most common way to measure redundancy between features, which implicitly assumes that all correlated features are redundant. However, Guyon and Elisseeff (2003), Gheyas and Smith (2010) and Brown et al. (2012) show that this is an incorrect assumption: correlation implies neither redundancy nor the absence of complementarity. This is evident in Figs. 1 and 2, which present a 2-class classification problem (denoted by star and circle) with two continuous features \(X_{1}\) and \(X_{2}\). The projections on the axes denote the relevance of each respective feature. In Fig. 1, \(X_{1}\) and \(X_{2}\) are perfectly correlated, and indeed redundant: having both \(X_{1}\) and \(X_{2}\) leads to no significant improvement in class separation compared to having either \(X_{1}\) or \(X_{2}\). In Fig. 2, however, a perfect separation is achieved by \(X_{1}\) and \(X_{2}\) together, although they are (negatively) correlated within each class and have the same individual relevance as in Fig. 1. This shows that while generic dependency is undesirable, dependency that conveys class information is useful. Whether two interacting features are redundant or complementary depends on the relative magnitude of their class-conditional dependency in comparison to their unconditional dependency. The redundancy-based approach, however, focuses only on unconditional dependency and tries to minimize it without examining whether such dependency leads to information gain or loss.

Although the complementarity-based approach overcomes some of the drawbacks of the redundancy-based approach, it has limitations, too. Most of the existing complementarity-based feature selection methods either adopt a sequential feature selection approach or evaluate subsets of a given size. Sequential feature selection raises a few important issues. First, Zhang and Zhang (2012) show that sequential selection methods suffer from initial selection bias, i.e., the features selected in the earlier steps govern the acceptance or rejection of the subsequent features at each iteration, and not vice versa. Second, it requires a priori knowledge of the desired number of features to be selected or some stopping criterion, which is usually determined by expert information or technical considerations such as scalability or computation time. In many practical situations, it is difficult to obtain such prior knowledge. Moreover, in many cases, we are interested in finding an optimal feature subset that gives maximum predictive accuracy for a given task, and we are not really concerned about the ranking of the features or the size of the optimal subset. Third, most of the existing methods combine redundancy and complementarity and consider only their net or aggregate effect in the search process (note that complementarity has the opposite sign of redundancy). Bontempi and Meyer (2010) refer to this aggregate approach as ‘implicit’ consideration of complementarity.

Fig. 1 Perfectly correlated and redundant

Fig. 2 Features are negatively correlated within each class, yet provide perfect separation

Our approach In this paper, we propose a filter-based feature subset selection method based on relevance, redundancy, and complementarity. Unlike most of the existing methods, which focus on feature ranking or compare subsets of a given size, our goal is to select an optimal subset of features that predicts well. This is useful in many situations where no prior expert knowledge is available regarding the size of an optimal subset, or where the goal is simply to find an optimal subset. Using a multi-objective optimization (MOO) technique and an adaptive cost function, the proposed method aims to (1) maximize relevance, (2) minimize redundancy, and (3) maximize complementarity, while keeping the subset size as small as possible.

The term ‘adaptive’ implies that our proposed method adaptively determines the trade-off between relevance, redundancy, and complementarity based on subset properties. An adaptive approach helps to overcome the limitations of a fixed policy that fails to model the trade-off between competing objectives appropriately in a MOO problem. Such an adaptive approach is new to feature selection and essentially mimics a feedback mechanism in which the trade-off rule is a function of the objective values. The proposed cost function is also flexible in that it does not assume any particular functional form or rely on concavity assumptions, and it uses implicit utility maximization principles (Roy 1971; Rosenthal 1985).

Unlike some of the complementarity-based methods, which consider the net (aggregate) effect of redundancy and complementarity, we treat ‘redundancy minimization’ and ‘complementarity maximization’ as two separate objectives in the optimization process. This gives us the flexibility to apply different weights (preferences) to redundancy and complementarity and to control their relative importance adaptively during the search process. Using best-first search as the search strategy, the proposed heuristic offers a “best compromise” solution (more likely to avoid a local optimum due to the interactively determined gradient), if not the “best solution (in the sense of optimum)” (Saska 1968), which is sufficiently good in most practical scenarios. Using benchmark datasets, we show empirically that our adaptive heuristic not only outperforms many redundancy-based methods, but is also competitive with existing complementarity-based methods.

Structure of the paper The rest of the paper is organized as follows. Section 2 presents the information-theoretic definitions and the concepts of relevance, redundancy, and complementarity. In Sect. 3, we present the existing feature selection methods, and discuss their strengths and limitations. In Sect. 4, we describe the proposed heuristic, and its theoretical motivation. In this section, we also discuss the limitations of our heuristic, and carry out sensitivity analysis. Section 5 presents the algorithm for our proposed heuristic, and evaluates its time complexity. In Sect. 6, we assess the performance of the heuristic on two synthetic datasets. In Sect. 7, we validate our heuristic using real data sets, and present the experimental results. In Sect. 8, we summarize and conclude.

2 Information theory: definitions and concepts

First, we provide the necessary definitions in information theory (Cover and Thomas 2006) and then discuss the existing notions of relevance, redundancy, and complementarity.

2.1 Definitions

Suppose X and Y are discrete random variables with finite state spaces \({\mathscr {X}}\) and \({\mathscr {Y}}\), respectively. Let \(p_{X, Y}\) denote the joint probability mass function (PMF) of X and Y, with marginal PMFs \(p_X\) and \(p_Y\).

Definition 1

(Entropy) Entropy of X, denoted by H(X), is defined as follows: \(H(X)= - \sum \limits _{x \in {\mathscr {X}}}^{} p_{X}(x) \log (p_{X}(x))\). Entropy is a measure of uncertainty in the PMF \(p_{X}\) of X.

Definition 2

(Joint entropy) Joint entropy of X and Y, denoted by H(X, Y), is defined as follows: \(H(X,Y)= - \sum \limits _{x \in {\mathscr {X}}}^{} \sum \limits _{y \in {\mathscr {Y}}}^{} p_{X,Y}(x,y) \log (p_{X,Y}(x,y))\). Joint entropy is a measure of uncertainty in the joint PMF \(p_{X,Y}\) of X and Y.

Definition 3

(Conditional entropy) Conditional entropy of X given Y, denoted by H(X|Y), is defined as follows: \(H(X|Y) = - \sum \limits _{x \in {\mathscr {X}}}^{} \sum \limits _{y \in {\mathscr {Y}}}^{} p_{X,Y}(x,y) \log (p_{X|y}(x))\), where \( p_{X|y}(x)\) is the conditional PMF of X given \(Y=y\). Conditional entropy H(X|Y) measures the remaining uncertainty in X given the knowledge of Y.

Definition 4

(Mutual information (MI)) Mutual information between X and Y, denoted by I(X; Y), is defined as follows: \(I(X;Y)= H(X)-H(X|Y) =H(Y)-H(Y|X)\). Mutual information measures the amount of dependence between X and Y. It is non-negative, symmetric, and is equal to zero iff X and Y are independent.

Definition 5

(Conditional mutual information) Conditional mutual information between X and Y given another discrete random variable Z, denoted by I(X; Y|Z), is defined as follows: \(I(X;Y| Z)=H(X| Z) - H(X| Y,Z) = H(Y| Z) - H(Y| X,Z)\). It measures the conditional dependence between X and Y given Z.

Definition 6

(Interaction information) Interaction information (McGill 1954; Matsuda 2000; Yeung 1991) among X, Y, and Z, denoted by I(X; Y; Z), is defined as follows: \( I(X;Y;Z)=I(X;Y) -I(X;Y|Z)\). Interaction information measures the change in the degree of association between two random variables when one of the interacting variables is held constant. It can be positive, negative, or zero depending on the relative magnitude of I(X; Y) and I(X; Y|Z). Interaction information is symmetric (order independent). More generally, the interaction information among a set of n random variables \( \mathbf{X } = \{X_{1},X_{2},\ldots ,X_{n}\}\) is given by \(I(X_{1};X_{2};\ldots ;X_{n}) = -\sum \limits _{\mathbf{S } \in \mathbf{X }'}^{} (-1)^{|\mathbf{S }|} H(\mathbf{S })\), where \(\mathbf{X }'\) is the power set of \(\mathbf{X }\) and \(\sum \) denotes the sum over all subsets \(\mathbf{S }\) in \(\mathbf{X }'\) (Abramson 1963). If it is zero, we say that the features do not interact ‘altogether’ (Kojadinovic 2005).

Definition 7

(Multivariate mutual information) Multivariate mutual information (Kojadinovic 2005; Matsuda 2000) between a set of n features \(\mathbf{X } = \{X_{1},X_{2},\ldots ,X_{n}\}\) and Y, denoted by \(I(\mathbf{X };Y)\), is defined as follows: \(I(\mathbf{X };Y)= \sum \limits _{i}^{}I(X_{i};Y) - \sum \limits _{i < j}^{} I(X_{i}; X_{j};Y) + \dots +(-1)^ {n+1} \,I(X_{1};\dots ;X_{n};Y) \). This is the Möbius representation of multivariate mutual information based on set theory. Multivariate mutual information measures the information that \(\mathbf{X }\) contains about Y and can be seen as a series of alternating inclusions and exclusions of higher-order terms that represent the simultaneous interaction of several variables.

Definition 8

(Symmetric uncertainty) Symmetric uncertainty (Witten et al. 2016) between X and Y, denoted by SU(X, Y), is defined as follows: \(SU(X,Y) = \frac{2\,I(X;Y)}{H(X)+H(Y)}\). Symmetric uncertainty is a normalized version of MI that lies in the range [0, 1]; it compensates for MI’s bias towards features with more values.

Definition 9

(Conditional symmetric uncertainty) Conditional symmetric uncertainty between X and Y given Z, denoted by SU(X, Y|Z), is defined as follows: \(SU(X,Y|Z) = \frac{2\,I(X;Y|Z)}{H(X|Z)+ H(Y|Z)}\). SU(X, Y|Z) is a normalized version of the conditional mutual information I(X; Y|Z).
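For concreteness, all of the quantities above can be computed from discrete samples with simple plug-in (empirical) estimators. The following Python sketch is purely illustrative: the function names are ours, logarithms are base 2, and the estimators mirror Definitions 1–8 directly rather than any particular package.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Empirical entropy H(X) of a discrete sample, in bits (Definition 1)."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def joint(*cols):
    """Represent several discrete columns as one column of value tuples."""
    return list(zip(*cols))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) (Definition 4)."""
    return entropy(x) + entropy(y) - entropy(joint(x, y))

def conditional_mi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) (Definition 5)."""
    return (entropy(joint(x, z)) + entropy(joint(y, z))
            - entropy(joint(x, y, z)) - entropy(z))

def interaction_information(x, y, z):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z) (Definition 6):
    positive values indicate redundancy, negative values complementarity."""
    return mutual_information(x, y) - conditional_mi(x, y, z)

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)) (Definition 8), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(x, y) / (hx + hy)
```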

From Definitions 8 and 9, the symmetric uncertainty equivalent of interaction information can be expressed as follows: \(SU(X,Y,Z)=SU(X,Y)-SU(X,Y|Z)\). Using the above notation, we can formulate the feature selection problem as follows: given a set of n features \(\mathbf{F } =\{F_{i}\}_{i \in \{1,\ldots ,\, n\}}\), the goal of feature selection is to select a subset of features \(\mathbf{F }_{S}= \{F_{i}: i \in S\}\), \(S \subseteq \{1,2,\ldots , n\}\), such that \(\mathbf{F }_{S} = \arg \max \limits _{S} \, I(\mathbf{F }_{S};Y)\), where \(I(\mathbf{F }_{S};Y)\) denotes the mutual information between \(\mathbf{F }_{S}\) and the class variable Y. For tractability reasons, and unless there is strong evidence for the existence of higher-order interaction, the correction terms beyond 3-way interaction are generally ignored in the estimation of multivariate mutual information. In this paper, we will use

$$\begin{aligned} I(\mathbf{F }_{S};Y) \approx \sum \limits _{i \in S}^{}I(F_{i};Y)- \sum \limits _{i,j \in S , i < j }^{} I(F_{i}; F_{j};Y) \end{aligned}$$
(1)

where \(I(F_{i}; F_{j}; Y)\) is the 3-way interaction term between \(F_{i}\), \(F_{j}\), and Y. The proof of Eq. 1 for the 3-variable case can be shown easily using the Venn diagram in Fig. 3; the n-variable case follows by recursive application of the 3-variable case.

$$\begin{aligned} \begin{aligned} I(F_{1},F_{2};Y)&= I(F_{1};Y) + I(F_{2};Y|F_{1})\\&= I(F_{1};Y) + I(F_{2};Y)- I(F_{1};F_{2};Y) \end{aligned} \end{aligned}$$
(2)
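As an illustration of Eq. 1, the pairwise approximation of \(I(\mathbf{F }_{S};Y)\) can be coded directly on top of the estimators sketched in Sect. 2.1 (a minimal sketch; `features` is assumed to be a list of discrete feature columns):

```python
from itertools import combinations

def subset_information(features, y):
    """Approximate I(F_S; Y) by Eq. (1): the sum of individual relevances
    minus the sum of all 3-way interaction terms I(F_i; F_j; Y)."""
    relevance = sum(mutual_information(f, y) for f in features)
    interaction = sum(interaction_information(fi, fj, y)
                      for fi, fj in combinations(features, 2))
    return relevance - interaction
```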
Fig. 3 Venn diagram showing the interaction between features \(F_{1}\), \(F_{2}\), and class Y

Fig. 4 \(X_{2}\) is individually irrelevant but improves the separability of \(X_{1}\)

Fig. 5 Both individually irrelevant features become relevant together

2.2 Relevance

Relevance of a feature signifies its explanatory power, and is a measure of feature worthiness. A feature can be relevant individually or together with other variables if it carries information about the class Y. It is also possible that an individually irrelevant feature becomes relevant, or a relevant feature becomes irrelevant, when other features are present. This can be shown using Figs. 4 and 5, which present a 2-class classification problem (denoted by star and circle) with two continuous features \(X_{1}\) and \(X_{2}\). The projections of the classes on each axis denote each feature’s individual relevance. In Fig. 4, \(X_{2}\), which is individually irrelevant (uninformative), becomes relevant in the presence of \(X_{1}\), and together they improve the class separation that is otherwise achievable by \(X_{1}\) alone. In Fig. 5, both \(X_{1}\) and \(X_{2}\) are individually irrelevant, yet together they provide a perfect separation (the “chessboard problem,” analogous to the XOR problem). Thus, the relevance of a feature is context-dependent (Guyon and Elisseeff 2003).

Using the information-theoretic framework, a feature \(F_{i}\) is said to be unconditionally relevant to the class variable Y if \(I(F_{i};Y) > 0\), and irrelevant if \(I(F_{i};Y) = 0\). When evaluated in the context of other features, we say \(F_{i}\) is conditionally relevant if \(I(F_{i};Y|F_{S-i})> 0\), where \(F_{S-i}= F_{S}\setminus F_{i}\). There are several other probabilistic definitions of relevance available in the literature. Most notably, Kohavi and John (1997) formalize relevance in terms of an optimal Bayes classifier and propose two degrees of relevance: strong and weak. Strongly relevant features are those that bring unique information about the class variable and cannot be replaced by other features. Weakly relevant features are relevant but not unique, in the sense that they can be replaced by other features. An irrelevant feature is one that is neither strongly nor weakly relevant.

2.3 Redundancy

The concept of redundancy is associated with the degree of dependency between two or more features. Two variables are said to be redundant if they share common information about each other. This is the general dependency measured by \(I(F_{i}; F_{j})\). McGill (1954) and Jakulin and Bratko (2003) formalize this notion of redundancy as multi-information or total correlation. The multi-information between a set of n features \(\{F_{1},\ldots ,F_{n}\}\) is given by \(R(F_{1},\dots ,F_{n}) = \sum \limits _{i=1}^{n} H(F_{i}) - H(F_{1},\dots ,F_{n})\). For \(n=2\), \(R(F_{1},F_{2})= H(F_{1}) + H(F_{2})- H(F_{1},F_{2})= I(F_{1};F_{2})\). This measure of redundancy is non-linear, non-negative, and non-decreasing with the number of features. In the context of feature selection, it is often of greater interest to know whether two features are redundant with respect to the class variable than whether they are mutually redundant. Two features \(F_{i}\) and \(F_{j}\) are said to be redundant with respect to the class variable Y if \(I(F_{i}, F_{j}; Y) < I(F_{i}; Y) + I(F_{j}; Y)\). From Eq. 2, it follows that \(I(F_{i}; F_{j}; Y)>0\), or equivalently \(I(F_{i};F_{j}) > I(F_{i};F_{j}| Y)\). Thus two features are redundant with respect to the class variable if their unconditional dependency exceeds their class-conditional dependency.

2.4 Complementarity

Complementarity, also known as information synergy, is the beneficial effect of feature interaction in which two features together provide more information than the sum of their individual information. Two features \(F_{i}\) and \(F_{j}\) are said to be complementary with respect to the class variable Y if \(I(F_{i},F_{j}; Y) > I(F_{i}; Y) +I(F_{j}; Y)\), or equivalently, \(I(F_{i};F_{j}) < I(F_{i}; F_{j}| Y)\). Complementarity is negative interaction information. While generic dependency is undesirable, dependency that conveys class information is useful. Different researchers have explained complementarity from different perspectives. Vergara and Estévez (2014) define complementarity in terms of the degree of interaction between an individual feature \(F_{i}\) and the selected feature subset \(F_{S}\) given the class Y, i.e., \(I(F_{i},F_{S}|Y)\). Brown et al. (2012) provide a similar definition but call it conditional redundancy. They reach a conclusion similar to that of Guyon and Elisseeff (2003): ‘the inclusion of the correlated features can be useful, provided the correlation within the class is stronger than the overall correlation.’
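The sign test implied by these definitions is easy to operationalize. A minimal sketch, reusing `interaction_information` from Sect. 2.1 (the function name and tolerance are ours):

```python
def classify_pair(fi, fj, y, tol=1e-12):
    """Label a feature pair with respect to the class Y by the sign of
    I(F_i;F_j;Y) = I(F_i;F_j) - I(F_i;F_j|Y)."""
    ii = interaction_information(fi, fj, y)
    if ii > tol:
        return "redundant"       # unconditional dependency exceeds class-conditional
    if ii < -tol:
        return "complementary"   # class-conditional dependency exceeds unconditional
    return "non-interacting"
```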

3 Related literature

In this section, we review filter-based feature selection methods that use information gain as a measure of dependence. In terms of evaluation strategy, filter-based methods can be broadly classified into (1) redundancy-based approaches and (2) complementarity-based approaches, depending on whether or not they account for feature complementarity in the selection process. Brown et al. (2012), however, show that both approaches can be subsumed under a more general, unifying theoretical framework known as conditional likelihood maximization.

3.1 Redundancy-based methods

Most feature selection algorithms in the 1990s and early 2000s focus on relevance and redundancy to obtain the optimal subset. Notable amongst them are (1) mutual information based feature selection (MIFS) (Battiti 1994), (2) correlation based feature selection (CFS) (Hall 2000), (3) minimum redundancy maximum relevance (mRMR) (Peng et al. 2005), (4) fast correlation based filter (FCBF) (Yu and Liu 2003), (5) ReliefF (Kononenko 1994), and (6) conditional mutual information maximization (CMIM) (Fleuret 2004; Wang and Lochovsky 2004). With some variation, their main goal is to maximize relevance and minimize redundancy. Of these methods, MIFS, FCBF, ReliefF, and CMIM are essentially feature ranking algorithms. They rank the features based on some information maximization criterion (Duch 2006) and select the top k features, where k is decided a priori based on expert knowledge or technical considerations.

MIFS is a sequential feature selection algorithm, in which a candidate feature \(F_{i}\) is selected that maximizes the conditional mutual information \(I(F_{i}; Y|F_{S})\). Battiti (1994) approximates this MI by \(I(F_{i}; Y|F_{S}) = I(F_{i}; Y)- \beta \sum _{F_{j} \in F_{S}}^{} I(F_{i};F_{j})\), where \(F_{S}\) is the already selected feature subset, and \(\beta \in [0,1]\) is a user-defined parameter that controls redundancy. For \(\beta = 0\), it reduces to a ranking algorithm. Battiti (1994) finds that \(\beta \in [0.5,1]\) is appropriate for many classification tasks. Kwak and Choi (2002) show that when \(\beta =1\), MIFS penalizes redundancy too strongly and, for this reason, does not work well for non-linear dependence.

CMIM implements an idea similar to MIFS, but differs in the way \(I(F_{i}; Y |F_{S})\) is estimated. CMIM selects the candidate feature \(F_{i}\) that maximizes \(\min \limits _{F_{j}\in F_{S}} \mathrm I(F_{i}; Y |F_{j})\). Both MIFS and CMIM are incremental forward search methods, and they suffer from initial selection bias (Zhang and Zhang 2012). For example, if \(\{F_{1},F_{2}\}\) is the selected subset and \(\{F_{3},F_{4}\}\) is the candidate subset, CMIM selects \(F_{3}\) if \(I(F_{3};Y |\{F_{1},F_{2}\}) > I(F_{4};Y |\{F_{1},F_{2}\})\), and the new optimal subset becomes \(\{F_{1},F_{2},F_{3}\}\). The incremental search only evaluates the redundancy between the candidate feature \(F_{3}\) and \(\{F_{1},F_{2}\}\), i.e., \(I(F_{3};Y |\{F_{1},F_{2}\})\), and never considers the redundancy between \(F_{1}\) and \(\{F_{2},F_{3}\}\), i.e., \(I(F_{1};Y |\{F_{2},F_{3}\})\).

CFS and mRMR are both subset selection algorithms, which evaluate a subset of features using an implicit cost function that simultaneously maximizes relevance and minimizes redundancy. CFS evaluates a subset of features based on pairwise correlation measures, in which correlation is used as a generic measure of dependence. CFS uses the following heuristic to evaluate a subset of features: \(merit (S)= \frac{k\, {\bar{r}}_{cf}}{\sqrt{k + k\,(k-1)\,{\bar{r}}_{ff}}}\), where k denotes the subset size, \({\bar{r}}_{cf}\) denotes the average feature-class correlation, and \({\bar{r}}_{ff}\) denotes the average feature-feature correlation of features in the subset. The feature-feature correlation is used as a measure of redundancy, and the feature-class correlation as a measure of relevance. The goal of CFS is to find a subset of mutually uncorrelated features that are predictive of the class. CFS ignores feature complementarity, and cannot identify strongly interacting features such as those in the parity problem (Hall and Holmes 2003).

mRMR is very similar to CFS in principle; however, instead of correlation measures, mRMR uses the mutual information \(I(F_{i}; Y)\) as a measure of relevance and \(I(F_{i}; F_{j})\) as a measure of redundancy. mRMR uses the following heuristic to evaluate a subset of features: \(score(S) = \frac{\sum _{i \in S} I(F_{i}; Y)}{k}-\frac{\sum \nolimits _{i,j \in S} I(F_{i};F_{j})}{k^2}\). The mRMR method suffers from limitations similar to those of CFS. Gao et al. (2016) show that the approximations made by information-theoretic methods such as mRMR and CMIM are based on unrealistic assumptions; they introduce a novel set of assumptions based on variational distributions and derive novel algorithms with competitive performance.
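For reference, the two subset scores can be written down in a few lines. The sketch below is illustrative only: `rel[i]` is assumed to hold the feature-class association of feature i (correlation or SU for CFS, \(I(F_{i};Y)\) for mRMR) and `pair[i][j]` the corresponding feature-feature association; the diagonal terms are omitted from the mRMR double sum.

```python
import numpy as np
from itertools import combinations

def cfs_merit(rel, pair, subset):
    """CFS merit (Hall 2000): k * mean feature-class association over
    sqrt(k + k(k-1) * mean feature-feature association)."""
    k = len(subset)
    r_cf = np.mean([rel[i] for i in subset])
    r_ff = np.mean([pair[i][j] for i, j in combinations(subset, 2)]) if k > 1 else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def mrmr_score(rel, pair, subset):
    """mRMR score (Peng et al. 2005): mean relevance minus mean pairwise redundancy."""
    k = len(subset)
    relevance = sum(rel[i] for i in subset) / k
    redundancy = sum(pair[i][j] for i in subset for j in subset if i != j) / (k ** 2)
    return relevance - redundancy
```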

FCBF follows a 2-step process. In step 1, it ranks all features based on the symmetric uncertainty between each feature and the class variable, i.e., \(SU(F_{i},Y)\), and selects the relevant features that exceed a given threshold value \(\delta \). In step 2, it finds the optimal subset by eliminating redundant features from the relevant features selected in step 1, using an approximate Markov blanket criterion. In essence, it decouples the relevance and redundancy analysis, and circumvents the concurrent subset search and subset evaluation process. Unlike CFS and mRMR, FCBF is computationally fast, simple, and fairly easy to implement due to the sequential 2-step process. However, this method fails to capture situations where feature dependencies appear only conditionally on the class variable (Fleuret 2004). Zhang and Zhang (2012) state that FCBF suffers from instability, as its naive heuristic may be unsuitable in many situations. One of the drawbacks of FCBF is that it rules out the possibility of an irrelevant feature becoming relevant due to interaction with other features (Guyon and Elisseeff 2003). CMIM, which simultaneously evaluates relevance and redundancy at every iteration, overcomes this limitation.

Relief (Kira and Rendell 1992), and its multi-class version ReliefF (Kononenko 1994), are instance-based feature ranking algorithms that rank each feature based on its similarity with k nearest neighbors from the same and opposite classes, selected randomly from the dataset. The underlying principle is that a useful feature should have the same value for instances from the same class and different values for instances from different classes. In this method, m instances are randomly selected from the training data, and for each of these m instances, k nearest neighbors are chosen from the same and the opposite class. The feature values of the nearest neighbors are compared with those of the sampled instance, and the score for each feature is updated. A feature receives a higher weight if it has the same value as instances from the same class and different values from instances of other classes. In Relief, the score or weight of each feature is measured by the Euclidean distance between the sampled instance and the nearest neighbor, which reflects the feature’s ability to discriminate between classes.

The consistency-based method (Almuallim and Dietterich 1991; Liu and Setiono 1996; Dash and Liu 2003) is another approach, which uses a consistency measure as the performance metric. A feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels. The inconsistency rate of a dataset is the number of inconsistent instances divided by the total number of instances. This approach aims to find a subset of minimal size whose inconsistency rate is equal to that of the original dataset. Liu and Setiono (1996) propose the following heuristic: \(Consistency (S) = 1- \frac{\sum \nolimits _{i=1}^{m} (|D_{i}|-|M_{i}|)}{N}\), where m is the number of distinct combinations of feature values for subset S, \(|D_{i}|\) is the number of instances of the i-th feature value combination, \(|M_{i}|\) is the cardinality of the majority class of the i-th feature value combination, and N is the total number of instances in the dataset.
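A minimal sketch of this consistency measure (assuming `data` is a list of discrete feature vectors, `labels` the class labels, and `subset` a list of column indices; the function name is ours):

```python
from collections import Counter, defaultdict

def consistency(data, labels, subset):
    """Consistency(S) = 1 - sum_i (|D_i| - |M_i|) / N, where D_i groups the
    instances by their value combination on `subset` and M_i is the majority
    class within D_i (Liu and Setiono 1996)."""
    groups = defaultdict(list)
    for row, y in zip(data, labels):
        groups[tuple(row[i] for i in subset)].append(y)
    inconsistent = sum(len(ys) - Counter(ys).most_common(1)[0][1]
                       for ys in groups.values())
    return 1.0 - inconsistent / len(labels)
```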

The Markov blanket (MB) filter (Koller and Sahami 1996) provides another useful technique for variable selection. The MB filter works on the principle of conditional independence and excludes a feature only if the MB of the feature is within the set of remaining features. Though the MB framework based on information theory is theoretically optimal in eliminating irrelevant and redundant features, it is computationally intractable. Incremental association Markov blanket (IAMB) (Tsamardinos et al. 2003) and Fast-IAMB (Yaramakala and Margaritis 2005) are two MB-based algorithms that use conditional mutual information as the metric for the conditional independence test. They address the drawback of CMIM by performing redundancy checks during both the ‘growing’ (forward) and the ‘shrinkage’ (backward) phases.

3.2 Complementarity-based methods

Complementarity-based feature selection methods that simultaneously optimize redundancy and complementarity are relatively few, despite the earliest research on feature interaction dating back to McGill (1954), subsequently advanced by Yeung (1991), Jakulin and Bratko (2003), Jakulin and Bratko (2004) and Guyon and Elisseeff (2003). The feature selection methods that consider feature complementarity include double input symmetrical relevance (DISR) (Meyer et al. 2008), redundancy complementariness dispersion based feature selection (RCDFS) (Chen et al. 2015), INTERACT (Zhao and Liu 2007), interaction weight based feature selection (IWFS) (Zeng et al. 2015), maximum relevance maximum complementary (MRMC) (Chernbumroong et al. 2015), joint mutual information (JMI) (Yang and Moody 1999; Meyer et al. 2008), and min-Interaction Max-Relevancy (mIMR) (Bontempi and Meyer 2010).

The goal of DISR is to find the best subset of a given size d, where d is assumed to be known a priori. It considers complementarity ‘implicitly’ (Bontempi and Meyer 2010), meaning it considers the net effect of redundancy and complementarity in the search process. As a result, DISR does not distinguish between two subsets \(S_{1}\) and \(S_{2}\), where \(S_{1}\) has information gain 0.9 and information loss 0.1, and \(S_{2}\) has information gain 0.8 and information loss 0; in other words, information loss and information gain are treated equally. DISR works on the principle of the k-average sub-subset information criterion, which is shown to be a good approximation of the information of a set of features. Meyer et al. (2008) show that the mutual information between a subset \(\mathbf{F }_{S}\) of d features and the class variable Y is lower bounded by the average information of its subsets. In notation, \( \frac{1}{k!{{d}\atopwithdelims (){k}}} \sum \limits _{V \subseteq S:|V|=k} I(\mathbf{F }_{V};Y) \le I(\mathbf{F }_{S};Y)\), where k is the size of the sub-subsets, chosen such that there are no complementarities of order greater than k. Using \(k=2\), DISR recursively decomposes each bigger subset \((d > 2)\) into subsets containing 2 features \(F_{i}\) and \(F_{j}\, (i \ne j)\), and chooses a subset \(\mathbf{F }_{S}\) such that \(\mathbf{F }_{S} = \arg \max \limits _{S} {} \sum \nolimits _{i,j,\, i < j }^{} I (F_{i},F_{j};Y) / {{d}\atopwithdelims (){2}}\). An implementation of this heuristic, known as MASSIVE, is also proposed.

The mIMR method is another variation of DISR: (1) it first removes all features that have zero mutual information with the class, and (2) it decomposes the multivariate term in DISR into a linear combination of relevance and interaction terms. mIMR considers causal discovery in the selection process and restricts the selection to variables that have both positive relevance and negative interaction. Both DISR and mIMR belong to the joint mutual information (JMI) framework initially proposed by Yang and Moody (1999). JMI provides a sequential feature selection method in which the JMI score of the incoming feature \(F_{k} \) is given by \(J_{jmi} (F_{k}) = \sum \nolimits _{F_{i}\in \mathbf{F }_{S}} I(F_{k},F_{i};Y)\), where \(\mathbf{F }_{S}\) is the already selected subset. This is the information between the class and the joint random variable \((F_{k},F_{i})\) defined by pairing the candidate \(F_{k}\) with each previously selected feature.
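A minimal sketch of the JMI score, reusing the `joint` and `mutual_information` helpers from Sect. 2.1 (the candidate and the selected features are assumed to be discrete columns):

```python
def jmi_score(candidate, selected, y):
    """JMI criterion (Yang and Moody 1999): sum over the already selected
    features of I(F_k, F_i; Y), with (F_k, F_i) treated as a joint variable."""
    return sum(mutual_information(joint(candidate, fi), y) for fi in selected)
```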

In RCDFS, Chen et al. (2015) suggest that ignoring higher-order feature dependence may lead to false positives (FPs) (actually redundant features misidentified as relevant due to pairwise approximation) being selected into the optimal subset, which may impair the selection of subsequent features. The degree of interference depends on the number of FPs present in the already selected subset and their degree of correlation with the incoming candidate feature. The selection is misguided only when the true positives (TPs) and FPs have opposing influences on the candidate feature. For instance, if the candidate feature is redundant to the FPs but complementary to the TPs, then the new feature will be discouraged from selection when it should ideally be selected, and vice versa. They estimate the interaction information (complementarity or redundancy) of the candidate feature with each of the already selected features, propose to measure this noise by the standard deviation (dispersion) of these interaction effects, and minimize it. The smaller the dispersion, the less influential is the interference effect of the false positives.

One limitation of RCDFS is the assumption that all TPs in the already selected subset exhibit a similar type of association, i.e., either all are complementary to, or all are redundant with, the candidate feature (see Figure 1 in Chen et al. 2015). This is a strong assumption and need not be true; in fact, it is more likely that different dispersion patterns will be observed. In such cases, the proposed method fails to differentiate between the ‘good influence’ (due to TPs) and the ‘bad influence’ (due to FPs), and is therefore ineffective in mitigating the interference effect of FPs in the feature selection process.

Zeng et al. (2015) propose a complementarity-based ranking algorithm, IWFS. Their method is based on interaction weight factors, which reflect whether a feature is redundant or complementary. The interaction weight for a feature is updated at each iteration, and a feature is selected if its interaction weight exceeds a given threshold. Another complementarity-based method, INTERACT, uses a feature sorting metric based on data consistency. The c-contribution of a feature is estimated based on its inconsistency rate; a feature is removed if its c-contribution is less than a given threshold, and retained otherwise. This method is computationally intensive and has worst-case time complexity \(O(N^2M)\), where N is the number of instances and M is the number of features. The MRMC method is a neural-network-based feature selection approach that uses relevance and complementarity scores, which are estimated based on how a feature influences or complements the network.

Brown et al. (2012) propose a space of potential criteria that encompasses several redundancy- and complementarity-based methods. They propose that the worth of a candidate feature \(F_{k}\), given the already selected subset \(\mathbf{F }_{S}\), can be represented as \(J(F_{k}) = I(F_{k};Y)- \beta \sum \nolimits _{F_{i}\in \mathbf{F }_{S}} I(F_{k};F_{i}) + \gamma \sum \nolimits _{F_{i}\in \mathbf{F }_{S}} I(F_{k};F_{i}|Y)\). Different values of \(\beta \) and \(\gamma \) lead to different feature selection methods. For example, \(\gamma =0\) yields MIFS, \(\beta = \gamma = \frac{1}{|S|}\) yields JMI, and \(\gamma =0\) with \(\beta = \frac{1}{|S|}\) yields mRMR.
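This unifying criterion is straightforward to express in code. A sketch, again reusing the estimators from Sect. 2.1 (the function name is ours):

```python
def unified_criterion(candidate, selected, y, beta, gamma):
    """J(F_k) = I(F_k;Y) - beta * sum_i I(F_k;F_i) + gamma * sum_i I(F_k;F_i|Y)
    (Brown et al. 2012). gamma=0 gives MIFS; beta=gamma=1/|S| gives JMI;
    gamma=0 with beta=1/|S| gives mRMR."""
    relevance = mutual_information(candidate, y)
    redundancy = sum(mutual_information(candidate, fi) for fi in selected)
    conditional = sum(conditional_mi(candidate, fi, y) for fi in selected)
    return relevance - beta * redundancy + gamma * conditional
```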

4 Motivation and the proposed heuristic

In this section, we first outline the motivation behind using redundancy and complementarity ‘explicitly’ in the search process and behind the use of an implicit utility function approach. Then, we propose a heuristic, called self-adaptive feature evaluation (SAFE). SAFE is motivated by the implicit utility function approach in multi-objective optimization. The implicit utility function approach belongs to the family of interactive optimization methods (Roy 1971; Rosenthal 1985), which combine the search process with the decision maker’s relative preference over multiple objectives. In interactive methods, decision making and optimization occur simultaneously.

4.1 Implicit versus explicit measurement of complementarity

Combining negative (complementarity) and positive (redundancy) interaction information may produce inconsistent results when the goal is to find an optimal feature subset. In this section, we demonstrate this using the ‘Golf’ dataset presented in Table 1. The dataset has four features \(\{F_{1},\ldots ,F_{4}\}\). The information content \(I(\mathbf{F }_{S};Y)\) of each possible subset \(\mathbf{F }_{S}\) is estimated (1) using Eq. (1), and (2) using the aggregate approach. In the aggregate approach, to compute \(I(\mathbf{F }_{S};Y)\), we take the average of the mutual information of each two-feature sub-subset of \(\mathbf{F }_{S}\) with the class variable, i.e., \(I(F_{i},F_{j};Y)\). For example, \(I(F_{1},F_{2},F_{3};Y)\) is approximated by the average of \(I(F_{1},F_{2};Y)\), \(I(F_{1},F_{3};Y)\), and \(I(F_{2},F_{3};Y)\). Table 2 presents the results. Mutual information is computed using the infotheo package in R with the empirical entropy estimator.

Table 1 Golf dataset
Table 2 Mutual information between a subset \(F_{S}\) and the class Y

Our goal is to find the optimal subset regardless of subset size. Clearly, in this example, \(\{F_{1},F_{2},F_{3},F_{4}\}\) is the optimal subset that has maximum information about the class variable. However, using the aggregate approach, \(\{F_{1},F_{3},F_{4}\}\) is the best subset. Moreover, in the aggregate approach, one would assign a higher rank to the subset \(\{F_{1},F_{2},F_{3}\}\) than to \(\{F_{1},F_{2},F_{4}\}\), although the latter subset has higher information content than the former.

4.2 A new adaptive heuristic

We first introduce the following notations for our heuristic and then define the adaptive cost function.

Subset Relevance Given a subset S, subset relevance, denoted by \(A_{S}\), is defined as the sum of the mutual information between each feature and the class variable, i.e., \(A_{S} = \sum \nolimits _{i \in S } I(F_{i};Y)\). \(A_{S}\) measures the predictive ability of the individual features acting alone.

Subset Redundancy Given a subset S, subset redundancy, denoted by \(R_{S}\), is defined as the sum of all positive 3-way interactions in the subset, i.e., \(R_{S} = \sum \nolimits _{i,j \in S , i < j }^{} (I(F_{i}; F_{j})-I(F_{i}; F_{j}| Y)) \; \forall \; (i,j)\) such that \(I(F_{i}; F_{j})>I(F_{i}; F_{j} | Y)\). \(R_{S}\) measures the information loss due to feature redundancy.

Subset Complementarity Given a subset S, subset complementarity, denoted by \(C_{S}\), is defined as the absolute value of the sum of all negative 3-way interactions in the subset, i.e., \(C_{S} = \sum \nolimits _{i,j \in S , i < j}^{} (I(F_{i}; F_{j} | Y)- I(F_{i}; F_{j})) \, \forall \;(i, j)\) such that \(I(F_{i}; F_{j}) < I(F_{i}; F_{j} | Y)\). \(C_{S}\) measures the information gain due to feature complementarity.

Subset Dependence Given a subset S, subset dependence, denoted by \(D_{S}\), is defined as the sum of the mutual information between each pair of features, i.e., \( D_{S} = \sum \nolimits _{i,j \in S , i < j} I(F_{i}; F_{j})\). We use \(D_{S}\) as a measure of unconditional feature redundancy in our heuristic. This is the same as the unconditional mutual information between features described as redundancy in the literature (Battiti 1994; Hall 2000; Peng et al. 2005). We call it subset dependence to distinguish it from the information loss due to redundancy (\(R_{S}\)), which is measured by the difference between the unconditional and the class-conditional mutual information. Below, we present the proposed heuristic.

$$\begin{aligned} Score(S) = \frac{A_{S} + \gamma \; C_{S}^{\frac{\beta }{| S |}}}{\sqrt{| S | + \beta \; D_{S}}} \end{aligned}$$
(3)

where \(A_{S}\), \(C_{S}\), and \(D_{S}\) are subset relevance, subset complementarity, and subset dependence, respectively, and |S| denotes the subset size. \(\beta \) and \(\gamma \) are hyper-parameters defined as follows: \( \alpha = \frac{R_{S}}{R_{S}+C_{S}}\), \(\beta = (1+ \alpha )\), \( \xi = \frac{C_{S}}{C_{S}+A_{S}}\), and \( \gamma = (1 - \xi )\), where \(R_{S}\) is subset redundancy as defined in Sect. 4.2. As mentioned above, the heuristic characterizes an adaptive objective function, which evolves depending on the level of feature interaction. We model this adaptation using the two hyper-parameters \(\beta \) and \(\gamma \), whose values are computed by the heuristic during the search process based on the relative values of relevance, redundancy, and complementarity.

The ratio \( \alpha \in [0,1 ]\) measures the percentage of redundancy in the subset, which determines whether the subset is predominantly redundant or complementary. If \(\alpha = 1\), all interacting feature pairs are redundant, and we call the subset predominantly redundant. At the other extreme, if \(\alpha = 0\), all interacting feature pairs are complementary, and we call the subset predominantly complementary. We set \(\alpha = 0/0 = 0\) for a fully independent subset of features, which is, however, rarely the case. The hyper-parameter \(\beta \) controls the trade-off between redundancy and complementarity based on the value of \(\alpha \), which is a function of subset characteristics. We consider \(\beta \) to be a linear function of \(\alpha \) such that the penalty for unconditional dependency increases linearly to twice its value when the subset is fully redundant. This resembles the heuristic of CFS (Hall 2000) when \(\alpha = 1\). The |S| in the denominator allows the heuristic to favor smaller subsets, while the square root in the denominator allows the penalty term to grow non-linearly with increasing subset size and feature dependency.
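Putting the pieces together, Eq. 3 and its hyper-parameters can be computed from pre-estimated pairwise quantities. The sketch below is illustrative: `su_rel[i]` is assumed to hold \(SU(F_{i},Y)\), `su_pair[i][j]` holds \(SU(F_{i},F_{j})\), and `su_pair_y[i][j]` holds \(SU(F_{i},F_{j}|Y)\) (the symmetric-uncertainty transforms of Steps 4–5 in Sect. 5); the variable names are ours.

```python
import numpy as np
from itertools import combinations

def safe_score(subset, su_rel, su_pair, su_pair_y):
    """SAFE score of a non-empty candidate subset, Eq. (3)."""
    A = sum(su_rel[i] for i in subset)                          # subset relevance
    D = sum(su_pair[i][j] for i, j in combinations(subset, 2))  # subset dependence
    R = C = 0.0
    for i, j in combinations(subset, 2):
        interaction = su_pair[i][j] - su_pair_y[i][j]           # SU form of I(F_i;F_j;Y)
        if interaction > 0:
            R += interaction                                    # information loss (redundancy)
        else:
            C += -interaction                                   # information gain (complementarity)
    alpha = R / (R + C) if (R + C) > 0 else 0.0                 # fraction of redundancy
    beta = 1.0 + alpha
    gamma = 1.0 - (C / (C + A) if (C + A) > 0 else 0.0)
    k = len(subset)
    return (A + gamma * C ** (beta / k)) / np.sqrt(k + beta * D)
```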

Fig. 6 Variation of subset complementarity with subset size

The proposed heuristic adaptively modifies the trade-off rule as the search for the optimal subset proceeds. We explain how such an adaptive criterion works. As \(\alpha \) increases, the subset becomes more redundant and the value of subset complementarity \((C_{S})\) decreases (\(C_{S}=0\) when \(\alpha =1\)), leaving little opportunity to use complementarity effectively in the feature selection process. In other words, the value of subset complementarity \(C_{S}\) is not sufficiently large to differentiate between two subsets. At best, we can expect to extract a set of features that are less redundant or nearly independent. Accordingly, as \(\alpha \) increases from 0 to 1, \(\beta \in [1,2 ]\) increasingly penalizes the subset dependence term \(D_{S}\) in the denominator and rewards subset complementarity \(C_{S}\) in the numerator.

In contrast, as \(\alpha \) decreases, the subset becomes predominantly complementary, leading to an increase in the value of \(C_{S}\). As the magnitude of \(C_{S}\) gets sufficiently large, complementarity plays the key role in differentiating between two subsets, compared to the subset dependence \(D_{S}\) in the denominator. This, however, causes a bias towards larger subsets, as the complementarity gain increases monotonically with the size of the subset. We observe that \(C_{S}\) increases exponentially with the logarithm of the subset size as \(\alpha \) decreases. In Fig. 6, we demonstrate this for three real data sets having different degrees of redundancy (\(\alpha \)). In order to control this bias, we raise \(C_{S}\) to the exponent \(\frac{1}{|S|}\). Moreover, given the way we formalize \(\alpha \), it is possible to have two different subsets, both with \(\alpha = 0\) but with different degrees of information gain \(C_{S}\). This is because the information gain \(C_{S}\) of a subset is affected both by the number of complementary feature pairs in the subset and by how complementary each pair is, which is an intrinsic property of the features. Hence it is possible that a larger subset with weak complementarity produces the same amount of information gain \(C_{S}\) as a smaller subset with highly complementary features. This is evident from Fig. 6, which shows that subset complementarity grows faster for the dataset ‘Lung Cancer’ than for ‘Promoter,’ despite the fact that the latter has lower \(\alpha \) and both have an identical number of features. The exponent \(\frac{1}{|S|}\) also takes care of such issues.

Next, we examine the scenario where \(C_{S}>A_{S}\). This is the case when a subset gains more class information from feature interaction than from the features’ individual predictive power. Notice that the source of class information, whether \(C_{S}\) or \(A_{S}\), is indistinguishable to the heuristic, as the two are linearly additive in the numerator. As a result, this produces an undue bias towards larger subset sizes when \(C_{S}>A_{S}\). To control this bias, we introduce the hyper-parameter \(\gamma \in [0,1]\), which maintains the balance between relevance and complementarity information gain. It controls the subset size by reducing the contribution from \(C_{S}\) when \(C_{S}>A_{S}\).

Fig. 7 Heuristic score variation with degree of redundancy \(\alpha \) and subset size, given \(A_{S}=200\), \(D_{S}=20\), and \(R_{S}+C_{S}=20\)

Fig. 8 Heuristic score variation with degree of redundancy \(\alpha \) and subset size, given \(A_{S}=20\), \(D_{S}=20\), and \(R_{S}+C_{S}=200\)

Fig. 9 Heuristic score variation with subset dependence and subset relevance, given \(\alpha =0.5\), \(|S|=5\), and \(R_{S}+C_{S}=20\)

Fig. 10 Heuristic score variation with subset dependence and subset relevance, given \(\alpha =0.5\), \(|S|=25\), and \(R_{S}+C_{S}=20\)

4.3 Sensitivity analysis

In Figs. 7 and 8, we show how the proposed heuristic score varies with the degree of redundancy \(\alpha \) and the subset size under different relative magnitudes of interaction information \((R_{S}+C_{S})\) and subset relevance \(A_{S}\). Figure 7 depicts a situation in which features are individually highly relevant but less interactive \((C_{S} < A_{S})\), whereas in Fig. 8 features are individually less relevant but become extremely relevant as a subset due to a high degree of feature interaction \((C_{S} > A_{S})\). In either scenario, the score decreases with increasing subset size and with increasing degree of redundancy. For a given subset size, the score is generally lower when \(C_{S} > A_{S}\) than when \(C_{S} < A_{S}\), showing that the heuristic is effective in controlling the subset size when the subset is predominantly complementary. We also observe that the heuristic is very sensitive to redundancy when the features are highly relevant; in other words, redundancy hurts much more when features are highly relevant. This is evident from the fact that the score decreases at a much faster rate with increasing redundancy when \(A_{S}\) is very high compared to \(C_{S}\), as in Fig. 7.

In Figs. 9 and 10, we show how the proposed heuristic score varies with subset relevance and subset dependence for two different subset sizes. The score increases linearly with subset relevance and decreases non-linearly with subset dependence, as we would expect from Eq. 3. The score is higher for a smaller subset than for a bigger subset under nearly identical conditions. However, for a given subset relevance, the score decreases at a much faster rate with increasing subset dependence when there are fewer features in the subset (as in Fig. 9). This phenomenon can be explained with the help of Figs. 11 and 12. For a given subset dependence \(D_{S}\), fewer features imply a higher degree of association (overlap) between features: the features share more common information and are therefore more redundant than when there is a larger number of features in the subset. Hence, our heuristic not only encourages parsimony, but is also sensitive to the change in feature dependence as the subset size changes. The above discussion demonstrates the adaptive nature of the heuristic under different conditions of relevance and redundancy, which is the motivation behind our heuristic.

Fig. 11 Fewer features in the subset for given subset dependence \(D_{S}\)

Fig. 12 More features in the subset for given subset dependence \(D_{S}\)

4.4 Limitations

One limitation of this heuristic, as evident from Figs. 9 and 10, is that it assigns a zero score to a subset when the subset relevance is zero, i.e., \(A_{S}=0\). Since feature relevance is a non-negative measure, this implies a situation in which every feature in the subset is individually irrelevant to the target concept. Thus, our heuristic does not select a subset when none of the features in the subset carries any useful class information by itself. This ignores the possibility that such features could become relevant due to interaction; however, we have not encountered datasets where this is the case.

Another limitation of our heuristic is that it considers up to 3-way feature interaction (the interaction between a pair of features and the class variable) and ignores the higher-order corrective terms. This pairwise approximation is necessary for computational tractability and is adopted in many MI-based feature selection methods (Brown 2009). Despite this limitation, Kojadinovic (2005) shows that Eq. 1 produces reasonably good estimates of mutual information for all practical purposes. The higher-order corrective terms become significant only when there exists a very high degree of dependence amongst a large number of features. It may be noted that our definitions of subset complementarity and subset redundancy, as given in Sect. 4.2, can be extended to include higher-order interaction terms without any difficulty. As more precise estimates of mutual information become available, further work will address the merit of using higher-order correction terms in our proposed approach.

5 Algorithm

In this section, we present the algorithm for our proposed heuristic SAFE.

  1. Step 1

    Assume we start with a training sample \(D(\mathbf{F },Y)\) with full feature set \(\mathbf{F }=\{F_{1},\ldots ,F_{n}\}\) and class variable Y. Using a search strategy, we choose a candidate subset of features \(\mathbf{S } \subset \mathbf{F } \).

  2. Step 2

    Using the training data, we compute the mutual information between each pair of features, \(I(F_{i};F_{j})\), and between each feature and the class variable, \(I(F_{i};Y)\). We eliminate from \(\mathbf{S }\) all constant-valued features, for which \(I(F_{i};Y)=0 \).

  3. Step 3

    For each pair of features in \(\mathbf{S }\), we compute the conditional mutual information given the class variable, \(I(F_{i};F_{j}| Y)\).

  4. Step 4

    We transform all \(I(F_{i};F_{j})\) and \(I(F_{i};F_{j}|Y)\) to their symmetric uncertainty form to maintain a scale in [0, 1].

  5. Step 5

    For each pair of features, we compute the interaction gain or loss, i.e., \(I(F_{i};F_{j};Y) =I(F_{i};F_{j})- I(F_{i};F_{j}|Y) \) using the symmetric uncertainty measures.

  6. Step 6

    We compute \(A_{S}\), \(D_{S}\), \(C_{S}\), \(R_{S}\), \(\beta \), and \(\gamma \) using information from Steps 4 & 5.

  7. Step 7

    Using information from Step 6, the heuristic determines a \(Score(\mathbf{S })\) for subset \(\mathbf{S }\). The search continues, and a subset \(\mathbf{S }_{opt}\) is chosen that maximizes this score.

The pseudo-algorithm of the proposed heuristic is presented in Algorithm 1.


5.1 Time complexity

The proposed heuristic provides a subset evaluation criterion, which can be used as the heuristic function for determining the score of any subset in the search process. Generating candidate subsets for evaluation is generally a heuristic search process, as searching \(2^n\) subsets is computationally intractable for large n. As a result, different search strategies, such as sequential, random, and complete are adopted. In this paper, we use the best-first search (BFS) (Rich and Knight 1991) to select candidate subsets using the heuristic as the evaluation function.

BFS is a sequential search that expands the most promising node according to some specified rule. Unlike depth-first or breadth-first search, which select features blindly, BFS carries out an informed search: it expands the tree by splitting on the feature that maximizes the heuristic score, and it allows backtracking during the search process. BFS moves through the search space by making small changes to the current subset, and is able to backtrack to a previous subset if that is more promising than the path being searched. Although BFS is exhaustive in its pure form, a suitable stopping criterion considerably reduces the chance of searching the entire feature space.

To evaluate a subset of k features, we need to estimate \(k(k-1)/2\) mutual information terms between pairs of features, and k mutual information terms between each feature and the class variable. Hence, the time complexity of this operation is \(O(k^2)\). To compute the interaction information, we need \(k(k-1)/2\) linear operations (subtractions). Hence the worst-case time complexity of the heuristic is \(O(n^2)\), when all features are selected; however, this case is rare. Since best-first search is a forward sequential method, there is no need to pre-compute the \(n\times n\) matrix of pairwise mutual information in advance. As the search progresses, the computation is done progressively, requiring only incremental computations at each iteration. Using a suitable criterion (a maximum number of backtracks), we can restrict the time complexity of BFS. For all practical data sets, the best-first search converges to a solution quickly. Even so, the computational speed of the heuristic slows down as the number of input features becomes very large, requiring more efficient computation of mutual information. Other search methods, such as forward search or branch and bound (Narendra and Fukunaga 1977), can also be used.
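For completeness, a much-simplified sketch of best-first forward search with a bounded number of backtracks is shown below; it is not the exact procedure of Algorithm 1, and the names and stopping rule are ours. `score_fn` would be, e.g., a closure around the `safe_score` sketch of Sect. 4.2.

```python
import heapq
from itertools import count

def best_first_search(n_features, score_fn, max_backtracks=5):
    """Expand the best-scoring subset on the open list by one feature at a
    time; stop after `max_backtracks` consecutive non-improving expansions."""
    tie = count()                                    # tie-breaker for the heap
    open_list, closed = [], set()
    for f in range(n_features):                      # seed with all singletons
        s = frozenset([f])
        closed.add(s)
        heapq.heappush(open_list, (-score_fn(s), next(tie), s))
    best_subset, best_score, backtracks = None, float("-inf"), 0
    while open_list and backtracks <= max_backtracks:
        neg, _, subset = heapq.heappop(open_list)
        if -neg > best_score:
            best_subset, best_score, backtracks = subset, -neg, 0
        else:
            backtracks += 1
        for f in range(n_features):                  # expand by one unselected feature
            child = subset | {f}
            if len(child) > len(subset) and child not in closed:
                closed.add(child)
                heapq.heappush(open_list, (-score_fn(child), next(tie), child))
    return best_subset

# Example: best_first_search(n, lambda s: safe_score(s, su_rel, su_pair, su_pair_y))
```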

6 Experiments on artificial datasets

In this section, we evaluate the proposed heuristic using artificial data sets. In our experiments, we compare our method with 11 existing feature selection methods: CFS (Hall 2000), ConsFS (Dash and Liu 2003), mRMR (Peng et al. 2005), FCBF (Yu and Liu 2003), ReliefF (Kononenko 1994), MIFS (Battiti 1994), DISR (Meyer and Bontempi 2006), IWFS (Zeng et al. 2015), mIMR (Bontempi and Meyer 2010), JMI (Yang and Moody 1999), and IAMB (Tsamardinos et al. 2003). For IAMB, 4 different conditional independence tests (“mi”, “mi-adf”, “\(\chi ^2\)”, “\(\chi ^2\)-adf”) are considered, and the union of the resulting Markov blankets is taken as the feature subset. Experiments using artificial datasets help us to validate how well the heuristic deals with irrelevant, redundant, and complementary features, because the salient features and the underlying relationship with the class variable are known in advance. We use two multi-label data sets \(D_{1}\) and \(D_{2}\) from Doquire and Verleysen (2013) for our experiment. Each dataset has 1000 randomly selected instances, 4 labels, and 8 classes. For the feature ranking algorithms, such as ReliefF, MIFS, IWFS, DISR, JMI, and mIMR, we terminate when \(I(\mathbf{F }_{S};Y) \approx I(\mathbf{F };Y)\) estimated using Eq. 1, i.e., when all the relevant features are selected (Zeng et al. 2015). For large datasets, however, this information criterion may be time-intensive; we therefore restrict the subset size to a maximum of \(50\%\) of the initial number of features when we test our heuristic on the real data sets in Sect. 7. For comparison, Zeng et al. (2015) restrict the subset to a maximum of 30 features, since the aim of feature selection is to select a smaller subset from the original features. For subset selection algorithms, such as CFS, mRMR, and SAFE, we use best-first search for subset generation.

6.1 Synthetic datasets

\(D_{1}\): The data set contains 10 features \(\{f_{1},\ldots ,f_{10}\}\) drawn from a uniform distribution on the [0, 1] interval. 5 supplementary features are constructed as follows: \(f_{11} =(f_{1}-f_{2})/2\), \(f_{12} =(f_{1}+f_{2})/2\), \(f_{13} =f_{3}+0.1\), \(f_{14} =f_{4}-0.2\), and \(f_{15} =2\,{f_{5}}\). The multi-label output \(O=[O^1\ldots O^4]\) is constructed by concatenating the four binary outputs \(O^1\) through \(O^4\), evaluated as follows. This multi-label output O is the class variable Y for the classification problem, which has 8 different class labels. For example, [1001] represents the class label formed by \(O^1=1\), \(O^2=0\), \(O^3=0\), and \(O^4=1\).

$$\begin{aligned} \left\{ \begin{array}{ll} O^1=1 \quad \text {if}\quad f_{1}>f_{2} \\ O^2=1 \quad \text {if}\quad f_{4}>f_{3} \\ O^3=1 \quad \text {if}\quad O^1 +O^2=1 \\ O^4=1 \quad \text {if}\quad f_{5}>0.8 \\ O^i=0 \quad \text {otherwise}\quad (i=1,2,3,4) \\ \end{array} \right. \end{aligned}$$
(4)

The relevant features are \(f_{11}\) (or \(f_{1}\) and \(f_{2}\)), \(f_{3}\) (or \(f_{13}\)), \(f_{4}\) (or \(f_{14}\)), and \(f_{5}\) (or \(f_{15}\)). The remaining features are irrelevant to the class variable.
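As a concrete illustration, \(D_{1}\) can be generated in R roughly as follows (sample size as in the text; the random seed and variable names are ours):

```r
set.seed(1)
n <- 1000
D1 <- as.data.frame(matrix(runif(n * 10), ncol = 10,
                           dimnames = list(NULL, paste0("f", 1:10))))
# supplementary features derived from the original ones
D1$f11 <- (D1$f1 - D1$f2) / 2
D1$f12 <- (D1$f1 + D1$f2) / 2
D1$f13 <- D1$f3 + 0.1
D1$f14 <- D1$f4 - 0.2
D1$f15 <- 2 * D1$f5

# binary outputs defining the 8-class label (Eq. 4)
O1 <- as.integer(D1$f1 > D1$f2)
O2 <- as.integer(D1$f4 > D1$f3)
O3 <- as.integer(O1 + O2 == 1)
O4 <- as.integer(D1$f5 > 0.8)
D1$Y <- factor(paste0(O1, O2, O3, O4))   # e.g. "1001"
```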

\(D_{2}\): The data set contains 8 features \(\{f_{1},\ldots ,f_{8}\}\) drawn from a uniform distribution on the [0, 1] interval. The multi-label output \(O=[O^1\ldots O^4]\) is constructed as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} O^1=1 \quad \text {if}\quad (f_{1}>0.5 \text { and } f_{2}>0.5) \text { or } (f_{1}<0.5 \text { and } f_{2}<0.5)\\ O^2=1 \quad \text {if}\quad (f_{3}>0.5 \text { and } f_{4}>0.5) \text { or } (f_{3}<0.5 \text { and } f_{4}<0.5)\\ O^3=1 \quad \text {if}\quad (f_{1}>0.5 \text { and } f_{4}>0.5) \text { or } (f_{1}<0.5 \text { and } f_{4}<0.5)\\ O^4=1 \quad \text {if}\quad (f_{2}>0.5 \text { and } f_{3}>0.5) \text { or } (f_{2}<0.5 \text { and } f_{3}<0.5)\\ O^i=0 \quad \text {otherwise}\quad (i=1,2,3,4) \\ \end{array} \right. \end{aligned}$$
(5)

The relevant features are \(f_{1}\) to \(f_{4}\); the remaining features are irrelevant to the class variable. The dataset \(D_{2}\) exhibits a higher level of feature interaction: the features are relevant only when considered in pairs. For example, \(f_{1}\) and \(f_{2}\) together define the output \(O^1\), which neither \(f_{1}\) nor \(f_{2}\) alone can do. The same observation applies to the other pairs: (\(f_{3},f_{4}\)), (\(f_{1},f_{4}\)), and (\(f_{2},f_{3}\)).
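A corresponding sketch for \(D_{2}\) makes this XOR-like pairwise structure explicit; the comparison `(a > 0.5) == (b > 0.5)` is equivalent to the two clauses of Eq. 5 because ties at 0.5 occur with probability zero for continuous uniform features. The seed and names are again ours.

```r
set.seed(2)
n <- 1000
D2 <- as.data.frame(matrix(runif(n * 8), ncol = 8,
                           dimnames = list(NULL, paste0("f", 1:8))))
# each output depends on a pair of features jointly (Eq. 5),
# so neither feature of a pair is informative on its own
xor_pair <- function(a, b) as.integer((a > 0.5) == (b > 0.5))
O1 <- xor_pair(D2$f1, D2$f2)
O2 <- xor_pair(D2$f3, D2$f4)
O3 <- xor_pair(D2$f1, D2$f4)
O4 <- xor_pair(D2$f2, D2$f3)
D2$Y <- factor(paste0(O1, O2, O3, O4))
```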

6.2 Data pre-processing

In this section, we discuss two important data pre-processing steps: imputation and discretization. We also describe the packages used to compute mutual information in our experiments.

6.2.1 Imputation

Missing data arise in almost all statistical analyses for various reasons; values may be missing completely at random, missing at random, or missing not at random (Little and Rubin 2014). In such situations, we can discard the observations with missing values, use the expectation-maximization algorithm (Dempster et al. 1977) to estimate model parameters in the presence of missing data, or use imputation. Imputation (Hastie et al. 1999; Troyanskaya et al. 2001) provides a way to estimate the missing values of features. Among the several imputation methods in the literature, we use the widely used kNN imputation. The kNN imputation method (Batista and Monard 2002) imputes a missing value of a feature using the most frequent value among the k nearest neighbors for a discrete variable, and a weighted average of the k nearest neighbors for a continuous variable, where the weights are based on a distance measure between the instance and its nearest neighbors. As some of the real datasets used in our experiments have missing values, we use kNN imputation with \(k=5\). The choice of k presents a trade-off between the accuracy of the imputed values and computation time.
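As an illustration, kNN imputation with \(k=5\) is available, for example, through the kNN() function of the VIM package in R; the toy data frame below is ours, and the exact implementation used in our experiments may differ.

```r
library(VIM)  # kNN() imputation

set.seed(4)
# toy data with missing values; in practice this would be the raw dataset
dat <- data.frame(x1 = c(runif(9), NA), x2 = c(NA, runif(9)))

# impute each missing entry from its 5 nearest neighbours;
# imp_var = FALSE drops the indicator columns that kNN() appends by default
dat_imputed <- kNN(dat, k = 5, imp_var = FALSE)
```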

Table 3 Results of experiment on artificial dataset \(D_{1}\)

6.2.2 Discretization

Computing the mutual information of continuous features requires them to be discretized. Discretization refers to the process of partitioning continuous features into discrete intervals or nominal values. However, it always incurs some discretization error or information loss, which needs to be minimized. Dougherty et al. (1995) and Kotsiantis and Kanellopoulos (2006) survey the various discretization methods in the literature. In our experiments, we discretize the continuous features into nominal ones using the minimum description length (MDL) method (Fayyad and Irani 1993). The MDL principle states that the best hypothesis is the one with the minimum description length. While partitioning a continuous variable into smaller discrete intervals reduces the value of the entropy function, too fine-grained a partition increases the risk of over-fitting. The MDL principle enables us to balance the number of discrete intervals against the information gain. Fayyad and Irani (1993) use mutual information to recursively define the best bins or intervals, coupled with the MDL criterion (Rissanen 1986). We use this method to discretize continuous features in all our experiments.
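For reference, one publicly available R implementation of the Fayyad and Irani procedure is mdlp() in the discretization package (our experiments may differ in implementation details); it expects the class variable in the last column, as in the built-in iris data used below.

```r
library(discretization)  # mdlp(): Fayyad-Irani MDL-based discretization

# mdlp() expects the class variable in the last column ('Species' for iris)
res <- mdlp(iris)
res$cutp             # MDL cut points for each continuous feature
head(res$Disc.data)  # discretized features plus the class column
```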

6.2.3 Estimation of mutual information

For all experiments, mutual information is computed using the infotheo package in R with the empirical entropy estimator. The experiments are carried out on a computer running Windows 7 with a 2.9 GHz Intel i5 processor, using the statistical package R (R Core Team 2013).

6.3 Experimental results

The results of the experiment on synthetic dataset \(D_{1}\) are given in Table 3. Except for IAMB, all feature selection methods are able to select the relevant features. Five out of 12 methods, including SAFE, select an optimal subset. mRMR selects the largest number of features, including 5 irrelevant features and 1 redundant feature, whereas IAMB selects only 1 feature. mIMR and ReliefF select the largest number of redundant features, and mRMR selects the largest number of irrelevant features.

The results of the experiment on synthetic dataset \(D_{2}\) are given in Table 4. ConsFS, FCBF, IAMB, and mIMR fail to select all relevant features. As discussed in Sect. 6.1, the features are relevant only in pairs. In the absence of its interacting partner, an otherwise useful feature becomes irrelevant, and some class labels cannot be represented; those unrepresented class labels are given in the third column of Table 4. Five out of 12 feature selection methods, including SAFE, select an optimal subset. CFS, ConsFS, mRMR, MIFS, IAMB, and mIMR fail to remove all irrelevant features, and FCBF performs poorly on this dataset. The experimental results show that SAFE can identify the relevant and interactive features effectively, and can also remove the irrelevant and redundant features.

Table 4 Results of experiment on artificial dataset \(D_{2}\)

7 Experiments on real datasets

In this section, we describe the experimental set-up, and evaluate the performance of our proposed heuristic using 25 real benchmark datasets.

7.1 Benchmark datasets

To validate the performance of the proposed algorithm, we use 25 benchmark datasets from the UCI Machine Learning Repository that are widely used in the literature. Table 5 summarizes general information about these datasets. Note that the datasets vary greatly in the number of features (max \(=\) 1558, min \(=\) 10), type of variables (real, integer, and nominal), number of classes (max \(=\) 22, min \(=\) 2), sample size (max \(=\) 9822, min \(=\) 32), and extent of missing values, which provides comprehensive testing and robustness checks under different conditions.

7.2 Validation classifiers

To test the robustness of our method, we use 6 classifiers: naïve Bayes (NB) (John and Langley 1995), logistic regression (LR) (Cox 1958), regularized discriminant analysis (RDA) (Friedman 1989), support vector machine (SVM) (Cristianini and Shawe-Taylor 2000), k-nearest neighbor (kNN) (Aha et al. 1991), and C4.5 (Quinlan 1986; Breiman et al. 1984). These classifiers are not only popular, but also have distinct learning mechanisms and model assumptions. The aim is to test the overall performance of the proposed feature selection heuristic across different classifiers.

Table 5 Datasets description

7.3 Experimental setup

We split each dataset into a training set \((70\%)\) and a test set \((30\%)\) using stratified random sampling. Since the datasets have unbalanced class distributions, stratified sampling ensures that both the training and test sets represent each class in proportion to its size in the overall dataset. For the same reason, we choose balanced average accuracy over plain classification accuracy to measure classification performance. Balanced average accuracy is a measure of classification accuracy appropriate for unbalanced class distributions. For a 2-class problem, the balanced accuracy is the average of specificity and sensitivity. For a multi-class problem, it adopts a ‘one-versus-all’ approach, estimates the balanced accuracy for each class, and then takes the average. Balanced average accuracy thus corrects for the bias toward the majority class that results from an unbalanced class distribution.
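Both steps can be written in a few lines of base R, as in the sketch below; the 70/30 proportion follows the text, while the function names are ours.

```r
# stratified 70/30 split: sample indices within each class
stratified_split <- function(y, p = 0.7) {
  train <- unlist(lapply(split(seq_along(y), y), function(i)
    i[sample.int(length(i), floor(p * length(i)))]))
  list(train = sort(train), test = setdiff(seq_along(y), train))
}

# balanced average accuracy: one-versus-all (sensitivity + specificity) / 2
# for each class, averaged over all classes
balanced_accuracy <- function(truth, pred) {
  classes <- levels(factor(truth))
  mean(sapply(classes, function(cl) {
    sens <- mean(pred[truth == cl] == cl)   # recall for class cl
    spec <- mean(pred[truth != cl] != cl)   # specificity for class cl
    (sens + spec) / 2
  }))
}
```

For example, `stratified_split(iris$Species)` returns training and test indices that preserve the class proportions of the Species variable.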

In the next step, each feature selection method is employed to select a smaller subset of features from the original ones using the training data. We train each classifier on the training set using the selected features, learn its parameters, and then estimate its balanced average accuracy on the test set. We repeat this process 100 times using different random splits of the dataset and report the average result. The accuracies obtained from the different random splits are approximately normally distributed, so the average accuracy of SAFE is compared with that of each other method using a paired t test at the \(5\%\) significance level, and significant wins/ties/losses (W/T/L) are reported. Since we compare our proposed method over multiple datasets, the p values are adjusted using the Benjamini–Hochberg procedure for multiple testing (Benjamini and Hochberg 1995). We also report the number of features selected by each algorithm and their computation time. Computation time is used as a proxy for the complexity of the algorithm.
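The significance testing is standard; a sketch in base R is given below, where the accuracy matrices are filled with random numbers purely for illustration (one row per dataset, one column per split).

```r
set.seed(3)
# illustrative accuracy matrices: 25 datasets x 100 matched random splits
acc_safe  <- matrix(runif(25 * 100, 0.70, 0.90), nrow = 25)
acc_other <- matrix(runif(25 * 100, 0.65, 0.88), nrow = 25)

# paired t test per dataset over the 100 matched splits
raw_p <- sapply(seq_len(nrow(acc_safe)), function(d)
  t.test(acc_safe[d, ], acc_other[d, ], paired = TRUE)$p.value)

# Benjamini-Hochberg adjustment across the 25 datasets
adj_p <- p.adjust(raw_p, method = "BH")

# a win (loss) is a significant positive (negative) mean difference
mean_diff <- rowMeans(acc_safe - acc_other)
wins   <- sum(adj_p < 0.05 & mean_diff > 0)
losses <- sum(adj_p < 0.05 & mean_diff < 0)
ties   <- sum(adj_p >= 0.05)
```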

For feature ranking algorithms, we need a threshold to select the optimal subset from the list of ordered features. For ReliefF, we consider the threshold \(\delta =0.05\), which is common in the literature (Hall 2000; Ruiz et al. 2002; Koprinska 2009). For the remaining ranking methods, we terminate when \(I(\mathbf{F }_{S};Y)\approx I(\mathbf{F };Y)\) or when a maximum of \(50\%\) of the initial number of features is selected, whichever occurs earlier. For ReliefF, we set \(k=5\), \(m=250\), and use an exponential decay of weights with distance (Robnik-Šikonja and Kononenko 2003). The discretized data are used both for training the model and for testing its accuracy on the test set.

Table 6 Comparison of balanced average accuracy of algorithms using NB classifier
Table 7 Comparison of balanced average accuracy of algorithms using LR classifier
Table 8 Comparison of balanced average accuracy of algorithms using RDA classifier
Table 9 Comparison of balanced average accuracy of algorithms using SVM classifier
Table 10 Comparison of balanced average accuracy of algorithms using kNN classifier
Table 11 Comparison of balanced average accuracy of algorithms using C4.5 classifier
Table 12 Summary of wins/ties/loses (W/T/L) results for SAFE

7.4 Experimental results

In this section, we present the results of the experiments and compare the accuracy, computation time, and number of features selected by each method.

7.4.1 Accuracy comparison

Tables 6, 7, 8, 9, 10 and 11 show the balanced average accuracy of the 12 feature selection methods, including SAFE, tested with 6 different classifiers on all 25 datasets, resulting in 1800 combinations. For each dataset, the best average accuracy is shown in bold font. A superscript a (b) denotes that our proposed method is significantly better (worse) than the corresponding method according to a paired t test at the \(5\%\) significance level after Benjamini–Hochberg adjustment for multiple comparisons. W/T/L denotes the number of datasets on which the proposed method SAFE wins, ties, or loses against the other feature selection methods. A summary of the wins/ties/losses (W/T/L) results is given in Table 12. The average accuracy of each method over all datasets is also presented in the “Avg.” row. The ConsFS method did not converge for 3 datasets within a threshold time of 30 min; those results are not reported.

The results show that SAFE generally outperforms (in terms of W/T/L) the other feature selection methods across the different classifier-dataset combinations. In all cases except one (MIFS with the kNN classifier in Table 10), the number of wins exceeds the number of losses, which shows that our proposed heuristic is effective under different model assumptions and learning conditions. The margin of wins over losses is particularly large for 4 of the 6 classifiers, for which SAFE wins in more than half of all datasets on average (NB \(= 60.3\%\), LR \(= 52.6\%\), SVM \(= 60\%\), C4.5 \(= 51.1\%\)). SAFE is competitive with most feature selection methods in the case of kNN, winning in \(47.4\%\) of all datasets on average. Compared to the other methods, SAFE achieves the highest average accuracy for 3 of the 6 classifiers (NB, kNN, and C4.5), with its maximum for kNN and minimum for RDA.

One general observation is that the performance of SAFE is better than that of the redundancy-based methods on average (in terms of W/T/L), which shows that complementarity-based feature selection yields a more predictive subset of features. The fact that SAFE performs well across various classifiers and domains attests to the robustness of the proposed heuristic. For example, NB is a probabilistic Bayesian classifier that assumes conditional independence among the features given the class variable, whereas LR makes no such assumption. SAFE performs well in either case, which demonstrates that its performance does not degrade when features are not conditionally independent given the class, a known limitation of CFS. CFS defeats SAFE in only 4 datasets when the NB classifier is used. In fact, SAFE focuses on the interaction gain or loss, i.e., \(I(F_{i};F_{j};Y) = I(F_{i};F_{j})- I(F_{i};F_{j}|Y)\), the difference between the unconditional and the class-conditional dependency. However, in cases where all features are pairwise redundant, it does not do better than CFS.

Compared to interaction-based feature selection methods such as DISR, IWFS, JMI, and mIMR, SAFE does better in most cases because it measures redundancy and complementarity explicitly, unlike DISR, which captures only their aggregate effect. JMI is the most competitive in terms of W/T/L results, though SAFE outperforms JMI in 11 out of 25 datasets on average. Our heuristic SAFE is a complementarity-based feature selection criterion and is expected to show superior performance when the datasets contain complementary features. To demonstrate this, we focus on two datasets, lung cancer and promoter, that are highly complementary, as shown in Fig. 6. The results show that SAFE mostly wins or ties, and loses in very few cases, on these two datasets.

Fig. 13 Plot showing variation of predictive accuracy with heuristic score for dataset ‘Dermatology’ \((\alpha = 0.80)\)

Fig. 14 Plot showing variation of predictive accuracy with heuristic score for dataset ‘Cardiology’ \((\alpha = 0.21)\)

To evaluate how well our proposed heuristic corresponds to actual predictive accuracy, we examine a plot of predictive accuracy versus heuristic score for several experimental datasets. We split each dataset into a \(70\%\) training set and a \(30\%\) test set as before, and randomly select 100 subsets of features of varying sizes ranging from 1 to n, where n is the total number of features in the dataset. For each dataset, the heuristic score is estimated using the training set, and the accuracy is determined using the test set. As an illustration, we present the results for the datasets ‘Dermatology’ and ‘Cardiology’ in Figs. 13 and 14, respectively, the former being predominantly redundant (\(\alpha =0.80\)) and the latter predominantly complementary (\(\alpha =0.21\)). The plots show that, in general, there is a correspondence between accuracy and score: as the score increases, the predictive accuracy also tends to increase in most cases.

7.4.2 Runtime comparison

Computation time is an important criterion that reflects the time complexity of an algorithm. Table 13 shows the average execution time (in seconds) taken by each feature selection method. The results show that FCBF is the fastest of all, while ConsFS is the most expensive; ConsFS does not converge to a solution for three datasets within the threshold time of 30 min. SAFE comes third after FCBF and IAMB, showing that our proposed heuristic is not very expensive in terms of computation time. Though the worst-case time complexity of SAFE is \(O(n^2)\), in practice its computation time is acceptable.

7.4.3 Number of selected features

Table 14 shows the average number of features selected by each feature selection method. IAMB selects the fewest features, while ReliefF selects the most. SAFE selects an average of 6.7 features, which is the third lowest after IAMB and FCBF. MIFS, DISR, mIMR, JMI, and ReliefF select more features than SAFE. All methods are able to remove a large number of irrelevant and redundant features.

8 Summary and conclusion

The main goal of feature selection is to find a small subset of the original features that is highly predictive of the class. In this paper, we propose a filter-based feature selection criterion that relies on feature complementarity for searching an optimal subset. Unlike the existing redundancy-based methods, which depend only on relevance and redundancy, our proposed approach also aims to maximize complementarity. Incorporating feature complementarity as an additional search criterion allows us to exploit it, resulting in a smaller and more predictive subset of features. Because redundancy is generally modeled using feature correlation, the existing redundancy-based methods penalize all dependencies, regardless of whether such dependence increases or reduces the predictive power. In contrast, our proposed approach is able to distinguish a complementary feature subset from a subset of independent features, which the existing redundancy-based approaches fail to do.

Table 13 Average execution time (s) for each iteration
Table 14 Average number of features selected

Using an information-theoretic framework, we explicitly measure complementarity and integrate it into an adaptive evaluation criterion based on an interactive approach to multi-objective optimization. A distinguishing feature of the proposed heuristic is that it adaptively optimizes relevance, redundancy, and complementarity while minimizing the subset size; such an adaptive scoring criterion is new in feature selection. The proposed method not only removes irrelevant and redundant features, but also selects complementary features, thus enhancing the predictive power of the subset. Experimental results using benchmark datasets and different classifiers show that the proposed method outperforms many existing methods for most datasets. The proposed method has acceptable time complexity and effectively removes a large number of features. This paper shows that the proposed complementarity-based feature selection method can be used to improve the classification performance of many popular classifiers on real-life problems.