1 Introduction

Nowadays, massive amounts of image data are available in daily life, including web images and remote sensing images. Numerous features have been proposed to characterize an image, including global features (color, GIST, shape, and texture) and local features (shape context and histograms of oriented gradients). For texture alone, up to 30 types of features have been proposed, such as local binary patterns (LBP) [1] and Gabor textures [2]. For color, several types also exist, such as the color histogram and the color correlogram. In general, images are described by multiple features that are complementary to each other; selecting an effective feature subset from a set of distinct features is therefore a great challenge for data representation [3].

To handle this challenge, feature selection [4–8] and subspace learning [9, 10] have been developed to obtain suitable feature representations. Feature selection is commonly used as a preprocessing step for classification, so most feature selection algorithms are designed only for better predictability, such as high prediction accuracy. Although many feature selection methods take both feature relevance and feature redundancy into account for predictability [11], they neglect stability [12]. If a feature selection method has poor stability, the selected feature subsets change significantly under variations of the training data. Therefore, evaluating feature selection methods by predictability alone may yield inconsistent rankings for data representation.

On the other hand, each feature type describes an image from a single cue and has its own statistical properties and domain-specific meaning. Unlike a scalar feature, feature types, which can be scalars, vectors, or matrices, are highly diverse in dimension and expression. However, existing methods simply ensemble the selection of each feature type [13] or concatenate all feature types into a single vector [14]. These methods ignore the relation between different feature types. Moreover, they often select a common feature subset for all classes, although this subset might not be optimal for each class. In ref. [14], a one-versus-all strategy is employed to select class-specific features. Feature selection selects a subset of the original features rather than learning a low-dimensional subspace, thereby preserving their physical meaning, which is beneficial for the understanding of data [4]. Therefore, how to select a set of feature types and evaluate their contribution to a specific class is critical for enhancing the interpretability of features.

To address the above issues, a novel feature selection method, termed SIP-FS, is proposed to improve stability and interpretability without sacrificing predictability. The main contributions of this paper are as follows. First, generalized correlation, rather than mutual information, is employed in the minimal-redundancy-maximal-relevance criterion to determine which feature types contribute to a specific class, thereby enhancing the interpretability of features. Second, a stability constraint is adopted in SIP-FS to obtain consistent rankings in the presence of data variation.

The remainder of this paper is organized as follows. Section 2 reviews related work on feature selection in terms of predictability, interpretability, and stability. Section 3 illustrates the proposed methodology together with feature selection criteria based on different combinations of predictability, stability, and interpretability. The SIP-FS algorithm is presented in Section 4. Section 5 discusses the effects of the parameters and compares the performance of different methods. Finally, Section 6 concludes this paper.

2 Related work of feature selection

2.1 Predictability

As an important technique for handling high-dimensional data, feature selection plays an important role in pattern recognition and machine learning. It can be divided into four categories: filter, wrapper, embedded, and hybrid methods [4]. In this study, we focus on filter methods based on different evaluation measures, such as distance criteria (Relief and its variants ReliefF and IRelief [15]), separability criteria (Fisher score [16]), correlation coefficients [17], consistency [18], and mutual information [11]. More details can be found in ref. [19]. In general, the one-versus-all strategy is increasingly used in feature selection methods to select class-specific features for a certain class rather than a common feature subset for all classes [14].

2.2 Interpretability

Most existing feature selection methods focus on predictability (e.g., prediction accuracy) without considering the correlation between different feature types, which weakens the interpretability of the selected results. However, different feature types convey different information, including statistical characteristics and domain-specific meanings. Given a set of distinct feature types, it remains unclear which feature types contribute to a specific class.

Haury et al. analyze the influence of feature selection methods on the functional interpretability of the resulting signatures [20]. Li et al. utilize association rule mining algorithms to improve the interpretability of the selected result without degrading prediction accuracy [21]. However, these feature selection methods give little consideration to the correlation between two feature types. For different feature types, learning a shared subspace for all classes is a popular strategy to reduce the dimensionality. Although subspace-based methods are suitable for high-dimensional data, they learn a linear or non-linear embedding transformation rather than select relevant and significant features from the original feature types.

Thus, feature selection is increasingly applied to obtain compact data representations. For example, Wang et al. [22] and Somol et al. [23] proposed to select the most discriminative feature types based on the relationships between different feature types; however, both methods are sparse feature selection approaches rather than filter methods.

2.3 Stability

Feature selection methods can produce inconsistent results with similar prediction accuracies under data variation. However, a good feature selection method should be robust to such variation, so it is necessary to measure the stability of the results of different feature selection methods. Numerous stability measures have been proposed for this purpose. For example, Somol et al. [24] proposed a series of stability measures, including feature-focused versus subset-focused measures, selection-registering versus selection-exclusion-registering measures, and subset-size-biased versus subset-size-unbiased measures. At present, a wide variety of stability measures based on physical properties are defined for the comparison of feature subsets, including the Hamming distance [25], Tanimoto distance [26], average Tanimoto index [27], Ochiai coefficient [28], and other stability measures for subsets of different sizes [24]. For example, Spearman’s correlation [26] is used to measure the stability of two weighting vectors, where the top-ranked features receive higher weights.

Many factors greatly affect the stability of feature selection, such as the number of samples and the criterion and complexity of the feature selection method. Although stability measures are widely used for evaluating selected results, they are seldom incorporated into the feature selection methods themselves. To improve stability, numerous stable feature selection methods have been developed to deal with different sources of instability. These methods can be divided into four categories: (1) ensemble methods [29–31], (2) sample weighting [32], (3) feature grouping [33], and (4) sample injection methods [34]. Among them, ensemble feature selection is the most popular. An ensemble feature selection method consists of two steps: (1) creating a set of component feature selectors and (2) aggregating the results of the component selectors into an ensemble output.

However, ensemble feature selection methods combine the selected results according to prediction accuracy, which may result in an imbalance between stability and predictability. By contrast, the proposed SIP-FS adopts a stability measure as an additional constraint in the selection criterion to balance predictability and stability. To the best of our knowledge, stability and interpretability have seldom been explored simultaneously in existing feature selection methods.

3 Methodology

This section presents feature selection criteria and their corresponding results based on predictability, stability, and interpretability, as shown in Fig. 1. Suppose an m-dimensional feature set F is extracted from l different feature types for each image, denoted by F=[f1,f2,…,f m ]. If a given feature type G(i) has m i dimensions, denoted by \(G^{(i)} = \left [f_{1}^{(i)}, f_{2}^{(i)},\ldots,f_{m_{i}}^{(i)} \right ]\) with \(\sum _{i=1}^{l}{m_{i}}=m\), then F can be written as \(F^{G}=\left [ G^{(1)},G^{(2)},\ldots,G^{(l)}\right ]=\left [ f_{1}^{(1)},\ldots,f_{m_{1}}^{(1)},f_{1}^{(2)},\ldots,f_{m_{2}}^{(2)},\ldots,f_{1}^{(l)},\ldots,f_{m_{l}}^{(l)}\right ]\). As shown in Fig. 1a, G(i) represents the i-th feature type with a specific color (green, yellow, red, etc.); moreover, G(i) has its own specific properties and dimensionality.
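For concreteness, the block layout of \(F^{G}\) can be represented programmatically as a mapping from each feature type to its column indices in the concatenated feature matrix. The sketch below follows the MIML feature dimensions described in Section 5.1; the mapping itself is only an illustrative convention, not part of the method.

```python
# Illustrative layout of F^G = [G^(1), ..., G^(l)] as a dict of column indices;
# type names and dimensions follow the MIML setting of Section 5.1.
feature_blocks = {
    "color_histogram": list(range(0, 256)),    # m_1 = 256
    "color_moments":   list(range(256, 262)),  # m_2 = 6
    "color_coherence": list(range(262, 390)),  # m_3 = 128
    # ... remaining types G^(4), ..., G^(l), with dimensions summing to m
}
```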

Fig. 1 Feature selection criteria based on predictability, stability, and interpretability. a l different types of feature, corresponding to l different colors. b–d Three criteria with different combinations. e SIP-FS

For predictability, numerous filter models have been developed for feature selection. For example, the minimal-redundancy-maximal-relevance (mRMR) criterion [11], a popular filter model, adopts the following form:

$$ f_{opt}=\arg\max(D-R) $$
(1)

where \(f_{opt}\) denotes the optimal selected feature, and D and R represent feature-class relevance and feature-feature redundancy, respectively. In particular, D and R are computed by:

$$ \max D(F,c), D =\frac{1}{|F|}\sum_{f_{i}\in F}I(f_{i};c) $$
(2)
$$ \max R(F), R = \frac{1}{|F|^{2}}\sum_{f_{i},f_{j}\in F}I\left(f_{i};f_{j}\right) $$
(3)

where |F| represents the dimensionality of the feature set, \(I(f_{i};c)\) represents the mutual information between an individual feature \(f_{i}\) in feature set F and class c, and \(I(f_{i};f_{j})\) represents the mutual information between two individual features \(f_{i}\) and \(f_{j}\) in F. From Eqs. (2) and (3), D and R in Eq. (1) are computed as the mean of all feature-class relevance and feature-feature redundancy values in F, respectively. In practice, the selection of the feature set can be achieved by a near-optimal incremental search:

$$ \bar{f_{m}}=\arg\max_{f_{i}\in F-F^{\prime}}\left[I\left(f_{i},c\right)-\frac{1}{m-1}\sum_{f_{j}\in F^{\prime}}I\left(f_{i},f_{j}\right)\right] $$
(4)

where \(F^{\prime}\) represents the (m−1)-dimensional feature subset already selected from F. Equation (4) selects the m-th feature from the candidate subset \(F-F^{\prime}\) and trades off high class relevance against low feature redundancy. As shown in Fig. 1b, the features selected from the same feature type are scattered across the ranking, which hampers the quantitative evaluation of multiple feature types and hence interpretability. In addition, the selected results may change greatly under data fluctuation.
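A minimal sketch of the incremental search in Eq. (4) is given below, assuming continuous feature values and using scikit-learn's mutual-information estimators; the function name and estimator choice are illustrative rather than prescribed by mRMR.

```python
# Greedy mRMR step of Eq. (4): pick the feature maximizing relevance minus
# mean redundancy against the already selected features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_select):
    """Greedily select n_select columns of X maximizing relevance - redundancy."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y)          # I(f_i; c) for every feature
    selected, candidates = [], list(range(n_features))

    while len(selected) < n_select and candidates:
        best_j, best_score = None, -np.inf
        for j in candidates:
            if selected:
                # mean redundancy I(f_j; f_k) against already selected features
                redundancy = np.mean(
                    [mutual_info_regression(X[:, [k]], X[:, j])[0] for k in selected]
                )
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy      # Eq. (4)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        candidates.remove(best_j)
    return selected
```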

In addition to predictability, stability is another important measure in feature selection. Various stability evaluation indexes are used only to evaluate feature selection methods rather than to improve the stability of the methods themselves [24]. To the best of our knowledge, stability is seldom considered in feature selection criteria. Therefore, a stability constraint is employed in this study to obtain robust selection results:

$$ f_{opt} = \arg\max(D-R+k{\times}S) $$
(5)

where S represents an existing stability evaluation index and k is a parameter that balances the prediction term (D−R) and the stability term S. The stability evaluation index is computed by:

$$ S(f,F)=\frac{1}{i-1}\sum_{j=1}^{i-1}S\left(F_{f},F_{j}\right) $$
(6)
$$ S\left(F_{f},F_{j}\right)=\frac{|F_{f}\cap F_{j}|}{|F_{f}\cup F_{j}|} $$
(7)

where \(F_{f}\) is the union of the already selected features and the candidate feature f considered in the current selection, \(F_{j}\) (j=1,2,…,i−1) represents the j-th previously selected feature subset, and \(|F_{f}\cap F_{j}|\) and \(|F_{f}\cup F_{j}|\) represent the sizes of the intersection and union of \(F_{f}\) and \(F_{j}\), respectively. Unlike Eq. (1), both predictability and stability are used in the feature selection criterion of Eq. (5). As shown in Fig. 1c, the stability constraint helps obtain consistent rankings.
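A minimal sketch of the stability term in Eqs. (6) and (7) is given below: the Jaccard (Tanimoto) overlap between the candidate subset and every previously selected subset, averaged over iterations. The helper name is illustrative.

```python
# Stability term of Eqs. (6)-(7): mean Jaccard overlap with earlier subsets.
def subset_stability(candidate_subset, previous_subsets):
    """Mean Jaccard similarity between a candidate subset and earlier results."""
    if not previous_subsets:
        return 0.0
    f = set(candidate_subset)
    scores = [len(f & set(prev)) / len(f | set(prev)) for prev in previous_subsets]
    return sum(scores) / len(scores)

# In Eq. (5) this value is simply added to the mRMR objective, weighted by k,
# when scoring a candidate feature:
#   score = relevance - redundancy + k * subset_stability(selected + [f], history)
```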

Similar to predictability and stability, interpretability is essential for feature selection [20]. However, mutual information fails to measure the correlation between different types of features, because the required multivariate densities are hard to estimate accurately. Consequently, neither Eq. (1) nor Eq. (5) yields interpretable results. Instead of mutual information, the generalized correlation coefficient (GCC) is adopted to measure D and R in Eqs. (1) and (5) while preserving predictability. Given v−1 feature types \(\bar {F}_{v-1}^G=\bar {G}^{(1)} \cup \bar {G}^{(2)}\cup \ldots \cup \bar {G}^{(v-1)}\) selected from the entire feature set of l types \(F^{G}={G}^{(1)}\cup {G}^{(2)}\cup \ldots \cup {G}^{(l)}\), where \(\bar {G}^{(x)}\) denotes the x-th selected feature type (x=1,2,…,v−1), the v-th type \(\bar {G}^{(v)}\) is selected from the candidate set \(\left \{F^{G} - \bar{F}_{v-1}^{G}\right \}\) according to the following condition:

$$ {\begin{aligned} \bar{G}^{(v)}\,=\,\arg\max_{G^{(j)}\in{F^{G}-\bar{F}^{G}_{v-1}}}\left[\rho\left(G^{(j)},c\right) \,-\,\frac{1}{v-1}\sum_{\bar{G}^{(i)}\in \bar{F}^{G}_{v-1}}\rho\left(G^{(j)}, \bar{G}^{(i)}\right)+k{\times}S\right] \end{aligned}} $$
(8)

where ρ represents the generalized correlation coefficient between different feature types, \(\bar {G}^{(i)}\) denotes the i-th selected feature type, and G(j) denotes a candidate feature type from the candidate set \(F^G-\bar{F}^{G}_{v-1}\). The generalized correlation coefficient reduces to Pearson’s correlation coefficient when the dimensionalities of \(\bar {G}^{(i)}\) and G(j) are both 1.

When only the GCC is used in Eq. (8), i.e., k=0, the corresponding feature selection takes predictability and interpretability into account, as shown in Fig. 1d. The selected features of the same feature type are close to each other in the ranking, but the ranking may still change greatly under data fluctuation. When k≠0 in Eq. (8), the feature selection simultaneously takes predictability, stability, and interpretability into account; this is the SIP-FS method proposed in this paper, as shown in Fig. 1e. From an interpretability point of view, the features selected by SIP-FS are meaningful class-specific features [35] obtained with the one-versus-all strategy.
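The GCC between whole feature types is the key quantity in Eq. (8). Its exact form is not restated here, so the sketch below uses the first canonical correlation as one plausible block-level measure; like the GCC described above, it reduces to the magnitude of Pearson's correlation when both blocks are one-dimensional. For the relevance term \(\rho(G^{(j)},c)\), the class labels can be passed as a single numeric column.

```python
# A block-level correlation between two feature types, sketched with the
# first canonical correlation (an assumption, not the paper's definition).
import numpy as np
from sklearn.cross_decomposition import CCA

def block_correlation(A, B):
    """First canonical correlation between feature blocks A and B (n_samples x d)."""
    cca = CCA(n_components=1)
    U, V = cca.fit_transform(A, B)   # project both blocks onto their first canonical pair
    return abs(np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```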

4 SIP-FS algorithm

SIP-FS aims to efficiently select a reasonable and compact feature subset for data representation; therefore, the selected result should be meaningful and insensitive to data fluctuation while performing well in prediction accuracy.

SIP-FS iterates until the selection becomes stable and takes the feature subset obtained at the last iteration as the final result. At the i-th iteration, k=λ1i and the stability \(S_{i}\) is computed as the mean of the stabilities between \(F_{i}\) and \(F_{j}\) (j=1,2,…,i−1), where \(F_{i}\) and \(F_{j}\) represent the i-th and the j-th selected feature subsets, respectively:

$$ S_{i} = \frac{1}{i-1}\sum_{j=1}^{i-1}S\left(F_{i},F_{j}\right) $$
(9)

where \(S\left (F_{i},F_{j}\right)=\frac {|F_{i}\cap F_j|}{|F_{i}\cup F_j|}\), as in Eq. (7). The iteration stops when the following condition is satisfied:

$$ |S_{i}-S_{i-1}| \to 0 $$
(10)

Each iteration consists of two parts: (1) selecting feature types, corresponding to steps 3 to 6 of Algorithm 1, and (2) removing redundancy from each selected feature type, corresponding to steps 7 to 12 of Algorithm 1. In the first part, feature types are selected based on Eq. (8) until the remaining feature types cannot provide additional information, as in Eq. (11):

$$ {\begin{aligned} \left| \left(D\left(\bar{F}^{G}_{v+1},c\right) \,-\,R\left(\bar{F}^{G}_{v+1}\right)\right)\,-\,\left(D\left(\bar{F}^{G}_{v},c\right) -R\left(\bar{F}^{G}_{v}\right)\right)\right| \to 0 \end{aligned}} $$
(11)

The first part yields the ranking of feature types; however, within each selected feature type, redundancy may remain. Therefore, in the second part, the redundancy of each feature type is further removed by selecting a subset. Given that m−1 features have been selected from the v-th feature type, the m-th feature \(\bar {f}_{m}^{(v)}\) is selected as follows:

$$ {\begin{aligned} \bar{f}_{m}^{(v)} = \arg\max\left[\rho\left(G^{(v)}_{m},c\right) -\frac{1}{v-1}\sum_{\bar{G}^{(i)}\in \bar{F}^{G}_{v-1}}\rho\left(G^{(v)}_{m}, \bar{G}^{(i)}\right)+k{\times}S\right] \end{aligned}} $$
(12)

where \(G_{m}^{(v)} = \bar {G}_{m-1}^{(v)}\cup f_{m}^{(v)} = \bar {f}_{1}^{(v)} \cup \bar {f}_{2}^{(v)} \cup \ldots \cup \bar {f}_{m-1}^{(v)}\cup {f}_{m}^{(v)}\) and \({f_{m}^{(v)}}\) denotes a candidate feature of the v-th type. For the v-th feature type G(v), features are added until the remaining features cannot provide additional information, as in the following equation:

$$ {\begin{aligned} &\left| \left(D\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{v}_{m+1}\right),c\right) -R\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{v}_{m+1}\right)\right)\right)\right.\\& \quad-\left.\left(D\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{v}_{m}\right),c\right) -R\left(\left(\hat{F}^{G}_{sel} \cup \hat{G}^{v}_{m}\right)\right)\right)\right| \to 0 \end{aligned}} $$
(13)

where \(\hat {F}^{G}_{sel}=\hat {G}^{(1)} \cup \hat {G}^{(2)} \cup \ldots \cup \hat {G}^{(v-1)}\) and \(\hat {G}^{(v)}_{(m+1)}=\hat {G}^{(v)}_{(m)}\cup \bar {f}_{m+1}^{(v)}\).
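The overall iteration can be outlined as follows. This is a high-level sketch assembled from the helpers sketched in Section 3 (block_correlation for ρ, subset_stability for S), under the stated assumptions (k=λ1i, λ2 as the subsampling proportion, stopping when |S_i−S_{i−1}| vanishes); the within-type pruning of Eqs. (12)-(13) is simplified to keeping whole types, so this is an outline, not the authors' reference implementation.

```python
# High-level sketch of one SIP-FS run (simplified: part 2 keeps whole types).
import numpy as np

def rank_feature_types(X, y, blocks, k, history, eps=1e-3):
    """Greedy type ranking per Eq. (8); stops when the marginal gain vanishes (cf. Eq. (11))."""
    selected, last_score = [], -np.inf
    candidates = dict(blocks)
    y_col = y.reshape(-1, 1).astype(float)
    while candidates:
        scores = {}
        for name, cols in candidates.items():
            relevance = block_correlation(X[:, cols], y_col)
            redundancy = np.mean([block_correlation(X[:, cols], X[:, blocks[s]])
                                  for s in selected]) if selected else 0.0
            stab = subset_stability(selected + [name], history)
            scores[name] = relevance - redundancy + k * stab
        best = max(scores, key=scores.get)
        if selected and scores[best] - last_score < eps:   # no additional information
            break
        selected.append(best)
        last_score = scores[best]
        del candidates[best]
    return selected

def sip_fs(X, y, blocks, lambda1, lambda2, eps=1e-3, max_iter=50, seed=0):
    """Iterate on random subsamples until the selection stabilizes (Eqs. (9)-(10))."""
    rng = np.random.default_rng(seed)
    history, prev_s = [], None
    for i in range(1, max_iter + 1):
        k = lambda1 * i
        idx = rng.choice(len(y), size=int(lambda2 * len(y)), replace=False)
        selected_types = rank_feature_types(X[idx], y[idx], blocks, k, history)
        s_i = subset_stability(selected_types, history)     # Eq. (9)
        history.append(selected_types)
        if prev_s is not None and abs(s_i - prev_s) < eps:  # stopping rule, Eq. (10)
            break
        prev_s = s_i
    return history[-1]
```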

5 Results and discussions

In this section, extensive experiments are conducted to illustrate the effectiveness of SIP-FS in terms of predictability, stability, and interpretability. Four feature selection methods, mRMR, ReliefF, En-mRMR, and En-ReliefF, are used for performance comparisons on three publicly available datasets: two web image datasets, MIML [36] and NUS-WIDE-LITE [37], and a remote sensing image dataset, USGS21 [38]. mRMR and ReliefF are commonly used filter methods, while En-mRMR and En-ReliefF are their ensemble counterparts. The one-versus-all strategy is adopted to select class-specific features for SIP-FS as well as for the comparison methods.

For the three datasets, different types of feature are extracted and normalized individually. LIBSVM [39] is used for training and classification. The images in each dataset are divided into two equal parts, one for training and the other for testing. Experiments are repeated 10 times with random splits, and the average results are reported.
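A sketch of this evaluation protocol is shown below, using scikit-learn's SVC (which wraps LIBSVM) as a stand-in classifier; the kernel and regularization settings are assumptions rather than values reported here.

```python
# Evaluation protocol sketch: per-feature normalization, repeated random
# 50/50 splits, and an SVM classifier; returns the average test accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def evaluate(X, y, selected_columns, n_repeats=10):
    """Average accuracy of an SVM over repeated random 50/50 train/test splits."""
    accs = []
    for seed in range(n_repeats):
        Xtr, Xte, ytr, yte = train_test_split(
            X[:, selected_columns], y, test_size=0.5, random_state=seed)
        scaler = MinMaxScaler().fit(Xtr)          # normalize each feature column
        clf = SVC(kernel="rbf", C=1.0)
        clf.fit(scaler.transform(Xtr), ytr)
        accs.append(clf.score(scaler.transform(Xte), yte))
    return float(np.mean(accs))
```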

5.1 Datasets

MIML consists of five classes: desert, mountain, sea, sunset, and trees, containing 340, 268, 341, 261, and 378 images, respectively. Figure 2 shows sample images of this dataset. Eight types of feature (638 dimensions in total) are used in the experiments: color histogram, color moments, color coherence, textures, tamura-texture coarseness, tamura-texture directionality, edge orientation histogram, and SBN colors, with 256, 6, 128, 15, 10, 8, 80, and 135 dimensions, respectively.

Fig. 2 MIML

NUS-WIDE-LITE contains images from Flickr.com collected by the National University of Singapore. In the experiments, images with no label or more than one label are removed, resulting in a single-label dataset with nine classes: birds, boats, flowers, rocks, sun, tower, toy, tree, and vehicle, as shown in Fig. 3. Five types of feature (634 dimensions in total) are used for the evaluation: color histogram, block-wise color moments, color correlogram, edge direction histogram, and wavelet texture, with 64, 225, 144, 73, and 128 dimensions, respectively.

Fig. 3 NUS-WIDE-LITE

USGS21 contains 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts, as shown in Fig. 4. Each class consists of 100 images of 256×256 pixels with a spatial resolution of one foot. Five types of feature (809 dimensions in total), extracted as in ref. [40], are used for the evaluation: color moment, HOG, Gabor, LBP, and Gist, with 81, 37, 120, 59, and 512 dimensions, respectively.

Fig. 4 USGS21

5.2 Effects of λ1 and λ2 on stability

In the proposed method, two parameters, λ1 and λ2, influence the stability performance. λ1 determines the value of k, which balances predictability and stability, while λ2 determines the proportion of subsamples generated in each iteration of the feature selection. A suitable combination of λ1 and λ2 is beneficial for obtaining consistent results.

The parameters are tuned for each class individually. Figure 5 shows the influence of λ1 and λ2 on stability for three different classes, where λ1 takes values in {0.0001, 0.001, 0.01, 0.1, 1, 10} and λ2 takes values in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. In general, high stability is obtained with a moderate λ1 (e.g., 0.001, 0.01, or 0.1) and a large λ2 (0.8 or 0.9), compared with other parameter combinations. A smaller λ1 tends to yield better stability, but the computational complexity increases significantly. A small λ2 may cause high fluctuation among subsamples, leading to inconsistent selection results.
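The per-class tuning can be organized as a simple grid search over the values listed above, scoring each (λ1, λ2) pair by the average Jaccard stability of the subsets selected across repeated runs. The sip_fs and subset_stability functions refer to the earlier sketches, and the tuning loop itself is an assumption about the procedure.

```python
# Grid search over (lambda1, lambda2), scored by stability across repeated runs.
import itertools
import numpy as np

def tune_lambdas(X, y, blocks, n_runs=5,
                 lambda1_grid=(0.0001, 0.001, 0.01, 0.1, 1, 10),
                 lambda2_grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the (lambda1, lambda2) pair whose repeated selections are most stable."""
    best_pair, best_score = None, -1.0
    for l1, l2 in itertools.product(lambda1_grid, lambda2_grid):
        runs = [sip_fs(X, y, blocks, l1, l2, seed=r) for r in range(n_runs)]
        score = np.mean([subset_stability(runs[i], runs[:i])
                         for i in range(1, n_runs)])
        if score > best_score:
            best_pair, best_score = (l1, l2), score
    return best_pair
```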

Fig. 5 Effects of λ1 and λ2 on stability for three specific classes. a “trees” in MIML. b “flowers” in NUS-WIDE-LITE. c “building” in USGS21

5.3 Stability analysis

Tables 1, 2, and 3 show the stability comparisons of the five methods on the three datasets, reporting the stability of each class and of the entire dataset (average). The stability value ranges from 0 to 1, where “0” and “1” indicate that the rankings of the selected results are completely inconsistent and completely consistent, respectively, across randomly repeated feature selections.

Table 1 Stability comparisons on MIML
Table 2 Stability comparisons on NUS-WIDE-LITE
Table 3 Stability comparisons on USGS21

As shown in Tables 1, 2, and 3, compared with the other methods, SIP-FS achieves a significant stability improvement for each class (except for “dense residential” and “medium residential” in Table 3) as well as for the entire dataset, indicating that SIP-FS helps select much more stable features.

In general, combining mRMR with the ensemble strategy does not bring a significant improvement in stability. Although the ensemble strategy gives ReliefF a slight stability advantage, En-ReliefF still performs worse than SIP-FS. Overall, SIP-FS performs best on the three datasets in terms of stability.

5.4 Interpretability analysis

Given a certain class, the prediction accuracy varies with the feature types. How to select feature types and measure their effectiveness for a specific class is essential for interpretability analysis. In particular, the one-versus-all strategy is combined with SIP-FS to select feature types (each containing a certain number of features) for a specific class. The effectiveness of these feature types for each class is measured by the relative contribution ratio, which is normalized by the respective maximum contribution [14].
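A minimal sketch of the relative contribution ratio is given below: each selected feature type's contribution for a class is divided by the maximum contribution among the selected types, so the most discriminative type receives a ratio of 1. How the raw contribution is scored (e.g., by the criterion value of Eq. (8)) is an assumption here.

```python
# Relative contribution ratio: normalize each type's raw contribution by the maximum.
def relative_contribution(contributions):
    """contributions: dict mapping feature-type name -> raw contribution score."""
    max_c = max(contributions.values())
    return {name: c / max_c for name, c in contributions.items()}

# Usage with illustrative scores:
# relative_contribution({"LBP": 0.8, "color_moment": 0.4})
#   -> {"LBP": 1.0, "color_moment": 0.5}
```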

Figures 6, 7, and 8 show the selected feature types for each class with their relative contribution ratios. For example, the feature types selected for “mountain” in MIML are shape and color features, as shown in Fig. 6. According to the relative contribution ratios, the selected feature types are edge orientation histogram, color coherence, color histogram, and SBN colors. The most discriminative feature type is the shape feature, and the other three are color features (color coherence, color histogram, and SBN colors), whereas some texture features (textures, tamura-texture coarseness, and tamura-texture directionality) and the redundant color feature (color moments) are removed. As shown in Fig. 7, color correlogram, edge direction histogram, and wavelet texture provide complementary information for describing each class in the NUS-WIDE-LITE dataset. In addition, block-wise color moments provide less information for most classes in this dataset and are largely redundant. In the USGS21 dataset, take the broad classes road (including freeway, overpass, and runway) and water (including beach and river) as two examples, as shown in Fig. 8. LBP is the most discriminative feature type for “road”, while color moment is the most discriminative feature type for “water”. Furthermore, as a subclass of water, “river” needs additional complementary information provided by the other four feature types (LBP, Gabor, HOG, and Gist) besides color moment. In general, SIP-FS provides a more interpretable data representation than the comparison methods.

Fig. 6 Relative contribution ratios of features for each class of MIML

Fig. 7 Relative contribution ratios of features for each class of NUS-WIDE-LITE

Fig. 8 Relative contribution ratios of features for each class of USGS21

In short, the proposed SIP-FS method provides a more interpretable means of data representation than existing feature selection methods, making more useful information available and deepening the understanding of the data.

5.5 Predictability analysis

Tables 4, 5, and 6 show the prediction accuracy of each class on the three datasets to evaluate predictability. The predictability value ranges from 0 to 1, where “0” and “1” represent complete misclassification and completely correct classification, respectively.

Table 4 Predictability comparisons on MIML
Table 5 Predictability comparisons on NUS-WIDE-LITE
Table 6 Predictability comparisons on USGS21

From Table 4, mRMR performs better than the other methods on four classes of MIML (mountains, sea, sunset, and trees). Although SIP-FS performs worse than mRMR in terms of average performance, it shows advantages over the other three methods. From Table 5, mRMR and SIP-FS perform best among all methods in terms of average performance, and each has its own accuracy advantages on certain classes; for example, the prediction accuracies of SIP-FS on boats, rocks, sun, and vehicle exceed those of mRMR. From Table 6, the average predictability of mRMR, En-mRMR, and SIP-FS is significantly better than that of the others (ReliefF and En-ReliefF). It is worth noting that although En-ReliefF obtains the highest stability on “dense residential” and “medium residential” (as shown in Table 3), it has the lowest prediction accuracy (as shown in Table 6).

In general, SIP-FS and mRMR perform best among all comparison methods on the three datasets, demonstrating that SIP-FS maintains good predictability.

To further investigate the effect of the number of selected features on predictability, Fig. 9 shows the prediction accuracy of the five feature selection methods on three different classes. In general, the prediction accuracy of the five methods tends to increase as the number of selected features increases. Desirable prediction results can be obtained by selecting only the leading features, for example 20 (“trees”), 30 (“flowers”), and 20 (“building”) features, corresponding to Fig. 9a–c.

Fig. 9 Effect of the number of selected features on predictability. a “trees” in MIML. b “flowers” in NUS-WIDE-LITE. c “building” in USGS21

5.6 Trade-off between stability and predictability

In this section, the stability-predictability trade-off (SPT) is used to provide a formal and automatic way of jointly evaluating stability and predictability, as in ref. [29]. SPT is defined as follows:

$$ SPT = \frac{2\times \text{stability}\times\text{predictability}}{\text{stability} + \text{predictability}} $$
(14)

where stability (Tables 1, 2, and 3) and predictability (Tables 4, 5, and 6) denote the average performance. SPT ranges from 0 to 1; the higher the SPT, the better the trade-off. The SPTs for the three datasets are shown in Fig. 10, from which two conclusions can be drawn: (1) compared with the other methods, SIP-FS obtains a better trade-off between stability and predictability, and (2) mRMR and ReliefF combined with the ensemble strategy achieve higher SPT than their counterparts without it.
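Equation (14) is simply the harmonic mean of the average stability and predictability; a one-line helper makes the computation explicit.

```python
# SPT of Eq. (14): harmonic mean of average stability and predictability.
def spt(stability, predictability):
    """Stability-predictability trade-off; both inputs are averages in [0, 1]."""
    return 2 * stability * predictability / (stability + predictability)

# e.g., spt(0.8, 0.7) -> about 0.747
```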

Fig. 10 SPT comparisons for three datasets

6 Conclusions

In this study, a novel feature selection method called SIP-FS is proposed to explore stability and interpretability simultaneously while preserving predictability. Given a set of distinct feature types, the relation between different feature types is measured by minimal redundancy maximal relevance based on generalized correlation. Several feature types can then be selected, and their contributions to a specific class can be determined by quantitative evaluation. Furthermore, consistent rankings can be achieved by incorporating stability into the criterion of SIP-FS. Experiments on three datasets, MIML, NUS-WIDE-LITE, and USGS21, demonstrate that stability and interpretability are significantly improved without sacrificing predictability, compared with other filter methods and their ensemble-based variants. In future work, we intend to further investigate the selection of multi-modal information using SIP-FS.