Online group streaming feature selection using entropy-based uncertainty measures for fuzzy neighborhood rough sets

Online group streaming feature selection, as an essential online processing method, can deal with dynamic feature selection tasks by considering the original group structure information of the features. Due to the fuzziness and uncertainty of the feature stream, some existing methods are unstable and yield low predictive accuracy. To address these issues, this paper presents a novel online group streaming feature selection method (FNE-OGSFS) using fuzzy neighborhood entropy-based uncertainty measures. First, a separability measure integrating the dependency degree with the coincidence degree is proposed and introduced into the fuzzy neighborhood rough sets model to define a new fuzzy neighborhood entropy. Second, inspired by both the algebra and information views, some fuzzy neighborhood entropy-based uncertainty measures are investigated and their properties are derived. Furthermore, the optimal features in each group are selected to flow into the feature space according to the significance of features, and the features with interactions are retained. Then, all selected features are re-evaluated by the Lasso model to discard the redundant ones. Finally, an online group streaming feature selection algorithm is designed. Experimental comparisons with eight representative methods on thirteen datasets show that FNE-OGSFS achieves better comprehensive performance.


Introduction
Feature selection, as an important data preprocessing technique, plays a key role in knowledge discovery, pattern recognition, and machine learning [1][2][3][4][5]. It selects the optimal feature subset to improve classification accuracy and reduce computational complexity. Traditional feature selection methods are batch methods, assuming that the feature space is fixed [6]. However, the complete feature space is not always available in real-world situations, such as high-resolution planetary image analysis during Mars rover operations [7]: obtaining the entire feature set in this scenario would mean using image data covering the entire surface of Mars, which is clearly not feasible. Therefore, dynamic feature selection has attracted the continuous attention of scholars in recent years [8][9][10][11][12].

Related work
Dynamic feature selection can be divided into feature selection with feature streams and data streams [13][14][15][16][17][18]. Online feature selection with a feature stream assumes that features flow into the feature space in batches over time, which can be further segmented into individual streaming feature selection and group streaming feature selection based on the structural information of the features [19][20][21][22].
Online individual streaming feature selection is characterized by features arriving in the feature space one by one [23][24][25][26][27][28]. Perkins et al. [29] first implemented feature selection in an online manner using a fast gradient-based heuristic. Wu et al. [30] proposed a streaming feature selection framework consisting of both online correlation analysis and online redundancy analysis. Zhou et al. [31] considered the interactions between features and efficiently selected features with interactions. Eskandari et al. [32] first proposed a new streaming feature selection method based on rough sets theory and improved it in [33]. Lin et al. [34] introduced fuzzy mutual information in multilabel learning to evaluate the quality of features for streaming feature selection scenarios. You et al. [35] proposed a causal feature selection algorithm for online streaming feature selection scenarios. The above methods can only handle scenarios where features arrive one by one but ignore the original group structure of the features [36]. For example, in the fields of drug localization [37] and image analysis [38], features mostly arrive in the feature space as groups.
Online group streaming feature selection can consider the structural information of the features and thus achieve better classification results [39,40]. Li et al. [40] used entropy and mutual information to perform online group streaming feature selection. Wang et al. [41] proposed a framework for online feature selection using prior knowledge of group structure information, including intra-group feature selection and inter-group feature selection. Yu et al. [42] extended the scalable and accurate online feature selection approach to handle the group feature selection problem in the sparse case. Liu et al. [43] proposed a new framework based on group structure analysis for online multilabel group streaming feature selection. Zhou et al. [44] designed a new online group streaming feature selection algorithm focused on feature interaction based on mutual information. Unfortunately, the information present in the real world contains many subjective concepts, such as pretty, young, and moral; these concepts have no clear boundaries and thus create ambiguity and uncertainty [45][46][47][48][49][50]. Existing streaming feature selection methods cannot deal well with tasks in the context of fuzziness and uncertainty.
The classification task learns a classification model, i.e., a classifier, from the existing training samples [51][52][53]. When test data arrive, predictions can be made based on the classifier to map the new data items to one of the classes in the given category [52]. Recently, feature selection based on rough sets and fuzzy sets from the algebra and information views has been frequently reported to measure uncertainty in classification tasks [54][55][56]. Wang et al. [57] proposed a fuzzy neighborhood rough sets model (FNRS) using parameterized fuzzy neighborhood information granules that can effectively prevent the effect of noise. Shreevastava et al. [58] proposed a new intuitionistic fuzzy neighborhood rough sets model for heterogeneous datasets, which combined intuitionistic fuzzy sets and neighborhood rough sets. An et al. [53] proposed a relative fuzzy rough set model and designed a classifier based on the maximum positive domain for the problem of large differences in the class density of the data distribution. The above studies discussed feature selection from an algebraic view, where the significance of features can only reflect the effect of the features contained in the feature subset [55,59]. Sang et al. [60] proposed a fuzzy dominant neighborhood rough sets model for possible noisy data in biased-ordered information systems. Xu et al. [61] redefined fuzzy neighborhood relations and introduced them into conditional entropy, proposing a new fuzzy neighborhood conditional entropy. Zhang et al. [62] proposed active incremental feature selection using the information entropy of introduced instances based on fuzzy rough sets. Nevertheless, the feature significance in these information view-based references merely interprets the influence of uncertainty classification on features [54,62]. Sun et al. [63] combined the fuzzy neighborhood rough sets with the neighborhood multigranulation rough sets and proposed a fuzzy neighborhood multigranulation rough sets model. Xu et al. [64] fused the self-information measure into the fuzzy neighborhood upper and lower approximations and proposed a fuzzy neighborhood joint entropy based on fuzzy neighborhood self-information. Xu et al. [65] defined multilabel fuzzy neighborhood conditional entropy and approximation accuracy to solve the classification problem under a multilabel decision system. It is confirmed that the combination of the algebra and information views can make the measurement mechanism more comprehensive [63,65].
Fuzzy rough sets theory and its applications are also widely used to solve some practical problems [66][67][68][69][70][71][72]. Xu et al. [66] proposed a fuzzy rough uncertainty measure model for tumor diagnosis and microarray gene selection. Liu et al. [68] proposed a co-evolutionary model for crime inference based on fuzzy rough sets. These reported studies evaluate features by the consistency of conditional and decision features in information granularity, ignoring the separability of decision information granularity for different conditional features. Nonetheless, the separability of conditional features is closely related to their performance in classification tasks [73,74]. Hu et al. [73] defined the aggregation degree of intraclass objects and the dispersion degree of between-class objects to measure the significance of features. The separability-based evaluation function has a profound impact on the improvement of accuracy and time efficiency. However, these fuzzy rough sets-based approaches are only applicable to traditional feature selection and cannot deal with feature selection in dynamic environments.

Our work
Motivated by this, to effectively deal with the streaming feature selection task in fuzzy and uncertain contexts, this paper proposes some fuzzy neighborhood entropy-based uncertainty measures and investigates a novel online group streaming feature selection method, named FNE-OGSFS. The main innovation points are as follows:

- To better evaluate the classification quality of features in terms of separability, we define a new separability degree (SD) by integrating the coincidence degree and the dependency degree for fuzzy neighborhood rough sets and fuse it with FNRS to define a new fuzzy neighborhood entropy. Then, we propose the concepts of fuzzy neighborhood joint entropy, fuzzy neighborhood conditional entropy, and fuzzy neighborhood mutual information. The related properties are explored and proven.

- To better discuss the measurement of online streaming feature selection from both the algebra and information views, we propose fuzzy neighborhood symmetric uncertainty. Then, we present a series of uncertainty measures, such as the significance, fuzzy neighborhood interaction gain, and contrast ratio. The related theorems are derived and proven. Furthermore, we construct an online group streaming decision system to retain features with strong approximation ability while removing redundant features as features dynamically flow into the feature space.

- Based on this, we design a new online group streaming feature selection algorithm, named FNE-OGSFS. First, the significance is used for intra-group feature selection. Second, online interaction analysis is performed on the feature groups flowing into the feature space based on the fuzzy neighborhood interaction gain and contrast ratio. Finally, redundant features are removed using the Lasso model. Experimental results on thirteen different types of real-world datasets confirm that FNE-OGSFS can effectively select the optimal feature subset.
The remainder of this paper is organized as follows.
"Preliminaries" reviews the related knowledge of FNRS and the coincidence degree. "Fuzzy neighborhood entropy-based uncertainty measures" presents the separability degree and some fuzzy neighborhood entropy-based uncertainty measures. "Online group streaming feature selection approach" develops a novel online group streaming feature selection approach. "Experimental results" provides the experimental analysis on thirteen datasets. "Conclusion and future work" concludes the paper with an outlook on the future.

Preliminaries
The FNRS is an effective model for feature selection and knowledge discovery. In this section, we review some basic concepts of fuzzy neighborhood rough sets. In addition, we introduce some basic knowledge related to the coincidence degree to facilitate the subsequent discussions.

Fuzzy neighborhood rough sets
Let DS = (U, C, D) be a decision system, where U = {x_1, x_2, ..., x_n} is a nonempty finite set called the universe, C is the set of conditional features of the samples, and D is the set of decision features of the samples. U/D = {D_1, D_2, ..., D_l} means that D divides U into l equivalence classes.
Let A ⊆ C be a subset of conditional features on U; then, a fuzzy binary relation R_A can be induced by A. R_A is called a fuzzy similarity relation when it satisfies both reflexivity and symmetry [57]. Let a ∈ A and R_a be the fuzzy similarity relation obtained by a; then, R_A can be expressed as R_A = ⋂_{a∈A} R_a.

Definition 1 Given DS = (U, C, D), let the fuzzy neighborhood radius parameter be δ (0 < δ ≤ 1), which is used to describe the similarity of samples. For any x, y ∈ U and a ∈ C, the fuzzy neighborhood similarity relation between samples x and y with respect to feature a is denoted as

$$R_{a}(x,y)=\begin{cases}1-\Delta, & \Delta\le\delta,\\ 0, & \Delta>\delta,\end{cases}$$

where Δ denotes the distance and equals |f(a, x) − f(a, y)|.
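As a small numerical illustration, the piecewise relation above can be computed for one normalized feature column; this sketch assumes exactly the thresholded form of Definition 1, while the function name and toy values are ours, not the paper's.

```python
import numpy as np

def fuzzy_similarity(feature, delta):
    """Pairwise fuzzy neighborhood similarity R_a(x, y) for one
    min-max-normalized feature column: 1 - Delta when the distance
    Delta is within the radius delta, and 0 otherwise."""
    diff = np.abs(feature[:, None] - feature[None, :])  # Delta(x, y)
    return np.where(diff <= delta, 1.0 - diff, 0.0)

# Toy feature column already scaled to [0, 1]
f = np.array([0.10, 0.15, 0.90])
R = fuzzy_similarity(f, delta=0.3)
# R[0, 1] = 0.95 (close samples); R[0, 2] = 0 (distance 0.8 > delta)
```

The resulting matrix is reflexive (R[i, i] = 1) and symmetric, as a fuzzy similarity relation requires.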

Definition 2
Given DS = (U, C, D), the fuzzy neighborhood similarity matrix of samples x and y with respect to feature a is [x]_a^δ(y) = R_a(x, y), and for any A ⊆ C, R_A(x, y) = min_{a∈A} R_a(x, y); then, the parameterized fuzzy neighborhood information granule of x with respect to A is expressed as [x]_A^δ(y) = R_A(x, y).

Definition 3 Given DS = (U, C, D), A ⊆ C, and U/D = {D_1, D_2, ..., D_l}, the fuzzy decision derived from D is denoted as FD = {FD_1, FD_2, ..., FD_l}, where FD_j is the fuzzy equivalence class of sample decisions. F_{D_j}(x) is the degree of membership, denoted by

$$F_{D_j}(x)=\frac{\left|[x]^{\delta}_{A}\cap D_{j}\right|}{\left|[x]^{\delta}_{A}\right|},$$

where |·| represents the cardinality.

Definition 4
Let A, B be two fuzzy sets on U. Then, the inclusion degree of A in B is expressed as

$$I(A,B)=\frac{|A\subseteq B|}{|U|},$$

where |A ⊆ B| denotes the number of samples whose membership degree on A is not greater than that on B.
Definition 5 Given DS = (U, C, D), let β be a variable precision parameter, A ⊆ C, and X ⊆ U. Then, the fuzzy neighborhood upper and lower approximations of X with respect to A are denoted, respectively, by

$$\overline{R}^{\beta}_{A}X=\left\{x\in U\mid I\left([x]^{\delta}_{A},X\right)>1-\beta\right\},\qquad \underline{R}^{\beta}_{A}X=\left\{x\in U\mid I\left([x]^{\delta}_{A},X\right)\ge\beta\right\}.$$

Definition 6 Given DS = (U, C, D) and A ⊆ C, the fuzzy decision generated by D is FD = {FD_1^T, FD_2^T, ..., FD_l^T}. Then, the fuzzy neighborhood positive region of D with respect to A is denoted as

$$POS^{\delta}_{A}(D)=\bigcup_{j=1}^{l}\underline{R}^{\beta}_{A}FD_{j}.$$

Definition 7 Given DS = (U, C, D) and A ⊆ C, the fuzzy neighborhood dependency degree of D in relation to A is expressed as

$$Dep^{\delta}_{A}(D)=\frac{\left|POS^{\delta}_{A}(D)\right|}{|U|}.$$
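Chaining Definitions 4-7 together gives a dependency computation that can be sketched as a toy version under explicit assumptions: granules use min-aggregation over the per-feature similarities, the inclusion degree is the crisp count of Definition 4, and β = 1. This is a hedged illustration, not the authors' exact implementation.

```python
import numpy as np

def fn_dependency(X, y, delta=0.3, beta=1.0):
    """Toy fuzzy neighborhood dependency degree Dep_A(D): the fraction
    of samples whose granule [x]_A is included (to degree >= beta) in
    some decision class, i.e., lies in the positive region."""
    n, m = X.shape
    # R_A(x, y) = min over features of the thresholded similarities
    R = np.ones((n, n))
    for a in range(m):
        diff = np.abs(X[:, a][:, None] - X[:, a][None, :])
        R = np.minimum(R, np.where(diff <= delta, 1.0 - diff, 0.0))
    in_pos = np.zeros(n, dtype=bool)          # membership in POS_A(D)
    for Dj in np.unique(y):
        FD = (y == Dj).astype(float)          # crisp decision class D_j
        for i in range(n):
            # inclusion degree I([x]_A, D_j): fraction of samples whose
            # granule membership does not exceed membership in D_j
            if np.mean(R[i] <= FD) >= beta:
                in_pos[i] = True              # x_i is in a lower approximation
    return in_pos.mean()

# Well-separated classes yield full dependency; overlapping ones do not.
dep_sep = fn_dependency(np.array([[0.0], [0.05], [0.9], [1.0]]), np.array([0, 0, 1, 1]))
dep_mix = fn_dependency(np.array([[0.0], [0.1], [0.05], [0.15]]), np.array([0, 0, 1, 1]))
```

With the toy data above, the separable classes give a dependency of 1.0 while the interleaved classes give 0.0, matching the intuition that dependency measures the approximation ability of the feature subset.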

Coincidence degree
The purpose of feature selection is to retain the features with high separability and strong approximation ability and to remove the trivial features [73]. If the coincidence degree of the original data from different categories is high and the coincidence degree of the selected data is low, then the importance of the retained features is high.
Definition 8 Given DS = (U, C, D), a ∈ C, and U/D = {D_1, D_2, ..., D_l}, let m_a^{D_j} and M_a^{D_j} denote the minimum and maximum values of feature a over the samples in class D_j, so that [m_a^{D_j}, M_a^{D_j}] is the value range of a on D_j. When [m_a^{D_i}, M_a^{D_i}] ∩ [m_a^{D_j}, M_a^{D_j}] = ∅, ϑ is a very small positive constant; in other cases, ϑ = 0.

Definition 9
Let d be the number of class pairs with D_i ≠ D_j; then, the coincidence degree of D with respect to a, denoted CD(a), is obtained over these class pairs. CD(a) represents the coincidence degree of samples between different categories and evaluates the classification ability of feature a in terms of separability. In the process of feature selection, we need to select features that decrease the coincidence degree.
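Since the exact formula is not reproduced above, the following is only an illustrative stand-in for the idea behind Definitions 8-9: measure how much the value ranges of a feature overlap across two classes, where zero overlap indicates a highly separable feature. The function and toy data are hypothetical.

```python
import numpy as np

def range_overlap(col, y, Di, Dj):
    """Length of the intersection of the value ranges [m, M] that
    feature `col` takes on classes Di and Dj (0 when disjoint)."""
    lo_i, hi_i = col[y == Di].min(), col[y == Di].max()
    lo_j, hi_j = col[y == Dj].min(), col[y == Dj].max()
    return max(0.0, min(hi_i, hi_j) - max(lo_i, lo_j))

y = np.array([0, 0, 1, 1])
separable = np.array([0.0, 0.2, 0.8, 1.0])   # class ranges are disjoint
mixed = np.array([0.0, 0.6, 0.4, 1.0])       # class ranges overlap
ov_sep = range_overlap(separable, y, 0, 1)   # 0.0: perfectly separable
ov_mix = range_overlap(mixed, y, 0, 1)       # positive: classes coincide
```

A feature whose per-class ranges barely intersect would thus score a low coincidence degree and be preferred by the selector.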

Fuzzy neighborhood entropy-based uncertainty measures
To select features with high separability and strong approximation ability, this section defines a new separability measure and a new fuzzy neighborhood entropy. The feature selection method combining algebraic view and information view can achieve better classification results, and from this perspective, an uncertainty measure based on fuzzy neighborhood entropy is constructed and some related properties are derived.

Definition 10
Given DS = (U, C, D) and a ∈ C, the separability degree of D with respect to a is expressed as SD_a^δ(D). Let A ⊆ C with A = {A_1, A_2, ..., A_r}; then, the separability degree of D with regard to A is denoted by SD_A^δ(D).

Property 1 The smaller the value of SD_A^δ(D) is, the more important feature subset A is.

SD_A^δ(D) is only related to two parts: Dep_A^δ(D) and CD(A). From the properties of FNRS and Definition 10, it can be obtained that the larger the value of Dep_A^δ(D) is, the more important feature A is and the smaller the value of SD_A^δ(D) is. Moreover, the nature of the coincidence degree shows that we need to choose features that reduce the overlap. In brief, the smaller the value of SD_A^δ(D) is, the higher the separability of A and the more important feature A is.

Definition 11 Given DS = (U, C, D) and A ⊆ C, the fuzzy neighborhood entropy of A is defined as FNE^δ(A).

Definition 13 Given DS = (U, C, D) and A, B ⊆ C, the fuzzy neighborhood conditional entropy of A with respect to B is denoted as FNCE^δ(A|B).

Definition 14 Given DS = (U, C, D) and A, B ⊆ C, the fuzzy neighborhood mutual information of A and B is represented as FNMI^δ(A; B).

Definition 15
Given DS = (U, C, D) and A, B ⊆ C, the fuzzy neighborhood symmetrical uncertainty of A and B is represented as FNSU^δ(A; B). In particular, let FNSU^δ(A; D) denote the fuzzy neighborhood symmetrical uncertainty between A and the decision D.

Remark 1 Fuzzy neighborhood symmetrical uncertainty, as a measure of uncertainty, can better measure the significance of features. From Definition 15, we can see that SD_A^δ(D) denotes the separability degree from an algebraic view, and FNMI^δ(A; B) represents the fuzzy neighborhood mutual information from an information view. Hence, FNSU^δ(A; B) can measure the uncertainty from both the algebra and information views.
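For intuition, the classical (crisp) symmetric uncertainty SU(A; B) = 2·I(A; B) / (H(A) + H(B)) can be sketched as below; the paper's FNSU replaces these Shannon quantities with the fuzzy neighborhood entropies above, so this is only a crisp analogue, not the proposed measure.

```python
import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy (in bits) of a discrete sequence
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(a, b):
    """Classical SU(A; B): mutual information normalized by the
    average of the two marginal entropies, ranging over [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))          # joint entropy H(A, B)
    mi = h_a + h_b - h_ab                    # mutual information I(A; B)
    return 2.0 * mi / (h_a + h_b)

a = [0, 0, 1, 1]
su_self = symmetric_uncertainty(a, a)            # identical features -> 1
su_ind = symmetric_uncertainty(a, [0, 1, 0, 1])  # independent features -> 0
```

The symmetry and normalization are what make SU-style measures convenient for ranking features against the decision.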

Online group streaming feature selection approach
In this section, we first formalize the problem of online group streaming feature selection. Then, a new online group streaming feature selection algorithm is proposed based on the uncertainty measures introduced in the previous section, which includes three parts: intra-group feature selection, online interaction analysis, and online redundancy analysis. Finally, we analyze the time complexity of the proposed algorithm.

Problem formalization
An online group streaming decision system is denoted as OGDS = (U, G, D, h, t), where U = {x_1, x_2, ..., x_n} is the sample space and G is the feature stream. G_t = (f_1, f_2, ..., f_{m_t})^T is the set of features in G containing m_t features, and a new feature group G_t with unknown feature space arrives at each stamp t. D is the set of decision features, h is the feature-to-class mapping function, and t is the time stamp. The problem of online group streaming feature selection is to select an optimal feature subset S from the continuously inflowing feature groups by the time the algorithm terminates.

Our new algorithm
FNE-OGSFS can be divided into three parts: online intra-group selection, online interaction analysis, and online redundancy analysis.

Definition 17
Given OGDS = (U, G, D, h, t), let G_t be the feature group arriving at time t. According to the above theory, a new intra-group streaming feature selection method is demonstrated in Algorithm 1.
Let the feature group arriving at stamp t be G_t. In Step 3, the significance of each feature in G_t is calculated according to Formula (18); if the value is greater than 0, the feature is important and is selected into the feature subset S_t in Step 5; otherwise, it is discarded. When all features in G_t have been traversed, the algorithm terminates and returns the selected feature subset S_t.
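The intra-group step can be sketched as a simple filter. The `significance` callable stands in for Formula (18), which is not reproduced here; the toy score below (absolute label correlation minus a threshold τ = 0.5) is purely hypothetical.

```python
import numpy as np

def intra_group_select(group, y, significance):
    """Sketch of Algorithm 1: keep the features of the arriving group
    whose significance with respect to the decision exceeds zero."""
    return [f for f in group if significance(f, y) > 0]

def toy_sig(col, y, tau=0.5):
    # Hypothetical stand-in significance: |correlation with labels| - tau
    return abs(np.corrcoef(col, y)[0, 1]) - tau

y = np.array([0, 0, 1, 1])
group = [np.array([0.0, 0.1, 0.9, 1.0]),    # informative feature
         np.array([0.5, 0.4, 0.55, 0.45])]  # uninformative feature
kept = intra_group_select(group, y, toy_sig)
```

Only the informative column passes the filter in this toy run; in FNE-OGSFS the score would instead be the fuzzy neighborhood significance.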

Online interaction analysis
Given OGDS = (U, G, D, h, t) with U = {x_1, x_2, ..., x_n}, let G_t be the feature group arriving at time t and A, B ⊆ G_t; then, the fuzzy neighborhood interaction gain of B with respect to A is denoted as FNIG^δ(A; B; D). When the gain is positive, the information provided by A and B together is more than the sum of the information provided by A and B alone, so A and B interact, and B is called the interaction feature of A. When the gain is not positive, the information provided by A and B together is no more than the sum of the information provided by A and B alone; hence, B is an unnecessary feature of A.
If the newly arrived features in the group are interaction features, further redundancy analysis with the already selected features is needed.
The fuzzy neighborhood contrast ratio of A_j with respect to A_i is denoted as FNCR^δ(A_i; A_j; D), where A_i, A_j ∈ A and i ≠ j.
Theorem 3 If FNCR^δ(A_i; A_j; D) > 0, then for the two candidate features A_i and A_j, A_i can provide more information that is beneficial for classification; thus, A_i is more relevant to D, and A_j is the redundant feature of A_i.

Theorem 4 If FNCR^δ(A_i; A_j; D) < 0, then A_j can provide more information that is beneficial for classification; therefore, A_j is more relevant with regard to D, A_j is a relevant feature, and A_i is the redundant feature.
Next, all selected features are re-evaluated by the sparse linear regression model Lasso [75], and the features with similar labels are eliminated based on global group information.

Online redundancy analysis
Given OGDS = (U, G, D, h, t) and U = {x_1, x_2, ..., x_n}, let the features selected in the previous stage be S = {f_1, f_2, ..., f_M}, let X ∈ R^{M×n} be the dataset matrix, and let ρ̂ ∈ R^M be the projection vector. Then, the decision class vector ŷ ∈ R^n is denoted as

$$\hat{y}=X^{T}\hat{\rho}.$$

Lasso chooses the best ρ̂ by minimizing the following objective function:

$$\hat{\rho}=\arg\min_{\rho}\left\|y-X^{T}\rho\right\|_{2}^{2}+\gamma\|\rho\|_{1},$$

where ‖·‖_1 indicates the L1 norm of a vector, ‖·‖_2 indicates the L2 norm of a vector, and γ is the parameter that regulates the amount of regularization applied to the estimator, whose value is often determined by cross-validation. Lasso effectively controls the number of selected features by setting part of ρ̂ to zero, so that the features corresponding to nonzero coefficients are selected, adding a variable constraint to the least squares method. Based on all the above investigations, the FNE-OGSFS algorithm is proposed, and the corresponding pseudocode is shown in Algorithm 2. The code is available at https://github.com/SunY-H/OGSFS.
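A minimal sketch of this re-evaluation step using scikit-learn's `Lasso` (the synthetic data and the fixed `alpha` standing in for γ are our assumptions; the paper tunes γ by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in for the selected-feature matrix; note that sklearn
# expects a samples-by-features layout, the transpose of the paper's X.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                   # 60 samples, 5 selected features
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=60)

# The L1 penalty drives the coefficients of uninformative features to zero
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)         # keep nonzero-coefficient features
```

In this toy setup, only the two informative columns retain nonzero coefficients; in FNE-OGSFS this pruning plays the role of the final online redundancy analysis.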
Let the set of features that have been selected at stamp t be S_{t-1} and the set of features selected within the group be S_t. The fuzzy neighborhood interaction gain of each feature in S_t relative to S_{t-1} is calculated in Step 3 according to Formula (19) when S_t flows into the feature space. Based on Theorems 1 and 2, the feature is unnecessary if the interaction gain is not greater than zero; otherwise, it is an interaction feature and needs to be further analyzed for redundancy. In Step 11, the fuzzy neighborhood contrast ratio of each feature in S_{t-1} with respect to the new feature is calculated according to Formula (20). Based on Theorems 3 and 4, the new feature is selected into the feature subset S if the contrast ratio is greater than zero, and the corresponding redundant features in S_{t-1} are discarded; otherwise, the new feature is discarded. When no new feature group flows into the feature space, the FNE-OGSFS algorithm terminates and returns the selected feature subset S after the online redundancy analysis.
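The control flow narrated above can be condensed into a skeleton. The four callables stand in for Formula (18), the interaction gain (19), the contrast ratio (20), and the Lasso step, none of whose exact definitions are reproduced here; the toy measures in the demo run are purely hypothetical.

```python
def fne_ogsfs(stream, y, significance, interaction_gain, contrast_ratio, final_select):
    """Control-flow skeleton of Algorithm 2 as narrated in the text."""
    S = []                                    # features selected so far
    for group in stream:                      # one feature group per stamp t
        St = [f for f in group if significance(f, y) > 0]   # intra-group step
        for f in St:                          # online interaction analysis
            if interaction_gain(f, S, y) <= 0:
                continue                      # f is unnecessary: discard it
            # discard features of S made redundant by f, then keep f
            S = [g for g in S if contrast_ratio(f, g, y) <= 0]
            S.append(f)
    return final_select(S, y)                 # online redundancy analysis

# Toy run with integer "features" and hypothetical stand-in measures:
# a new feature supersedes an old one g exactly when f == g + 10.
result = fne_ogsfs(
    stream=[[1, -2], [11, 3]],
    y=None,
    significance=lambda f, y: f,              # keep positive ints only
    interaction_gain=lambda f, S, y: 1,       # everything "interacts"
    contrast_ratio=lambda f, g, y: 1 if f == g + 10 else 0,
    final_select=lambda S, y: S,              # identity instead of Lasso
)
# result == [11, 3]: feature 1 was superseded by 11, and 3 was appended
```

The skeleton shows only the ordering of the three phases; all quality judgments are delegated to the plugged-in measures.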

Time complexity
For the FNE-OGSFS algorithm, the sample space is U = {x_1, x_2, ..., x_n}, the set of features arriving at time t is G_t = (f_1, f_2, ..., f_{m_t})^T, the set of previously selected features is S_{t-1}, and the set of features selected within the group is S_t. In the intra-group feature selection phase, each feature in G_t is traversed to calculate its significance; the computation of the parameterized fuzzy neighborhood information granule dominates the cost, and the time complexity is approximately O(m_t n). The time complexity of the inter-group feature selection phase increases with the size of the selected feature subset S_{t-1}. Let the number of features in S_{t-1} be |S_{t-1}| and the number of features selected within the group be |S_t|; then, the worst-case time complexity of this phase is O(|S_{t-1}| · |S_t| · n²). In addition, the time complexity of the Lasso algorithm is O(n). Therefore, the worst-case time complexity of the FNE-OGSFS algorithm is O(m_t n + |S_{t-1}| · |S_t| · n²).

Experimental results
The desired effect of our algorithm is to efficiently select a smaller subset of features and obtain a higher classification accuracy. In this section, we conduct a series of experiments based on some existing algorithms and datasets. To provide detailed information regarding the experiments, this section first describes the experimental preparation and then shows the effect of different parameters on the classification performance. By comparing FNE-OGSFS with some popular algorithms, we validate the effectiveness of the proposed algorithms. Finally, we conduct statistical tests on the experimental results.

Experiment setup
The evaluation framework for feature selection is outlined in Fig. 1. The details of each stage of the experiment are described below. First, the datasets are divided and preprocessed. To verify the feasibility and stability of the developed algorithm, the FNE-OGSFS method is run on thirteen public datasets in our experiments, including four UCI datasets, seven DNA microarray datasets, and two NIPS 2003 datasets. All datasets are listed in detail in Table 1. For the missing values in a dataset, such as the LYMPHOMA dataset, the conditional mean completer is used [16]. The 10-fold cross-validation approach is adopted for evaluating the classification performance under different classifiers.
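The evaluation protocol can be sketched with scikit-learn; the synthetic data and the arbitrary five-feature subset below stand in for a real dataset and the subset returned by a selector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the paper uses UCI/microarray/NIPS datasets.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
selected = [0, 1, 2, 3, 4]   # hypothetical subset chosen by the selector

# 10-fold cross-validated accuracy of a classifier on the chosen subset
scores = cross_val_score(SVC(), X[:, selected], y, cv=10)
mean_acc = scores.mean()
```

Swapping `SVC()` for KNN, NB, or CART estimators reproduces the multi-classifier comparison described later.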
Second, the feature selection methods are selected. In this subsection, FNE-OGSFS is compared with eight state-of-the-art feature selection methods, including two online group streaming feature selection methods (OGSFS-FI [44], Group-SAOLA [42]), three online individual streaming feature selection methods (Alpha-investing [25], SFS-FI [31], OFS-A3M [26]) and three FNRS-based methods (FNRS [57], FNCE [67], FNPME-FS [63]). Note that the FNRS-based methods cannot deal with the online feature selection task, and some parameters contained in the above comparison methods need to be specified in advance; here, we adopt the parameter values or value ranges reported in the original papers.
Finally, the classifiers and evaluation metrics are determined. Four classical classifiers, including support vector machine (SVM), K-nearest neighbor (KNN), naive Bayes (NB), and classification and regression tree (CART), are used to evaluate the classification performance.

The parameter analysis of the FNE-OGSFS
There are two parameters, δ and G, in the FNE-OGSFS algorithm. The parameter δ is used to adjust the size of the fuzzy neighborhood, and the parameter G is applied to control the size of the group. We set the value of δ from 0.1 to 1 with an interval of 0.05 [60]. Since the datasets do not have a priori group structure information, the experiment obtains the group structure information through an artificially specified group size to improve time efficiency. The values of G are set to 5, 10, 20, 30, and 60 for low-dimensional datasets and 50, 100, 200, 400, and 800 for high-dimensional datasets [44]. In this subsection, we focus on the effect of different parameters on the predictive accuracy, number of selected features, and running time.
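The parameter grid described above is easy to enumerate; the values are those stated in the text, while the variable names are ours.

```python
import numpy as np

# delta: fuzzy neighborhood radius, varied from 0.1 to 1 in steps of 0.05
deltas = np.round(np.arange(0.10, 1.0 + 1e-9, 0.05), 2)

# G: group size, chosen per dataset dimensionality
G_low = [5, 10, 20, 30, 60]            # low-dimensional datasets
G_high = [50, 100, 200, 400, 800]      # high-dimensional datasets

# Full (delta, G) grid explored for a low-dimensional dataset
grid_low = [(d, g) for d in deltas for g in G_low]
```

Each (δ, G) pair is then scored by accuracy, subset size, and running time, as reported in the figures that follow.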
In terms of predictive accuracy, the variation in predictive accuracy with the parameters for the thirteen datasets on SVM is shown in Figs. 2 and 3, where the four datasets in Fig. 2 are low-dimensional and the nine datasets in Fig. 3 are high-dimensional. The experimental results obtained using KNN, NB, and CART are roughly consistent with SVM. Figure 2 indicates that the different parameters have a certain impact on the classification performance of low-dimensional datasets. In detail, the parameter δ has a deeper influence on some datasets, such as the Sonar and Ionosphere datasets, where the predictive accuracies are generally higher when δ is less than 0.5. The classification performance of the Wdbc and Wpbc datasets obviously depends more on the parameter G, which can achieve higher predictive accuracies when G is larger, while the influence of δ is not significant. As seen in Fig. 3, the predictive accuracy on most of the high-dimensional datasets changes significantly with the parameters. The DLBCL, LYMPHOMA and Lung Cancer datasets can achieve better predictive accuracies when both G and δ are larger. The predictive accuracies of the LEUKEMIA, Ovarian Cancer, ARCENE and MADELON datasets are more influenced by parameter G; when parameter δ is constant, the predictive accuracy improves significantly with increasing G. The applicability of parameter δ varies greatly for different datasets; e.g., dataset COLON is more suitable for a smaller δ, and dataset SRBCT is more suitable for a larger δ. Overall, all datasets can achieve higher predictive accuracy in most parameter regions.
In terms of running time and number of selected features, due to space limitations, three different types of datasets (Wpbc, LEUKEMIA, and ARCENE) are selected as representatives in this subsection to test the experimental effects under different parameters, and the results are shown in Figs. 4 and 5, respectively. Figure 4 shows that the running time increases significantly only when δ is small in the low-dimensional dataset. The running time and parameter G are closely related in the high-dimensional datasets. This is because as the group size increases, more complex matrix operations occur in the calculation of the fuzzy neighborhood information granule, which in turn consumes more time. Overall, the experimental results demonstrate the effectiveness of FNE-OGSFS in selecting the optimal feature subsets for different types of datasets. It should be noted that the parameters corresponding to the best predictive accuracy differ across the thirteen datasets. Therefore, the parameters need to be determined in advance before feature selection so that each dataset achieves the best balance among higher accuracy, shorter running time, and greater compactness.

Comparison with other algorithms
In this subsection, the performance of FNE-OGSFS and its rivals in terms of the predictive accuracy, number of selected features and running time are analyzed. Tables 2, 3, 4 and 5 show the predictive accuracies of the KNN, SVM, NB, and CART classifiers. The last two rows list the win/tie/lose (abbreviated as W/T/L) counts and the average predictive accuracies of the algorithms on all datasets, with bold font indicating the highest predictive accuracy. Tables 6 and 7 show the number of selected features and running times of the nine algorithms, respectively. Specifically, we discuss the following.
To more intuitively confirm the algorithm's effectiveness, we plotted spider web graphs to depict the average predictive accuracy on each classifier, as shown in Fig. 6, where the red line represents the predictive accuracy of our proposed algorithm on each dataset. Tables 2, 3, 4 and 5 and Fig. 6 show that FNE-OGSFS performs significantly better than the other comparison algorithms in terms of overall classification performance. The average predictive accuracy reaches the maximum on all classifiers, and the win counts are the highest among all comparison algorithms. Through intra-group feature selection, our proposed algorithm selects features with high significance. During the online interaction analysis, FNE-OGSFS retains the features with interactions. The experimental results show that this strategy is effective and that the selected features can achieve high predictive accuracy. Compared with the online streaming feature selection methods, FNE-OGSFS performs significantly better on genetic datasets because it can handle the fuzziness and uncertainty of genetic datasets well by using the uncertainty measures based on fuzzy neighborhood symmetric uncertainty. Compared with the FNRS-based feature selection methods, our algorithm has a significant advantage because it incorporates the coincidence degree, which allows the selection of highly separable features that are beneficial for classification. Comparing the numbers of selected features, we find that the proposed algorithm can achieve feature reduction. In the last part of FNE-OGSFS, namely, the online redundancy analysis, redundant features can be effectively eliminated, which helps select fewer features. Although Group-SAOLA and SFS-FI select fewer features, they are far less accurate than our proposed algorithm and the other comparison algorithms. The features removed in the redundancy analysis phase of our algorithm also do not degrade the classification performance.
By comparing the time used by the algorithms, FNE-OGSFS can achieve high efficiency when dealing with low-dimensional datasets. However, its performance is poor when dealing with high-dimensional datasets such as the Lung Cancer and Ovarian Cancer datasets because the process of computing fuzzy neighborhood granules consumes considerable time. This situation is more evident in all three comparison algorithms based on FNRS. Our algorithm requires re-evaluation of the selected features and thus runs slower. However, since we evaluate the interactivity and redundancy of the selected features, we select more favorable and fewer features for classification, which performs better in terms of compactness and accuracy.
In conclusion, FNE-OGSFS provides the best overall performance on the four classical classifiers. Although FNE-OGSFS has a slightly longer running time, it achieves the best overall classification accuracy.

Table 2 Comparison of predictive accuracies on classifier KNN

Statistical significance analysis
To further explore the generalization ability of the FNE-OGSFS algorithm systematically, the Friedman test and the Nemenyi post-hoc test are performed in this subsection to assess the statistical significance of the results [76]. The average predictive accuracies of each algorithm on the thirteen datasets shown in Tables 2, 3, 4 and 5 are ranked from lowest to highest before the Friedman test is applied, and tied algorithms share the average of their rankings. The Friedman statistic is described as

$$\chi_F^2 = \frac{12n}{h(h+1)}\left[\sum_{j=1}^{h} R_j^2 - \frac{h(h+1)^2}{4}\right], \qquad F_F = \frac{(n-1)\,\chi_F^2}{n(h-1) - \chi_F^2},$$

where $n$ and $h$ are the numbers of datasets and algorithms, respectively, and $R_j$ ($j = 1, 2, \ldots, h$) denotes the average ranking of the $j$-th algorithm over all datasets. The variable $\chi_F^2$ obeys the $\chi^2$ distribution with $h-1$ degrees of freedom, and the variable $F_F$ obeys the $F$ distribution with $h-1$ and $(h-1)(n-1)$ degrees of freedom. To further quantify the differences between the algorithms, the critical difference (CD) of the mean rankings in the Nemenyi test is calculated by

$$CD = q_\alpha \sqrt{\frac{h(h+1)}{6n}},$$

where $\alpha$ represents the significance level of the Nemenyi test and $q_\alpha$ represents the critical value corresponding to the number of comparison algorithms at that significance level.
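The two statistics above can be computed directly from an accuracy matrix. The following is an illustrative sketch (the helper name `friedman_stats` is our own, not from the paper), which uses SciPy's `rankdata` so that tied accuracies receive averaged rankings, as described in the text:

```python
import numpy as np
from scipy.stats import rankdata

def friedman_stats(acc):
    """Compute the Friedman chi-square and F statistics.

    acc: (n datasets) x (h algorithms) matrix of predictive accuracies.
    Returns (chi2_F, F_F, average ranks R_j).
    """
    n, h = acc.shape
    # Rank algorithms on each dataset; higher accuracy should get a
    # better (lower) rank, so rank the negated accuracies. Ties are
    # assigned the average of the ranks they span.
    ranks = np.vstack([rankdata(-row) for row in acc])
    R = ranks.mean(axis=0)  # average ranking of each algorithm
    chi2_f = 12 * n / (h * (h + 1)) * (np.sum(R ** 2) - h * (h + 1) ** 2 / 4)
    f_f = (n - 1) * chi2_f / (n * (h - 1) - chi2_f)
    return chi2_f, f_f, R
```

With the paper's setting of thirteen datasets and nine algorithms (n = 13, h = 9), the resulting $F_F$ value would be compared against the $F(8, 96)$ distribution.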
The average rankings of predictive accuracy of each algorithm over all datasets were obtained according to the statistical tests described in [76]. For the predictive accuracies on the thirteen datasets in Tables 2, 3, 4 and 5, the Friedman tests were performed to compare FNE-OGSFS with the other algorithms. The null hypothesis of the Friedman test is that all algorithms perform equally in terms of predictive accuracy. Table 8 reports the average ranking of the nine algorithms and the values of $\chi_F^2$ and $F_F$ on the four classifiers.
With n = 13 and h = 9, the $F_F$ distribution has 8 and 96 degrees of freedom. By consulting the statistical tables, the critical value of the $\chi^2$ distribution with 8 degrees of freedom at $\alpha = 0.1$ is 13.36. From Table 8, we can see that the values of $\chi_F^2$ and $F_F$ on the four classifiers are all greater than the corresponding critical values of the $\chi^2(8)$ and $F(8, 96)$ distributions. That is, all null hypotheses are rejected, which demonstrates that the performances of these algorithms differ significantly. Post-hoc tests are then performed to locate the differences between the algorithms. The critical value $q_{0.1}$ is 2.855 when h = 9, so the critical difference $CD_{0.1}$ is 3.0663. To compare the differences among the algorithms more intuitively, a graph is introduced that connects the methods not differing significantly from each other, in which the critical distances among all algorithms are clearly illustrated. Fig. 7 shows the comparison of FNE-OGSFS with the other algorithms on the four classifiers, where the critical value and its range are shown above the axis, the coordinate axis plots the average ranking value of each algorithm, and the lowest average ranking appears on the left-hand side. Horizontal lines connect the algorithms with no significant difference; that is, any two algorithms whose average rankings differ by less than the CD value are connected by a red line.
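The reported critical difference can be verified from the Nemenyi CD formula with the paper's parameters (the helper name `nemenyi_cd` is our own):

```python
import math

def nemenyi_cd(q_alpha, h, n):
    """Critical difference of the Nemenyi post-hoc test:
    CD = q_alpha * sqrt(h*(h+1) / (6*n)),
    where h is the number of algorithms and n the number of datasets."""
    return q_alpha * math.sqrt(h * (h + 1) / (6 * n))

# Paper's setting: q_0.1 = 2.855, h = 9 algorithms, n = 13 datasets,
# which reproduces the reported CD_0.1 of about 3.066 (up to rounding).
cd = nemenyi_cd(2.855, 9, 13)
```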
As shown in Fig. 7, the differences among the nine algorithms are evident, and FNE-OGSFS performs significantly better than the other algorithms on the four classifiers. In some cases, the significance of the algorithms varies slightly across classifiers; e.g., Group-SAOLA has the lowest average ranking on the NB and CART classifiers but not on the others. FNE-OGSFS falls into the same group as FNPME-FS and OGSFS-FI, which means the differences among these three algorithms are not obvious. However, the differences in their average rankings are very close to the critical value on most classifiers, so it can still be concluded that FNE-OGSFS outperforms these two compared methods. In summary, FNE-OGSFS outperforms the other eight compared algorithms overall.

Conclusion and future work
In this paper, we proposed a novel online group streaming feature selection method, named FNE-OGSFS. First, a new separability measure was investigated, and some fuzzy neighborhood entropy-based uncertainty measures were developed, inspired by both algebra and information views. Second, intra-group feature selection was performed according to the significance of features. Then, interactive feature selection was devised in an online manner for features that flow into the feature space. Finally, the Lasso model was applied to online redundancy analysis. Compared to some state-of-the-art online streaming feature selection methods and traditional feature selection methods based on FNRS, FNE-OGSFS demonstrated better comprehensive performance.
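The final stage of the pipeline, online redundancy analysis with the Lasso model, can be sketched as follows. This is a minimal illustration under our own assumptions (the function name `lasso_redundancy_filter` and the regularization strength `alpha` are hypothetical, not the paper's exact procedure), using scikit-learn's `Lasso` and keeping only the features that receive non-zero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_redundancy_filter(X, y, alpha=0.05):
    """Re-evaluate the currently selected features with a Lasso model
    and discard redundant ones (illustrative sketch, not the paper's
    exact algorithm).

    X: (samples, selected features) matrix; y: target labels/values.
    Returns the indices of features with non-zero Lasso coefficients.
    """
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X, y)
    # Features shrunk to exactly zero are treated as redundant.
    return np.flatnonzero(model.coef_ != 0)
```

In an online setting, a filter of this kind would be re-run each time a new group of features flows into the selected feature space, so the retained subset stays compact.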
In future work, we will further optimize the method, focusing on how to select the best parameters automatically, improve the computational efficiency of the algorithm, and achieve a better balance between accuracy and efficiency on high-dimensional datasets. Moreover, research on incremental feature selection with feature streams and data streams based on fuzzy neighborhood rough sets will receive more attention.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.