1 Introduction

In the traditional machine learning framework, each sample corresponds to only one label; this is called single-label learning. Single-label learning is the most well-studied and widely used machine learning framework [1]. In single-label learning, an instance describes the properties of a real-world object, and the instance is associated with a single class label capturing the object's semantics to form a sample. Single-label learning achieves good results when the target instance has an explicit single class label.

In the real world, however, an object rarely has only one semantic meaning. In fact, most objects are associated with more than one concept at the same time [2,3,4,5]. For example, an elderly patient may suffer from several diseases, including diabetes, hypertension, and coronary heart disease; a picture of a tiger in a forest may be associated with multiple keywords, such as “tiger” and “tree.” Because polysemous objects no longer have a single semantic meaning, a single-label learning framework that considers only a single explicit semantics is unlikely to achieve good results. The most natural way to reflect this situation is to assign multiple category labels to each example of a polysemous object; this set of category labels is referred to as a label subset. The learning paradigm that labels sample examples in this way is known as multi-label learning [6]. Multiply labeled objects are ubiquitous in all areas of life, and the multi-label learning paradigm has been widely used in text classification [7, 8], bioinformatics [9, 10], sentiment recognition [11, 12], and information retrieval [13, 14].

Like traditional single-label learning, multi-label learning faces many challenges. With respect to the data structure of labeled instances, the problems faced by multi-label tasks include large feature dimensionality [15, 16], large numbers of labels [17], label imbalance [18], and flow features [19]. For multi-label learning tasks, the dimensionality of multi-label data is large, often with thousands or tens of thousands of features [20, 21]. For a given learning task, a large proportion of these high-dimensional features may be redundant or invalid. Compared with single-label data, high-dimensional multi-label data may cause various problems for learning, including overfitting, longer computation times, and higher memory consumption [22,23,24,25]. Therefore, reducing the dimensionality of multi-label data is a priority. Multi-label dimensionality reduction is a data preprocessing technique that can be used to remove redundant and irrelevant features and reduce the dimensionality of high-dimensional feature spaces. Common methods that have been proposed for multi-label dimensionality reduction include LDA [26], MDDM [27], MLST [28], PMU [29], and PL-ML [30]. Of these multi-label dimensionality reduction techniques, multi-label feature selection methods have received much attention.

There are two main approaches to multi-label dimensionality reduction: multi-label feature extraction and multi-label feature selection. Multi-label feature extraction methods, such as LDA, MDDM, and MLST, reduce the dimensionality of the feature space using spatial mapping or spatial transformation techniques, but these destroy the structural information of the original feature space, obscure the physical meaning of the features, and lack a semantic interpretation. Although methods such as PMU and PL-ML can improve learning performance, they share a common limitation: a complete feature set must be collected before feature selection, and no attention is paid to the correlation between labels. In contrast to multi-label feature extraction methods, feature selection methods do not perform any feature space transformation or mapping; instead, they preserve the original spatial structure. A feature selection method ranks the features by importance in the original feature space and selects the subspace that best represents the semantics of the feature space, using this subspace to represent the original feature space to the greatest extent possible [31, 32]. Thus, multi-label feature selection methods preserve the physical meaning of the feature space well, which is an advantage over feature extraction methods [33].

As the amount of multi-label data has increased, many feature selection methods for multi-label learning have been developed. These methods fall into three main categories: filter, wrapper, and embedded methods. A filter method first selects the features and then trains the classifier, so the feature selection process is independent of the classifier; this is equivalent to filtering the features first and then training the classifier with a subset of the features [34, 35]. Wrapper methods directly use the final classifier as the evaluation function for feature selection, to choose the optimal subset of features for a given classifier. Because they rely on a predetermined classifier and require multiple runs of the classifier to evaluate the quality of the selected features, wrapper methods are often computationally expensive [36, 37]. Embedded methods combine feature selection with classifier learning, so that feature selection is performed during the learning process. They find a subset of features by jointly minimizing empirical errors and penalties, which can be approximated as a continuous optimization problem; to remove irrelevant and noisy features, the feature selection matrix is usually subjected to sparse regularization [38].

Filter methods are independent of the specific learning task and model. Among other advantages, they are usually more efficient, less computationally expensive, and more general than embedded models; therefore, we focus on filter methods in this paper. Most existing filter algorithms for multi-label problems convert the multi-label problem into a single-label problem. Lee et al. [29] proposed a method that converts multiple labels into multiple binary single labels, and then used an evaluation method to evaluate each feature of each label individually to obtain a global feature ranking. However, this approach ignores the inherent correlation in multi-label data and the connection between labels and features. Doquire et al. [39] converted multiple labels into a single label consisting of multiple classes, and then solved the feature selection problem for multi-class single labels. However, this method may dramatically increase the complexity of the feature selection problem.

In contrast to single-label learning, an instance in multi-label learning corresponds to multiple labels, and these labels are often interrelated and interdependent. For example, in a set of instances in which the objects are images, “animal” and “nature” often appear in the same image; in a set of instances in which the objects are documents, a document is often associated with multiple topics, such as “politics” and “economics.” However, existing multi-label feature selection methods usually fail to consider label importance or the correlation between labels. When we perform feature selection, we can focus on the correlation between labels and use this correlation to better select features [40,41,42]. Yu et al. [18] constructed a multi-label classification method based on the uncertainty between the feature space and the label space. Elisseeff and Weston [9] proposed a large-margin ranking system, which shares many properties with support vector machines, to learn the ranking of labels for each instance. This paper proposes a neighborhood rough set (NRS) model based on label weights and label correlations, which can effectively perform feature selection for multi-label problems.

In most practical applications of multi-label learning, the feature space is usually not fixed in advance: while the number of samples stays the same, features arrive in the feature space incrementally, as a flow of feature vectors over time. Such features are known as flow features. For example, on the social networking platform Twitter, trending topics change dynamically over time. When a trending topic appears, it is always accompanied by a fresh set of keywords, and these fresh keywords can be used as key features to distinguish trending topics. Multi-label flow feature selection assumes that features arrive dynamically over time [43] and performs feature selection as each feature arrives, so as to maintain an optimal subset of features at all times [44, 45]. Many researchers have attempted to address the challenges posed by flow features. For example, Zhang et al. [46] proposed the use of global features to process flow feature data. Yu et al. [47] conducted a theoretical analysis of the pairwise correlation between features in the currently selected feature subset and adopted online pairwise comparison techniques to solve the flow feature problem. An online flow feature selection method must satisfy three basic conditions. First, it should not require any prior knowledge to be provided. Second, it should support efficient incremental updates to the selected features. Third, it should be able to make accurate predictions at each update. In this paper, we mainly use NRS to select flow feature data. The main motivation is that NRS can process mixed types of data without destroying the neighborhood and order structure of the data. In addition, feature selection based on NRS does not require any prior knowledge of the feature space structure and therefore seems to be an ideal tool for online streaming feature selection.

The main contributions of this paper are the following:

  1. A new form of neighborhood granularity is calculated using the average nearest neighbor method, and the label weights are calculated using the mutual information method to obtain the label correlation. The neighborhood granularity and label correlation are combined to construct a new NRS relationship and feature importance model.

  2. The traditional NRS model is generalized to adapt it to multi-label learning. We propose a static multi-label feature selection algorithm based on the above NRS.

  3. We propose a new multi-label flow feature selection algorithm that combines the static multi-label feature selection algorithm with an online importance update framework.

The rest of the paper is organized as follows. Section 2 introduces related concepts, including multi-label learning and NRS. Section 3 presents an NRS model based on label weights and label correlations, including a static algorithm and a dynamic flow feature algorithm. We report our experimental results in Sect. 4 and present our conclusions in Sect. 5.

2 Preliminaries

2.1 Multi-label learning

\(N D T=\langle U, F, L\rangle\) is a multi-label decision system. \(U=\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\}\) is the set of instances, drawn from the sample space \(X=R^{m}\). \(F=\left\{ f_{1}, f_{2}, \ldots , f_{m}\right\}\) represents the m-dimensional feature space and \(L=\left\{ l_{1}, l_{2}, \ldots , l_{k}\right\}\) represents the k-dimensional label space. For an instance \(x_{i} \in U\), \(x_{i}=\left\{ F_{i 1}, F_{i 2}, \ldots , F_{i m}\right\}\) is the m-dimensional feature vector describing \(x_{i}\), and \(y_{i}=\left\{ y_{i}^{1}, y_{i}^{2}, \ldots , y_{i}^{k}\right\} \in Y=L\) is the k-dimensional label vector corresponding to \(x_{i}\). The task of multi-label learning is to find a mapping \(f{:}\, X \rightarrow Y\): when \(x_{i}\) carries the label \(l_{j}\), the corresponding entry \(y_{i}^{j}\) is 1; otherwise, it is −1. That is, \(y_{i}^{k}=1\) means that the sample \(x_{i}\) belongs to the label category \(l_{k}\) [48].
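To make the notation concrete, here is a minimal sketch (our own illustration; the array names and toy values are not from the paper) that stores a small multi-label decision system as a feature matrix and a label matrix with entries in {−1, +1}:

```python
import numpy as np

# Toy multi-label decision system NDT = <U, F, L>:
# n = 4 instances, m = 3 features, k = 2 labels.
X = np.array([[0.2, 1.5, 3.0],
              [0.1, 1.4, 2.9],
              [0.9, 0.3, 0.5],
              [1.0, 0.2, 0.4]])      # row i is the feature vector of instance x_i

Y = np.array([[ 1, -1],
              [ 1, -1],
              [-1,  1],
              [ 1,  1]])             # Y[i, j] = 1 iff x_i carries label l_j, else -1

n_samples, n_features = X.shape
n_labels = Y.shape[1]
```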

2.2 Neighborhood rough set

Given a decision system \(N D T=\langle U, C, D\rangle , U=\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\}\) represents a non-empty set of instances, that is, the set composed of all samples, \(C=\left\{ a_{1}, \ldots , a_{N}\right\}\) represents the attribute set corresponding to the sample, and D represents the set of decision attributes.

For a given parameter \(\delta\) and feature set C, the \(\delta\)-neighborhood relation on U can be determined. We then call the decision system a neighborhood decision system: \(N D S=\langle U, C \cup D, \delta \rangle\).

Definition 1

Given an N-dimensional real space \(\Omega , \Delta {:}\, R^{N} \times R^{N} \rightarrow R\), we say that \(\Delta\) is a metric on \(R^{N}\) if \(\Delta\) satisfies the following constraints:

  (1) \(\Delta \left( x_{1}, x_{2}\right) \ge 0\), and \(\Delta \left( x_{1}, x_{2}\right) = 0\) if and only if \(x_{1}=x_{2}\), \(\forall x_{1}, x_{2} \in R^{N}\);

  (2) \(\Delta \left( x_{1}, x_{2}\right) =\Delta \left( x_{2}, x_{1}\right) , \forall x_{1}, x_{2} \in R^{N}\);

  (3) \(\Delta \left( x_{1}, x_{3}\right) \le \Delta \left( x_{1}, x_{2}\right) +\Delta \left( x_{2}, x_{3}\right) , \forall x_{1}, x_{2}, x_{3} \in R^{N}\).

Definition 2

For \(\forall x_{i} \in U\) and a feature subset \(B \subseteq C\), we define the \(\delta\)-neighborhood of \(x_{i}\) with respect to B as:

$$\begin{aligned} \delta _{B}\left( x_{i}\right) =\left\{ x_{j} \mid x_{j} \in U, \Delta _{B}\left( x_{i}, x_{j}\right) \le \delta \right\} \end{aligned}$$
(1)

where \(\delta >0\); the set of instances is granulated by the metric \(\Delta _{B}\left( x_{i}, x_{j}\right)\). We call \(\left\langle \Omega , \Delta _{B}\right\rangle\) the metric space, and \(\delta _{B}\left( x_{i}\right)\) the \(\delta\)-neighborhood information particle generated by \(x_{i}\). In this manner, we granulate the neighborhoods of all objects in the universal space.
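As an illustration of Eq. (1) (a sketch of our own; the Euclidean metric and the chosen value of δ are assumptions, not prescribed by the paper), the δ-neighborhood of every instance under a feature subset B can be computed as follows:

```python
import numpy as np

def delta_neighborhood(X, B, delta):
    """delta_B(x_i) = { x_j in U : Delta_B(x_i, x_j) <= delta }  (Eq. 1).

    X     : (n, m) feature matrix
    B     : list of column indices, the feature subset B of C
    delta : neighborhood radius, delta > 0
    Returns one index array per instance."""
    XB = X[:, B]
    # pairwise Euclidean distances Delta_B(x_i, x_j)
    dists = np.linalg.norm(XB[:, None, :] - XB[None, :, :], axis=2)
    return [np.where(dists[i] <= delta)[0] for i in range(X.shape[0])]

# toy usage on random data
rng = np.random.default_rng(0)
X = rng.random((6, 4))
neighborhoods = delta_neighborhood(X, B=[0, 2], delta=0.3)
```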

The neighborhood information particle clusters \(\left\{ \delta \left( x_{i}\right) \mid i=1,2, \ldots , n\right\}\) induce a neighborhood relation N on the universal space U. This relation can be represented by a relation matrix \(M(N)=\left( r_{i j}\right) _{n \times n}\): if \(x_{j} \in \delta \left( x_{i}\right)\), then \(r_{i j}=1\); otherwise, \(r_{i j}=0\). For neighborhood relations generated by two radii \(\delta _{1} \le \delta _{2}\), we have

  (1) \(\forall x_{i} \in U{:}\,\delta _{1}\left( x_{i}\right) \subseteq \delta _{2}\left( x_{i}\right)\);

  (2) \(N_{1} \subseteq N_{2}\).

The neighborhood information particle clusters defined in this manner constitute the basic concept system in the universal space.

Definition 3

Given a non-empty finite set \(U=\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\}\) on the actual space and a neighborhood relation N on U, we call the two-tuple \(N A S=\langle U, N\rangle\) a neighborhood approximation space.

Definition 4

For a given decision system \(N D T=\langle U, C, D\rangle\) and \(X \subseteq U\), the lower approximation and upper approximation of X in the neighborhood approximation space \(N A S=\langle U, N\rangle\) are defined as [49]

  (1) \({\underline{N}} X=\left\{ x_{i} \mid \delta \left( x_{i}\right) \subseteq X, x_{i} \in U\right\}\)

  (2) \({\bar{N}} X=\left\{ x_{i} \mid \delta \left( x_{i}\right) \cap X \ne \emptyset , x_{i} \in U\right\}\)

respectively, where \({\underline{N}} X\) is also referred to as the positive domain of X in the approximation space \(N A S=\langle U, N\rangle\), which is the largest union of neighborhood information particles that can be completely contained in X.

Definition 5

For a neighborhood decision system \(N D T=\langle U, A, D, \delta \rangle\), D partitions U into N equivalence classes \(X_{1}, X_{2}, \ldots , X_{N}\). For \(\forall B \subseteq A\), we define the upper and lower approximations of the decision attribute D with respect to B as

$$\begin{aligned} \underline{N_{B}} D=\bigcup _{i=1}^{N} \underline{N_{B}} X_{i} \end{aligned}$$
(2)
$$\begin{aligned} \overline{N_{B}} D=\bigcup _{i=1}^{N} \overline{N_{B}} X_{i} \end{aligned}$$
(3)

respectively [50], where \(\delta _{B}\left( x_{i}\right)\) is the neighborhood information particle generated by attribute subset B and metric \(\Delta\).

The lower approximation of decision attribute D, also called the decision-positive region, is denoted by \({\text {POS}}(D)\).

The size of the positive region reflects the degree to which the classification problem is separable in the given attribute space: a larger positive region indicates less overlap between the categories (i.e., a smaller boundary region), and hence a classification problem that is better described by the given set of attributes. Formally,

$$\begin{aligned} {\text {POS}}(D)=\left\{ x_{i} \mid \delta _{B}\left( x_{i}\right) \subseteq D, x_{i} \in U\right\} \end{aligned}$$
(4)

Definition 6

Suppose that A and B are two sets; we define the degree to which A is contained in B, I(A, B), as follows [51].

$$\begin{aligned} I(A, B)=\frac{{\text {Card}}(A \cap B)}{{\text {Card}}(A)} \end{aligned}$$
(5)

When \(A=\emptyset\) or \(B=\emptyset\), we define \(I(A, B)=0\). \(I(A, B)\) reflects the importance of B to A.

The dependency of decision attribute D on condition attribute B is defined as follows [52]:

$$\begin{aligned} \gamma _{B}(D)={\text {Card}}\left( \underline{N_{B}} D\right) / {\text {Card}}(U) \end{aligned}$$
(6)

where \(\gamma _{B}(D)\) denotes the proportion of samples in the sample set whose decision class can be determined from the description of condition attribute B.

The positive region of the decision is larger if the decision attribute D is more dependent on the condition attribute B.

3 Proposed method

3.1 Improvements to neighborhood particles based on average nearest neighbors

Given a decision system \(N D T=\langle U, C, D\rangle\), \(U=\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\}\) represents a non-empty set of instances, C represents the feature set corresponding to the instance set, and D represents the decision attribute set. The traditional single-label method for dividing neighborhood information particles is unsuitable for multi-label data. For general data, a group of instances with the same attribute value or label value is called an equivalence class. Similarly, for mixed data, a group of instances with similar attribute values or label values is called a neighborhood class. In this paper, the sample margin is used to determine the neighborhood granularity.

Definition 7

Given a sample x, the margin of x relative to a set of samples U is defined as follows:

$$\begin{aligned} m(x)=\Delta (x, N S(x))-\Delta (x, N T(x)) \end{aligned}$$
(7)

where NS(x) denotes the instance from U that has the shortest distance to x and whose label class is different from that of x, and NT(x) denotes the instance from U that has the shortest distance to x and has the same label class as x; we call these instances the nearest miss and the nearest hit, respectively. \(\Delta (x, N S(x))\) denotes the distance between x and NS(x), and \(\Delta (x, N T(x))\) denotes the distance between x and NT(x). We call \(\delta (x)=\{y \mid \Delta (x, y) \le m(x)\}\) the neighborhood particle of x. To facilitate the setting of neighborhood information particles, we set \(m(x)=0\) when \(m(x)<0\).
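The following sketch (our own code, not the authors' implementation; Euclidean distance is assumed, and both label classes are assumed to be present in U) computes the margin of Eq. (7) for a single binary label:

```python
import numpy as np

def margin(X, y, i):
    """m(x_i) = Delta(x_i, NS(x_i)) - Delta(x_i, NT(x_i)) for one binary label y in {-1, +1}."""
    d = np.linalg.norm(X - X[i], axis=1)     # distances from x_i to every instance
    hit_mask = (y == y[i])
    hit_mask[i] = False                      # the nearest hit NT(x_i) excludes x_i itself
    miss_mask = (y != y[i])                  # candidates for the nearest miss NS(x_i)
    m = d[miss_mask].min() - d[hit_mask].min()
    return max(m, 0.0)                       # set m(x) = 0 when m(x) < 0
```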

A sample may be a positive instance for some labels and a negative instance for others. Thus, for a given sample, the degree of granularity may depend on the label considered.

Definition 8

For a sample x and label \(l_{k} \in L\), the margin of x with respect to \(l_{k}\) is

$$\begin{aligned} m_{l_{k}}(x)=\Delta _{l_{k}}\left( x, N S_{l_{k}}(x)\right) -\Delta _{l_{k}}\left( x, N T_{l_{k}}(x)\right) , l_{k} \in L \end{aligned}$$
(8)

As noted above, each label of a sample yields a different margin and, correspondingly, a different granularity. Depending on the decision view adopted, we need to combine all the single-label granularities of a given sample into a multi-label granularity [53]. Therefore, in this paper, we choose the average granularity (i.e., the average nearest neighbor, also known as the neutral view) to represent the multi-label granularity of a sample [54].

$$\begin{aligned} m^{n e u}(x)=\frac{1}{L} \sum _{k=1}^{L} m_{l_{k}}(x) \end{aligned}$$
(9)

To solve the problem of the granularity selection of \(\delta\), combining Eqs. 1 and 9, the new neighborhood of the sample is defined as

$$\begin{aligned} \delta _{B}\left( x_{i}\right) =\left\{ x_{j} \mid x_{j} \in U, \Delta _{B}\left( x_{i}, x_{j}\right) \le m^{n e u}\left( x_{i}\right) \right\} \end{aligned}$$
(10)

We have thus defined a new neighborhood information particle that solves the problem of selecting the neighborhood granularity for multi-label data. In addition, the average nearest neighbor reflects the relationships between the features of an instance, so the new neighborhood model incorporates these feature relationships into the improved neighborhood information.
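Building on the single-label margin sketch above (again our own illustration; here the margins are computed over the full feature space, which is one possible reading of Eq. 8), the per-label margins are averaged into the neutral-view granularity of Eq. 9, which then replaces the fixed radius δ of Eq. 10:

```python
import numpy as np

def neutral_margin(X, Y, i):
    """m^neu(x_i): average of the per-label margins m_{l_k}(x_i)  (Eq. 9).
    Y is (n, k) with entries in {-1, +1}; margin() is the function sketched above."""
    return np.mean([margin(X, Y[:, k], i) for k in range(Y.shape[1])])

def adaptive_neighborhood(X, Y, B):
    """delta_B(x_i) with the data-driven radius m^neu(x_i)  (Eq. 10)."""
    XB = X[:, B]
    dists = np.linalg.norm(XB[:, None, :] - XB[None, :, :], axis=2)
    return [np.where(dists[i] <= neutral_margin(X, Y, i))[0]
            for i in range(X.shape[0])]
```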

3.2 Label correlation

Definition 9

In the neighborhood decision system \(N D S=\langle U, C \cup D, \delta \rangle\), for any instance \(x_{i}, y_{i}=\left\{ y_{i}^{1}, y_{i}^{2}, \ldots , y_{i}^{m}\right\}\) is its corresponding label vector, and \(l_{j}\) is a label in the label space \(L=D\). When \(x_{i}\) belongs to category \(l_{j}\), the corresponding value of \(y_{i}^{j}\) is 1. We define \(D^{j}=\left\{ x_{i} \mid \forall x_{i} \in U, y_{i}^{j}=1\right\}\), that is, the set of all instances in U that belong to category \(l_{j}\). Through the definition of multi-label decision space, we can expand the decision-positive region of single-label decision making, using Eq. 4. For a certain feature subset \(B \subseteq C\), the lower approximation of the decision \(l_{j}\) about B is

$$\begin{aligned} {\text {POS}}\left( D^{j}\right) ^{\prime }=\underline{N_{B}} D^{j^{\prime }}=\left\{ x_{i} \mid \delta _{B}\left( x_{i}\right) \subseteq D^{j}, x_{i} \in U\right\} \end{aligned}$$
(11)

Multi-label data differs from single-label data in that it is necessary to consider the importance of the labels and the correlation between them because the labels of each instance are always somehow related.

Definition 10

For a sample \(x_{i}\) and its corresponding label vector \(Y_{i}\), that is, \(D=\left\{ \left( x_{i}, Y_{i}\right) \mid 1 \le i \le N, x_{i} \in U, Y_{i} \subseteq L\right\}\), where N is the number of instances in the training set, let \(l_{i}, l_{j} \in L(1 \le i, j \le k)\) be any two labels in the label space L. The correlation between \(l_{i}\) and \(l_{j}\) is calculated by mutual information:

$$\begin{aligned} M I\left( l_{i}, l_{j}\right) =\sum _{k=1}^{M} \sum _{q=1}^{M} P\left( l_{i k}, l_{j q}\right) \log \frac{P\left( l_{i k}, l_{j q}\right) }{P\left( l_{i k}\right) P\left( l_{j q}\right) } \end{aligned}$$
(12)

A weighted undirected graph \(\textrm{WUG}=(\textrm{V}, \textrm{E}, \textrm{W})\) can be constructed by applying Eq. 12. \(V=L=\left\{ l_{1}, l_{2}, \ldots , l_{m}\right\}\) represents the set of nodes of the undirected graph, \(E=\left\{ \left( l_{i}, l_{j}\right) \mid l_{i}, l_{j} \in L\right\}\) represents its set of edges, and \(w\left( l_{i}, l_{j}\right) =M I\left( l_{i}, l_{j}\right)\) represents the weight of each edge [55]. The importance of each node in this undirected graph is defined as follows:

$$\begin{aligned} L W\left( l_{i}\right) =(1-d)+d \sum _{l_{j} \in S N\left( l_{i}\right) } \frac{L W\left( l_{j}\right) w\left( l_{i}, l_{j}\right) }{S W\left( l_{j}\right) } \end{aligned}$$
(13)
$$\begin{aligned} S W\left( l_{j}\right) =\sum _{l_{i} \in S N\left( l_{j}\right) } w\left( l_{i}, l_{j}\right) \end{aligned}$$
(14)

\(L W\left( l_{i}\right)\) and \(L W\left( l_{j}\right)\) represent the weights of nodes \(l_{i}\) and \(l_{j}\), respectively. \(S N\left( l_{i}\right)\) is the set of nodes with edges to label \(l_{i}\), and \(w\left( l_{i}, l_{j}\right) =M I\left( l_{i}, l_{j}\right)\) represents the correlation between nodes. Equation 14 is used to calculate \(S W\left( l_{j}\right)\), which denotes the sum of the correlations over all edges incident to \(l_{j}\). d is the damping coefficient, for which it is recommended to use \(d=0.85\). For ease of calculation, an initial weight value can be set for all nodes; this is usually 1/L, where L is the total number of nodes, that is, the total number of labels [56]. Using this algorithm, we can calculate the correlation between node \(l_{i}\) (i.e., label \(l_{i}\)) and the other nodes \(l_{j}\) related to it, as determined by the structure of the graph (WUG). Through label correlation, we obtain the weight of each label in the label space and thereby complete the exploration of label correlation.
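A compact sketch of Eqs. 12–14 (our own reading, not the published implementation; binary labels and a fixed number of iterations are assumptions) first builds the mutual-information weight matrix and then iterates the PageRank-style update of Eq. 13:

```python
import numpy as np

def label_weights(Y, d=0.85, n_iter=50):
    """Iterative label-weight computation in the spirit of Eqs. 12-14."""
    k = Y.shape[1]
    B = (Y == 1).astype(float)                  # n x k binary indicator of label membership

    def mutual_info(a, b):
        """MI between two binary label columns (Eq. 12)."""
        mi = 0.0
        for va in (0, 1):
            for vb in (0, 1):
                p_ab = np.mean((a == va) & (b == vb))
                p_a, p_b = np.mean(a == va), np.mean(b == vb)
                if p_ab > 0:
                    mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

    W = np.zeros((k, k))                        # edge weights w(l_i, l_j)
    for i in range(k):
        for j in range(i + 1, k):
            W[i, j] = W[j, i] = mutual_info(B[:, i], B[:, j])

    SW = W.sum(axis=1) + 1e-12                  # SW(l_j): total edge weight of node l_j (Eq. 14)
    LW = np.full(k, 1.0 / k)                    # initial weight 1/L for every node
    for _ in range(n_iter):                     # Eq. 13, PageRank-style update with damping d
        LW = (1 - d) + d * (W @ (LW / SW))
    return LW
```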

3.3 Feature selection based on neighborhood rough sets

The multi-label neighborhood decision system \(N D S=\langle U, C \cup D, \delta \rangle\) is handled in a similar manner to the single-label decision system. By extending the rough set importance theory to multi-label data (Eq. 6) and combining the multi-label neighborhood particles (Eq. 10) and the label correlations (Eq. 13), we obtain the importance of the feature subset B \((B \subseteq C)\) for the decision attribute set \(D=L=\left\{ l_{1}, l_{2}, \ldots , l_{m}\right\}\):

$$\begin{aligned} \gamma _{B}(D)^{\prime }=\sum _{l_{j} \in L} \frac{{\text {Card}}\left( {\text {POS}}\left( D^{j}\right) ^{\prime }\right) L W\left( l_{j}\right) }{{\text {Card}}(U)} \end{aligned}$$
(15)

The above equation measures the importance of the feature subset B through the decision-positive regions of the corresponding decision attributes, weighted by the label weights. It addresses both the granularity selection problem and the label correlation problem for multi-label NRS.

According to Eq. 15, in the neighborhood decision system \(N D S=\langle U, C \cup D, \delta \rangle , B \subseteq\) C is a feature subset, \(a \in C-B\), and the degree of importance of a to B is defined as follows:

$$\begin{aligned} {\text {SIG}}(a, B, D)^{\prime }=\gamma _{B \cup a}(D)^{\prime }-\gamma _{B}(D)^{\prime } \end{aligned}$$
(16)

In the new importance model, we have added label importance and label relevance to the NRS model. The new NRS model reflects the fusion of feature information and label correlation.
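Wiring the previous sketches together (again our own code, under the same assumptions), the label-weighted dependency of Eq. 15 and the significance measure of Eq. 16 can be written as:

```python
import numpy as np

def dependency(X, Y, B, LW):
    """gamma'_B(D): label-weighted dependency of the label set on feature subset B (Eq. 15)."""
    if not B:                                     # empty subset: no neighborhood, no dependency
        return 0.0
    nbrs = adaptive_neighborhood(X, Y, B)         # Eq. 10, from the sketch above
    n, k = Y.shape
    gamma = 0.0
    for j in range(k):
        Dj = set(np.where(Y[:, j] == 1)[0])       # D^j: instances carrying label l_j
        pos = sum(1 for i in range(n) if set(nbrs[i]) <= Dj)   # |POS(D^j)'|  (Eq. 11)
        gamma += pos * LW[j] / n
    return gamma

def significance(X, Y, B, a, LW):
    """SIG(a, B, D)' = gamma'_{B + {a}}(D) - gamma'_B(D)  (Eq. 16)."""
    return dependency(X, Y, B + [a], LW) - dependency(X, Y, B, LW)
```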

For the above NRS model, we construct a greedy forward search-based multi-label feature selection algorithm. To illustrate the proposed algorithm more clearly, the framework of the algorithm is presented in Fig. 1.

Fig. 1

Framework of static multi-label feature selection

According to the framework shown in Fig. 1, our proposed forward greedy [57] multi-label feature selection algorithm proceeds as follows. The final reduct is the best feature subset selected from the feature space.

Algorithm 1

Static Multi-label Feature Selection Algorithm Based on Label Importance and Label Correlation (SMFS-LILC).

In Algorithm 1, steps 1–5 perform the preparation work when the multi-label data arrive. The reduct set reduct starts from the empty set, and the label weights LW(L) of the entire label space are calculated. This step requires traversing the entire label space and constructing the weighted undirected graph. Assuming that the number of labels in the label space is \(\mid L\mid\), the time complexity of calculating the correlation between each pair of labels is \(\textrm{O}\left( \mid L\mid ^{2}\right)\), and that of calculating each label weight is O(1). Therefore, the time complexity of steps 1–5 is \(O\left( \mid L\mid ^{2}+1\right) =O\left( \mid L\mid ^{2}\right)\). Steps 6–21 are divided into two parts: calculating the neighborhood of each instance and analyzing whether the instance and its neighborhood are important. First, the average nearest neighbor is selected as the neighborhood granularity standard (steps 6–12); this requires searching for the nearest hit and nearest miss of each instance, so, with instance space U, the time complexity of this step is \(\textrm{O}\left( \mid U\mid ^{2}\right)\). Next, the neighborhood corresponding to each instance is determined, and the decision-positive region and attribute importance are calculated (steps 13–21). The time complexity of determining the neighborhood of each instance is \(\textrm{O}(\mid U\mid \log \mid U\mid )\), and the time complexities of calculating the decision-positive region and the importance are both O(1), so the overall time complexity of the neighborhood computation is \(\textrm{O}\left( \mid U\mid ^{2}+\mid U\mid \log \mid U\mid +1+1\right) =\textrm{O}\left( \mid U\mid ^{2}\right)\). The time complexity of checking whether the samples in an instance neighborhood are consistent is \(\textrm{O}(\mid U\mid )\) and, with \(\mid C\mid\) features in the feature space, the time complexity of steps 6–21 is \(\textrm{O}\left( \mid C\mid \mid U\mid ^{2}\right)\). Therefore, the time complexity of Algorithm 1 is \(\textrm{O}\left( \mid L\mid ^{2}+\mid C\mid \mid U\mid ^{2}\right)\).
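For reference, the forward greedy loop behind Algorithm 1 can be sketched as follows (our paraphrase of the procedure described above, not the authors' pseudocode; the stopping tolerance eps is an assumption):

```python
import numpy as np

def smfs_lilc(X, Y, LW, eps=1e-6):
    """Greedy forward multi-label feature selection in the spirit of Algorithm 1."""
    reduct, remaining = [], list(range(X.shape[1]))
    while remaining:
        # significance of each candidate feature with respect to the current reduct (Eq. 16)
        sigs = [significance(X, Y, reduct, a, LW) for a in remaining]
        best = int(np.argmax(sigs))
        if sigs[best] <= eps:        # no candidate improves the dependency: stop
            break
        reduct.append(remaining.pop(best))
    return reduct
```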

3.4 Dynamic multi-label feature selection algorithm based on label importance and label correlation

Algorithm 1, similarly to most feature selection algorithms, assumes that all candidate features are available to the algorithm before feature selection. In contrast, with flow features, all features cannot be collected before learning starts because they arrive, dynamically and incrementally, over time. Therefore, we propose an online multi-label flow feature selection algorithm based on Algorithm 1 combined with the online flow feature selection framework [51], to solve the multi-label flow feature selection problem.

In the multi-label flow feature decision system \(N F D S=\langle U, C \cup L, t\rangle\), \(U=\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\}\) represents a non-empty set of samples, C represents the feature set corresponding to the samples, L represents the label set, and t represents the arrival time of the flow features. \(F_{t}\) denotes the feature that arrives at time t, and \(S_{t-1}\) denotes the feature subset (reduct) selected before time t.

3.4.1 Importance analysis

For a newly arrived feature \(F_{t}\), the first step is to perform importance analysis on \(F_{t}\). The purpose of importance analysis is to evaluate whether \(F_{t}\) is beneficial to the label set L, that is, to evaluate the importance of \(F_{t}\) to the whole label set L. We define a parameter \(\delta\) to assess the importance of \(F_{t}\) and use Eq. 15 to calculate the importance \(\gamma _{F_{t}}(D)^{\prime }\) of \(F_{t}\) to the entire label set. If \(\gamma _{F_{t}}(D)^{\prime }<\delta\), we consider \(F_{t}\) to be unimportant to the label set L, so \(F_{t}\) is discarded.

3.4.2 Significance analysis

After the above importance analysis, we believe that \(F_{t}\) is important to the label set L. However, we also need to consider the relationship between \(F_{t}\) and the current reduced feature set \(S_{t-1}\). The purpose of significance analysis is to evaluate the relevance of the newly arrived feature \(F_{t}\) to the feature subset at time t, that is, to check whether \(F_{t}\) is significant with respect to the current feature subset. The importance of \(F_{t}\) is compared with the average importance \(\textrm{Avg}_{\gamma }\) of the features in the current feature subset \(S_{t-1}\).

Here we use the iterative method to calculate the average \(\textrm{Avg}_{\gamma }\):

$$\begin{aligned} \textrm{Avg}_{\gamma }=\textrm{Avg}_{\gamma -1}+\frac{\gamma _{F_{i}}(D)^{\prime }-\textrm{Avg}_{\gamma -1}}{\mid F_{i}\mid } \end{aligned}$$
(17)

where \(\textrm{Avg}_{1}=\gamma _{F_{1}}(D)^{\prime }\).

If \(\gamma _{F_{t}}(D)^{\prime } \ge \textrm{Avg}_{\gamma }\), the importance of the new feature \(F_{t}\) to the label set L is greater than or equal to the average importance of the already achieved features in \(S_{t-1}\). Therefore, we consider \(F_{t}\) to be a significant feature, which should be preserved.

3.4.3 Redundancy analysis

After the above significance analysis, we know that \(F_{t}\) is beneficial to the current subset \(S_{t-1}\). However, we also need to analyze the relationship between \(F_{t}\) and the individual features in \(S_{t-1}\). The purpose of redundancy analysis is to compare the contribution of \(F_{t}\) with that of each feature \(F_{k}\) in the current reduct \(S_{t-1}\). When the contributions of two features are the same, they are redundant, and one of them must be discarded.

For two features \(F_{t}\) and \(F_{k}\), if \({\text {SIG}}\left( F_{t}, S_{t-1}, D\right) ^{\prime }={\text {SIG}}\left( F_{k}, S_{t-1}, D\right) ^{\prime }\), then \(F_{t}\) and \(F_{k}\) have the same degree of contribution to \(S_{t-1}\). Therefore, we compare \(\gamma _{F_{k}}(D)^{\prime }\) and \(\gamma _{F_{t}}(D)^{\prime }\). If \(\gamma _{F_{k}}(D)^{\prime } \ge \gamma _{F_{t}}(D)^{\prime }\), we preserve \(F_{k}\) and discard \(F_{t}\); if \(\gamma _{F_{k}}(D)^{\prime }<\gamma _{F_{t}}(D)^{\prime }\), we preserve \(F_{t}\) and discard \(F_{k}\).
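Putting the three analyses together, the online update for a single arriving feature might look like the sketch below (our own interpretation of Sects. 3.4.1–3.4.3, reusing the functions sketched earlier; the numerical tolerance in the equality test and the handling of the running average of Eq. 17 are assumptions):

```python
import numpy as np

def dmfs_lilc_update(X, Y, reduct, f_t, LW, delta, state):
    """Process one newly arrived feature f_t and return the updated reduct.
    `state` keeps the running average of Eq. 17: {"count": 0, "avg": 0.0} initially."""
    g_t = dependency(X, Y, [f_t], LW)          # importance of f_t for the whole label set

    # 1) importance analysis: discard f_t if it contributes too little to the labels
    if g_t < delta:
        return reduct

    # 2) significance analysis: compare with the running average Avg_gamma (Eq. 17)
    state["count"] += 1
    state["avg"] += (g_t - state["avg"]) / state["count"]
    if g_t < state["avg"]:
        return reduct

    # 3) redundancy analysis: if f_t and some f_k contribute equally, keep the stronger one
    for f_k in list(reduct):
        rest = [f for f in reduct if f != f_k]
        if np.isclose(significance(X, Y, rest, f_t, LW),
                      significance(X, Y, rest, f_k, LW)):
            if dependency(X, Y, [f_k], LW) >= g_t:
                return reduct                  # keep f_k, discard f_t
            reduct.remove(f_k)                 # keep f_t, discard f_k
    reduct.append(f_t)
    return reduct
```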

Fig. 2

Flow feature selection framework

The flow feature selection framework, illustrated in Fig. 2, is based on online importance analysis, significance analysis, and redundancy analysis. In this framework, a training set with known feature sizes is used to simulate flow features, and each flow feature is generated from the candidate feature set. In the framework shown in Fig. 2, we propose a dynamic multi-label feature selection algorithm that considers label importance and label correlation (Algorithm 2), which incorporates the above three types of analysis.

Algorithm 2

Dynamic Multi-label Feature Selection Algorithm Based on Label Importance and Label Correlation (DMFS-LILC).

The main computation performed by Algorithm 2 is the computation of dependencies between features. At time \(t\), \(S_{t-1}\) is the currently selected feature set. Algorithm 2 assesses whether the new feature \(F_{t}\), arriving at time t, needs to be retained and, if so, how. The entire process is an online selection problem that comprises three main parts: importance analysis, significance analysis, and redundancy analysis, which are marked in Algorithm 2. The dependency calculation is taken from Algorithm 1, and the time complexity of evaluating a single feature is \(\textrm{O}(\mid U\mid \log \mid U\mid )\). In the best case, online selection obtains the best subset immediately, so the time complexity is \(\textrm{O}\left( \mid L\mid ^{2}+ \mid L\mid \mid U\mid \log \mid U\mid \right)\). In most cases, however, the situation is not this favorable, and \(S_{t}\) must be updated online. Because the time complexity of the \(S_{t}\) update depends on the calculation of feature dependencies, in the worst case all selected features must be examined to process \(F_{t}\), and therefore the worst-case time complexity is \(\textrm{O}\left( \mid L\mid ^{2}+\mid S_{t-1}\mid \mid L\mid \mid U\mid \log \mid U\mid \right)\).

4 Experiment

4.1 Datasets and experimental design

To validate the performance of our proposed algorithms, we used nine benchmark datasets from various application domains as our experimental data [27, 58]. The Arts, Business, Computer, Health, and Scene datasets are all from Yahoo and are widely used for web text classification. The Birds dataset identifies classes of birds from recordings of their calls. It contains 645 sound samples, 260 features extracted from the sound recordings, and 20 labels (samples with no labels represent background noise). The Cal500 dataset is a dataset of 500 English songs. The Emotions dataset is also a music dataset, consisting of 593 music samples annotated with six emotion labels. The Yeast dataset is used to predict the functional classes of yeast genes and consists of 2417 samples, each representing a gene, with 14 functional class labels. Table 1 shows standard statistics for the nine multi-label datasets: the number of samples, number of features, number of labels, number of samples in the training set, number of samples in the test set, label cardinality, and label density.

Table 1 Dataset introduction

In our experiments, we compared our proposed algorithms with several multi-label feature selection algorithms, including MDDM, PMU, RF-ML, NRPS [59], and MFSF [60], all of which reflect the effectiveness of feature selection from different perspectives.

The experiments used five evaluation criteria, namely average precision (AP), ranking loss (RL), coverage (CV), one-error (OE), and Hamming loss (HL), to evaluate the performance of all multi-label feature selection algorithms [61]. These five criteria evaluate performance from different perspectives, and it is rare for a single algorithm to achieve the best performance with respect to all of them at the same time. Finally, the performance of all algorithms was evaluated using the MLKNN (\(K = 10\)) classifier [62].

Because each sample of the multi-label data corresponds to a set of labels, the evaluation method for multi-label data is more complicated than that for traditional single-label data. The set \(T=\{(x_i,y_i)\mid 1\le i\le N\}\) represents a given test set, where \(y_i\subseteq L\) is the correct label subset and \(Y_i^{\prime }\subseteq L\) represents the binary label vector predicted by the multi-label classification algorithm.

Average precision (AP): AP is the average fraction of labels ranked above a particular relevant label \(\gamma \in y_i\) that are themselves relevant. A larger value of AP corresponds to a better prediction performance of the classifier as a whole.

$$\begin{aligned} A P=\frac{1}{N} \sum _{i=1}^N \frac{1}{\mid y_i\mid } \sum _{\gamma \in y_i} \frac{\mid \left\{ \gamma ^{\prime } \in y_i{:}\, r_i\left( \gamma ^{\prime }\right) \le r_i(\gamma )\right\} \mid }{r_i(\gamma )} \end{aligned}$$
(18)

where \(r_i(\gamma )\) denotes the rank of label \(\gamma\) in the label ranking predicted by the learning algorithm for the sample \(x_i\).

Hamming loss (HL): HL measures the fraction of sample-label pairs that are misclassified.

$$\begin{aligned} H L=\frac{1}{N} \sum _{i=1}^N \frac{\mid Y_i^{\prime } \oplus y_i\mid }{M} \end{aligned}$$
(19)

where \(\oplus\) denotes the XOR operation and M is the number of labels; a smaller value of HL corresponds to a better result.

Ranking loss (RL): RL indicates how often irrelevant labels are ranked higher than relevant labels. It is the average fraction of pairs consisting of a relevant label and an irrelevant label for which the irrelevant label is ranked higher than the relevant one.

$$\begin{aligned} R L=\frac{1}{N} \sum _{i=1}^N \frac{1}{\mid y_i\mid \mid {\bar{y}}_i\mid } \mid \left\{ \left( \lambda _1, \lambda _2\right) \mid \lambda _1 \le \lambda _2,\left( \lambda _1, \lambda _2\right) \in y_i \times {\bar{y}}_i\right\} \mid \end{aligned}$$
(20)

where the comparison \(\lambda _1 \le \lambda _2\) is made on the real-valued outputs of the multi-label classifier for \(x_i\) on the corresponding labels, and \({\bar{y}}_i\) denotes the complement of \(y_i\) in L. A smaller value of RL corresponds to a better result.

Coverage (CV): \(\textrm{CV}\) evaluates how many steps are needed, on average, to traverse the list of labels in such a manner that all the ground-truth labels of the instance are covered.

$$\begin{aligned} C V=\frac{1}{N} \sum _{i=1}^N \max _{\lambda \in y_i} {\text {rank}}(\lambda )-1 \end{aligned}$$
(21)

where \({\text {rank}}(\lambda )\) denotes the rank of \(\lambda\). If \(\lambda _1>\lambda _2\), then \({\text {rank}}\left( \lambda _1\right) <{\text {rank}}\left( \lambda _2\right)\). A smaller value of \(\textrm{CV}\) corresponds to a better result.

One-error \((\textrm{OE})\): \(\textrm{OE}\) is the probability that the label ranked first in the output result does not belong to the actual label set.

$$\begin{aligned} O E=\frac{1}{N} \sum _{i=1}^N\left[ \left[ {\text {argmax}}_{l \in L} f\left( x_i, l\right) \right] \notin y_i\right] \end{aligned}$$
(22)

where \([\mid \pi \mid ]=\left\{ \begin{array}{l}1, \pi \text{ is } \text{ true } \\ 0, \pi \text{ is } \text{ false } \end{array}\right.\). A smaller value of \(\textrm{OE}\) corresponds to a better result.

Of these evaluation criteria, \(\textrm{AP}, \textrm{CV}, \textrm{OE}\), and \(\textrm{RL}\) focus on the label ranking performance of each instance, whereas \(\textrm{HL}\) focuses on the label set prediction performance of each instance.
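These criteria are available in common toolkits; the sketch below (our own example, using scikit-learn rather than the authors' implementation, with illustrative toy matrices) evaluates a set of real-valued label scores against ground-truth labels:

```python
import numpy as np
from sklearn.metrics import (hamming_loss, label_ranking_loss, coverage_error,
                             label_ranking_average_precision_score)

Y_true  = np.array([[1, 0, 1], [0, 1, 0]])               # ground-truth label matrix
Y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])   # real-valued classifier outputs
Y_pred  = (Y_score >= 0.5).astype(int)                   # thresholded binary predictions

ap = label_ranking_average_precision_score(Y_true, Y_score)  # AP, larger is better
hl = hamming_loss(Y_true, Y_pred)                             # HL, smaller is better
rl = label_ranking_loss(Y_true, Y_score)                      # RL, smaller is better
cv = coverage_error(Y_true, Y_score) - 1                      # CV as in Eq. 21 (ranks start at 1)
oe = np.mean([Y_true[i, np.argmax(Y_score[i])] == 0           # OE: top-ranked label not relevant
              for i in range(len(Y_true))])
```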

4.2 Experimental results

4.2.1 Evaluation of predictive performance of algorithms

We compared the two proposed algorithms, the static multi-label feature selection algorithm (SMFS-LILC) and the dynamic multi-label feature selection algorithm (DMFS-LILC), with MDDMproj, MDDMspc, PMU, RF-ML, NRPS, and MFSF with respect to predictive classification performance. The first four of these are widely used multi-label feature selection algorithms, and the last two are multi-label feature selection algorithms proposed in the past two years that combine NRS with flow features. To ensure comparable results, the features obtained by all algorithms were ranked, and the final feature subset of every algorithm contained the same number of features as the final feature subset of DMFS-LILC. Because all algorithms in the comparison use the results of feature selection as the result of feature ranking, we present in Tables 2, 3, 4, 5 and 6 the detailed experimental results for all algorithms on each classification dataset. Each evaluation criterion is labeled by “\(\downarrow\)” to mean “smaller is better” or “\(\uparrow\)” to mean “larger is better”. In addition, the best predictive classification performance with respect to each evaluation criterion is shown in bold, the second-best performance is \({\underline{\hbox {underlined}}}\), and the average performance of each algorithm is shown in italics.

Table 2 Comparison of average precision (\(\uparrow\)) of eight feature selection algorithms
Table 3 Comparison of ranking loss (\(\downarrow\)) of eight feature selection algorithms
Table 4 Comparison of coverage (\(\downarrow\)) of eight feature selection algorithms
Table 5 Comparison of one-error (\(\downarrow\)) of eight feature selection algorithms
Table 6 Comparison of Hamming loss (\(\downarrow\)) of eight feature selection algorithms

The experimental results, shown in Tables 2, 3, 4, 5 and 6, are as follows:

  (1) With respect to AP, DMFS-LILC outperformed the existing algorithms on seven of the nine datasets, whereas SMFS-LILC achieved second-best performance on five datasets. The two proposed algorithms achieved good performance on all the multi-label datasets in the experiment.

  (2) With respect to RL, OE, and HL, DMFS-LILC achieved the best performance on six multi-label datasets and second-best, or close to second-best, performance on the remaining datasets. With respect to RL and HL, even where DMFS-LILC was not the best, its predictive classification performance was very close to the optimal performance of the best existing algorithm. With respect to RL, SMFS-LILC achieved second-best performance on five datasets and optimal performance on two datasets. In particular, with respect to OE, SMFS-LILC achieved second-best performance on all datasets.

  (3) With respect to CV, DMFS-LILC outperformed all existing algorithms on at least five multi-label datasets. Although SMFS-LILC performed worse than MFSF, and DMFS-LILC performed worse than the existing algorithms on some datasets, the CV values achieved by DMFS-LILC and SMFS-LILC were not very different from those of the two better-performing existing algorithms, and DMFS-LILC still produced good results on the datasets on which it was not the best. In addition, SMFS-LILC achieved second-best performance on five datasets. In summary, with respect to CV, DMFS-LILC and SMFS-LILC did not perform significantly better than the existing algorithms.

  (4) In general, across all the criteria, the average classification performance of DMFS-LILC was significantly better than that of all existing algorithms, and SMFS-LILC was second best with respect to average performance. These experimental results show that DMFS-LILC and SMFS-LILC achieved better performance than the existing algorithms.

Because of differences in the data types and other aspects of the evaluation criteria, prediction performance is expected to vary. To clearly assess the differences between the algorithms, the prediction performance was normalized to [0.1, 0.5], following [63]. Figure 3 shows the stability indicators of the normalized AP, HL, RL, CV, and OE. Each corner of the spider graph in Fig. 3 represents a different dataset and each colored line represents a different algorithm.

Fig. 3

Spider web diagrams for stability analysis

If the area of the graph composed of lines of a specific color is large and its shape is similar to a regular nonagon, the performance and stability of the corresponding algorithm are good. A stability value of approximately 0.5 is considered to be a good value. From Fig. 3, the following observations can be made:

  (1) With respect to AP, DMFS-LILC achieved the best stability because its shape closely approximates a regular nonagon and has the largest enclosed area.

  (2) With respect to RL, OE, and HL, DMFS-LILC maintained stability on at least six datasets.

  (3) With respect to CV, the nonagons of DMFS-LILC and SMFS-LILC are similar to those of NRPS and MFSF. Therefore, their performance advantages over the existing algorithms are not as obvious as for other evaluation criteria.

  (4) For all the evaluation criteria, the shapes of DMFS-LILC and SMFS-LILC have areas that are larger than, or similar to, those of the existing algorithms, and they are closer to regular nonagons. In fact, a comprehensive analysis of the results indicates that the performance and stability of the SMFS-LILC algorithm are second best, whereas the stability of DMFS-LILC is optimal.

4.2.2 Statistical test

Because some experimental results are quite similar, statistical tests can be used to verify whether these results differ significantly. We used the Friedman test to systematically analyze the differences between the results of the algorithms in the comparison. This is a widely accepted method of statistically comparing the results of multiple algorithms for significant differences across many datasets [64]. The method is as follows. Given k algorithms and N multi-label datasets, \(R_{j}=\frac{1}{N} \sum _{i=1}^{N} r_{i}^{j}\) represents the average rank of the jth algorithm over all datasets, where \(r_{i}^{j}\) is the rank of algorithm j on the ith dataset. Under the null hypothesis (which assumes that the classification performance of all algorithms under each evaluation criterion is equal, that is, that the ranks of all algorithms are equal), the Friedman statistic is defined as

$$\begin{aligned} F_{F}=\frac{(N-1) \chi _{F}^{2}}{N(k-1)-\chi _{F}^{2}} \text{, } \text{ where } \chi _{F}^{2}=\frac{12 N}{k(k+1)}\left( \sum _{i=1}^{k} R_{i}^{2}-\frac{k(k+1)^{2}}{4}\right) \end{aligned}$$
(23)

where \(F_{F}\) follows an F-distribution with \((k-1)\) and \((k-1)(N-1)\) degrees of freedom. Table 7 summarizes the \(F_{F}\) value and the corresponding critical value for each evaluation criterion under the Friedman test [65].

Table 7 Friedman test (\(k=8, N=9\)): summary of the \(F_F\) value and critical value for each evaluation criterion at \(\alpha = 0.10\)

As shown in Table 7, the null hypothesis is clearly rejected for all evaluation criteria with a significance level of \(\alpha = 0.10\). Next, we used a post-hoc test to further determine the differences in the statistical performance of the various algorithms. Because our purpose was to compare the performance of the two proposed methods with that of the other algorithms, the Bonferroni–Dunn test was used [66]. The performance of two compared algorithms is considered to be significantly different if the distance between the average ranks of the two algorithms exceeds the following critical difference (CD).

$$\begin{aligned} C D_{\alpha }=q_{\alpha } \sqrt{\frac{k(k+1)}{6 N}}. \end{aligned}$$
(24)

For the Bonferroni–Dunn test, at a significance level of \(\alpha =0.10\), we have \(q_{\alpha }=\) 2.450, so we obtain \(C D_{\alpha }=2.8290\).
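As a quick numerical check of this value (Eq. 24 with the constants stated above):

```python
import math

k, N, q_alpha = 8, 9, 2.450              # algorithms, datasets, Bonferroni-Dunn critical value
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 4))                      # 2.829, the critical difference used in Fig. 4
```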

To visualize the relative performance of SMFS-LILC and DMFS-LILC compared with that of the other six algorithms, we plotted the CD for each evaluation criterion, with the average ranking of each compared algorithm on the axis. We consider the rightmost algorithm to be the best, so the lowest ranking on the axis is on the right. The CD plots for all evaluation criteria are shown in Fig. 4.

Fig. 4

Bonferroni-Dunn test of SMFS-LILC and DMFS-LILC in comparison with existing algorithms

From Fig. 4 we can observe the following:

  (1) SMFS-LILC and DMFS-LILC are significantly better than MDDMspc, MDDMproj, RF-ML, and MFSF with respect to all evaluation criteria. In particular, DMFS-LILC has obvious advantages compared with them.

  (2) SMFS-LILC is statistically superior to, or at least comparable to, MFSF and NRPS with respect to all evaluation criteria, and DMFS-LILC also shows significant advantages over those algorithms with respect to some criteria.

  (3) Although the classification performance of SMFS-LILC and MFSF is comparable, the average classification performance of DMFS-LILC in Tables 2, 3, 4, 5 and 6 is significantly better than that of the other algorithms in the comparison. In summary, DMFS-LILC has significantly stronger performance than the other algorithms.

5 Conclusion

In this paper, we propose an NRS model based on label importance and label correlation. We first define a new neighborhood particle by the average nearest neighbor method, to better relate the information between features of multi-label data and to solve the problem of neighborhood granularity caused by such data. The label weights are then obtained by calculating the mutual information between labels, and the new neighborhood lower approximation is combined with the label weights to obtain a new feature-subset importance model. On the basis of this model, we propose a new static forward greedy algorithm (SMFS-LILC) for multi-label feature selection. In addition, we propose a dynamic feature selection algorithm (DMFS-LILC), based on SMFS-LILC, that evaluates features arriving incrementally over time through importance analysis, significance analysis, and redundancy analysis, to solve the multi-label flow feature problem. Experimental results showed that our algorithms are competitive with existing commonly used algorithms. However, the time complexity of the proposed algorithms is relatively high compared with that of state-of-the-art multi-label feature selection methods. Therefore, in future work, we hope to reduce the computation time of the algorithms. Furthermore, the ideas of exploiting label importance and label correlation, and of relating features and labels through mutual information, can also be extended to feature selection for label distribution learning.