Introduction

Supervised learning encompasses multi-label [18], multi-criteria [14, 32], and multi-class settings. The goal of multi-label classification is to predict a set of labels from features. Taking news classification in natural language processing as an example, a message may carry a set of labels such as politics, economics, and taxation. The number of labels assigned to a sample is typically much smaller than the total number of possible labels, and it often varies from sample to sample. In some applications of multi-label learning, such as compound fault detection [19, 57], accurately detecting samples belonging to minority classes is crucial. Detecting samples corresponding to different labels, especially minority classes, has become a major challenge in imbalanced multi-label learning [45].

Two main approaches have been proposed to address class imbalance and effectively identify the different classes: algorithm-level [17] and data-level [28, 41]. By refining elements such as the loss function, the algorithm-level approach helps classifiers prioritize minority classes; however, because it is tied to specific classifiers, it is harder to assess across a variety of classifiers. The data-level approach uses methods such as downsampling and oversampling to balance the label distribution and works well with various classifiers. Creating synthetic data for multi-label datasets, however, presents several challenges.

We expand on three of these challenges below.

Generation ratio. The effectiveness of adjusting the sample generation fraction to strengthen small disjuncts was highlighted in earlier studies on oversampling [16]. The cluster-based method k-means SMOTE [10] has been successfully tested on imbalanced binary problems. It determines the generation ratio of samples from clustering information and performs interpolation within each cluster. When applied directly to multi-label learning, this method often yields an uneven distribution of minority-label clusters, which makes it harder to define the intra-cluster generation ratio.

Relative samples. The position of synthetic data in SMOTE is determined by the selection of relative samples. Classifiers are sensitive to generated noise samples [20], which raises the question of whether the synthetic data follow the spatial distribution of the minority classes.

Label-set-specific. While MLSOL [26] and MLSMOTE [6] use local data for label assignment, they rarely account for the complexity of the label set. To expand the distribution of the generated minority labels, the label set should be assigned based on the minority-relevant samples, and label revision for synthetic examples should increase the proportion of borderline samples.

To address the above challenges, we designed a synthetic data method, \(\textbf{M}\)ulti-\(\textbf{L}\)abel \(\textbf{C}\)luster \(\textbf{SMOTE}\)-\(\textbf{F}\)uzzy \(\textbf{R}\)ough \(\textbf{S}\)et \(\textbf{T}\)heory (MLCSMOTE-FRST). At the attribute level, we apply the principle of multi-label learning with label-specific features (LIFT) [54] to combine the distance and distribution of clusters in imbalanced multi-label oversampling. The generation ratio is determined by the average distance within the label-specific clusters and the imbalance ratio, which prioritizes placing generated samples within clusters with a higher imbalance ratio. At the label level, we utilize seed samples and intra-cluster samples to develop a label revision method based on fuzzy rough set theory: the positive samples within the cluster are used to assign labels to the synthetic data and strengthen the minority labels. Empirical evaluations across 12 multi-label datasets employing four classifiers underscore the effectiveness of our approach.

This paper makes the following contributions:

  • We propose a new approach to optimizing attribute-level and label-level synthetic data generation;

  • At the attribute level, we initially use MLCSMOTE to determine the proportion and position of generation utilizing label-specific clusters;

  • At the label level, we propose using fuzzy rough set theory (FRST) and intra-cluster samples to determine the label sets of the synthetic data. MLCSMOTE-FRST shows an improvement over other oversampling methods on 12 datasets.

The remainder of this paper is organized as follows. Section “Related work” discusses related work, including imbalanced multi-label learning, cluster SMOTE methods, and fuzzy rough set theory. Section “Proposed method” introduces our proposed method. Section “Experiments” discusses our experiments and analyzes the results. Section “Conclusion” presents our conclusions.

Related work

Imbalanced multi-label learning

An imbalanced class distribution is often encountered in multi-label datasets. There are both algorithm-level and data-level methods to effectively extract information from minority classes.

Algorithm-level methods focus on optimizing the weights of minority samples by improving multi-label classifiers. DB-Focal [49] optimizes long-tail image recognition by adjusting loss weights and reducing the suppression of negative labels during model training; it has been validated on publicly available image datasets such as COCO-MLT and VOC-MLT. Chen et al. [8] introduced the AdaBoost architecture to street scene recognition, assuming that minority samples progress to subsequent AdaBoost stages, where a class-weighted loss is added. Duan et al. [11] exploited the ability of an ensemble of classifier chains [38] to mine high-order label correlations and combined it with binary classifiers in a general method for dealing with imbalanced data. MLHC [12] used hierarchical clustering to transform the multi-label problem into a multi-class one to mitigate imbalance, improving upon previous algorithms that only considered random correlations. Data-level interventions foster class balance through downsampling or oversampling and are valued for their flexibility in improving the performance of most classifiers.

MLeNN [4] and MLTL [35] use the Tomek link method to remove majority samples that overlap with others. Roseberry et al. [39] incorporated downsampling into data-stream classification, enabling quick adaptation to changes by removing data. Oversampling algorithms extend the minority classes by generating synthetic samples, through either repetitive or SMOTE-based approaches. LP-ROS and ML-ROS [5] perform repetitive oversampling by redefining the minority labels, while REMEDIAL [7] clones samples and decouples the labels. MLSMOTE [6] and MLSOL [26] differ in that MLSOL utilizes local label inconsistencies to design the data-generation weights and reduce noise interference in SMOTE. EMLSOL [27] combines MLSOL with downsampling. LCOS [53] considers label correlations and assigns weights to synthetic instances based on their distance information. Introducing label correlation into multi-label imbalance problems is a current research trend, while the rational use of fuzzy rough sets has not yet been demonstrated.

Cluster-based SMOTE methods

SMOTE can effectively balance class counts by interpolating between samples. It has been successfully applied in settings such as binary classification [52], multi-class classification [33], and regression [2].

The basic steps of SMOTE include finding the relative instances around the seed instance using kNN and generating synthetic samples between them. After obtaining a predetermined number of synthetic instances, the classifier is trained using the newly balanced dataset. The SMOTE operation interpolates \(x_{relative}\), randomly selected from the neighborhood samples of \(x_{seed}\),

$$\begin{aligned} x_{sys} = x_{seed} + (1 - \lambda )(x_{relative} - x_{seed}), \quad x_{relative} \in kNN(x_{seed}), \end{aligned}$$
(1)

where \(\lambda \) is a random value ranging from 0 to 1. Effective acquisition or generation of features is crucial for pattern recognition and deep learning tasks, such as camera position estimation [30], genomic data analysis [21], and satellite reflectance image processing [50]. Some studies have investigated factors such as the position of SMOTE-generated features. Fernández et al. [16] divided the traditional SMOTE optimization directions into seven objectives, the main ones being the assignment of sample generation weights and the selection of relative instances. These directions are directly related to the generation results, as they reduce the impact of noisy samples and strengthen the weights of boundary samples.
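As a minimal illustration of Eq. (1), the sketch below interpolates one synthetic point per minority instance between the seed and a randomly chosen kNN neighbor; the function name and the use of scikit-learn's NearestNeighbors are our own illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_min, k=5, seed=0):
    """Generate one synthetic sample per minority instance via Eq. (1)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, neighbors in enumerate(idx):
        j = rng.choice(neighbors[1:])             # x_relative drawn from kNN(x_seed)
        lam = rng.random()                        # lambda in [0, 1]
        synthetic.append(X_min[i] + (1 - lam) * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0], [0.2, 0.9]])
print(smote_interpolate(X_min, k=2).shape)        # (4, 2)
```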

k-means SMOTE [10] determines the generation proportion of each cluster in a binary classification problem using the results of k-means clustering, and it restricts relative instances to samples within the same cluster. The underlying assumption is that the minority class may exhibit a large proportion of similar features across different samples. k-means is used to obtain dense minority clusters, each of which is assigned a generation proportion, and within-cluster relative instances are used for improvement.

Zhang et al. [59] applied cluster-based SMOTE for data augmentation on a fault diagnosis dataset. They noted that k-means clustering can be strongly affected by the outliers present in fault diagnosis applications and therefore replaced it with k-medoids. OEGFCM-SMOTE [15] likewise recognizes that k-means is sensitive to outlier and boundary instances and that manually selected hyperparameters strongly affect the results. It applies fuzzy c-means to discover non-convex clusters or clusters of different sizes, uses entropy to evaluate the quality of synthetic samples, and determines the optimal parameters via a nonlinear mixed optimization problem.

LR-SMOTE [24] combines data cleaning with cluster-SMOTE: it removes minority samples misclassified by an SVM if no other minority samples surround them, and then builds on k-means SMOTE. Besides using a classifier to filter samples, SMOTE-COF [31] uses a local-information center offset factor to remove noisy instances. For the processed dataset, samples are generated using the intra-cluster instances as the reference instances for SMOTE, which has the advantage of generating samples free from noise.

Fuzzy rough set application

In a fuzzy set [51], an element can belong to a set to a degree; in a rough set [34], imprecision is expressed by a boundary region. Algorithms related to fuzzy sets are widely used in practical applications, including dynamic event-triggered mechanisms [42], dual cyber attacks [58], and composite adaptive control [43]. Fuzzy rough set theory [13] deals with indiscernibility in data, measuring similarity between instances through a fuzzy relation. The positive region (POS) describes the degree of membership between samples \(x_{1}\) and \(x_{2}\),

$$\begin{aligned} POS(x_{1})=max_{x_{2} \in X}[q_{x_{1}}(x_{2})], \end{aligned}$$
(2)

where \(q_{x_{1}}(x_{2})\) measures the quality of \(x_{1}\) with respect to \(x_{2}\). In SMOTE applications, [47] proposed FRIPS-SMOTE-FRBPS, which is based on SMOTE and fuzzy rough set theory: to minimize the impact of noise samples on the classifier, majority and synthetic instances identified as noise by fuzzy rough set theory are removed. SMOTE-FRST-2T [36] assumes that some synthetic data may not be similar to the minority class; POS is defined as the similarity between the majority class and the synthetic samples, and SMOTE-based synthetic samples are filtered according to their POS with respect to the class center. In previous studies, the application of fuzzy rough set theory in SMOTE has mainly been limited to binary problems, where it is used to remove synthetic samples. In a multi-label problem, fuzzy rough set theory determines a membership degree C based on the similarity between attributes and labels. The upper and lower approximate membership degrees of an element \(x_{1}\) are, respectively, computed as

$$\begin{aligned} \overline{C}(x_{1})&= max_{x \in X}\mathcal {T}(R(x_{1},x),C(x))\end{aligned}$$
(3)
$$\begin{aligned} \underline{C}(x_{1})&= min_{x \in X}\mathcal {I}(R(x_{1},x),C(x)), \end{aligned}$$
(4)

where \(\mathcal {I}\) is a fuzzy implicator, a mapping \([0,1]^2 \rightarrow [0,1]\) that is decreasing in its first argument and increasing in its second, and \(\mathcal {T}\) is a fuzzy t-norm, a commutative and associative mapping \([0,1]^2 \rightarrow [0,1]\). To examine the relationship between samples and labels in local information, FRONEC [48] predicts labels by identifying the training instance with the highest relational quality,

$$\begin{aligned} POS(x_{1}) = max_{x_{2} \in X}\left[ min_{x_{3} \in X} \left[ \mathcal {I}(R(x_{3}, x_{1}), R_{d}(x_{2}, x_{3})) \right] \right] . \end{aligned}$$
(5)

To establish the association between \(x_{1}\) and the label set, Eq. (5) establishes \(R(x_{1}, x_{2})\) based on the association of the attributes. It determines the potential label set using \(x_{2}\) and \(R_{d}(x_{2}, x_{3})\).
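To make Eq. (5) concrete, the following sketch evaluates \(POS(x_{1})\) from precomputed similarity matrices; the Łukasiewicz implicator and the matrix layout are assumptions chosen for illustration rather than the exact choices made in FRONEC.

```python
import numpy as np

def lukasiewicz_implicator(a, b):
    """I(a, b) = min(1, 1 - a + b): decreasing in a, increasing in b."""
    return np.minimum(1.0, 1.0 - a + b)

def pos_fronec(r_to_x1, R_d):
    """Eq. (5): POS(x1) = max_{x2} min_{x3} I(R(x3, x1), R_d(x2, x3)).

    r_to_x1 : shape (n,), attribute similarity R(x3, x1) for every training sample x3
    R_d     : shape (n, n), label-set similarity R_d(x2, x3)
    """
    # inner[x2] = min over x3 of I(R(x3, x1), R_d(x2, x3))
    inner = lukasiewicz_implicator(r_to_x1[None, :], R_d).min(axis=1)
    return float(inner.max()), int(inner.argmax())   # quality and the best-matching x2

r_to_x1 = np.array([0.9, 0.4, 0.7])
R_d = np.array([[1.0, 0.5, 0.8],
                [0.5, 1.0, 0.3],
                [0.8, 0.3, 1.0]])
print(pos_fronec(r_to_x1, R_d))
```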

In research on the application of fuzzy set theory to classifiers, [22] applied fuzzy membership to calculate the output weights of an extreme learning machine, allowing the corresponding classes of training samples to be weighted. Enlve et al. [29] extended fuzzy set theory to antecedent learning and online active learning based on the classical Takagi-Sugeno model architecture. Further research is needed on the application of fuzzy rough set theory to synthetic samples in multi-label datasets.

Proposed method

MLCSMOTE-FRST has three steps: constructing the generation ratio related to label-specific clusters; determining the attributes of the synthetic samples using intra-cluster samples and SMOTE; and revising the label set using intra-cluster samples and fuzzy rough set theory.

We use the following notation. \(D=\{ \{x_{i},y_{i} \}, 1 \le i \le n\}\) is a multi-label dataset. The label space L has dimension q, i.e., \(L = \{ l_{1},\ldots ,l_{j},\ldots ,l_{q} \}\). \(Y_{j}\) is the j-th label dimension, and \(y_{ij}\) is the value of the i-th sample in the j-th label dimension.

Generation ratio via label-specific cluster

In contrast to clustering for binary classes, the long-tail label set is sparser in the global distribution. Using k-means clustering directly raises two problems: computing the imbalance ratio within a cluster over all possible label sets requires enumerating every label set, and satisfying the condition \(N_{min} > N_{max}\) of k-means SMOTE is difficult because the minority-label set within a cluster is small.

Inspired by LIFT, MLCSMOTE generates label-specific clusters, creating distinct clusters for each label. For each label \(Y_{j}\), the positive samples form \(P_{j}\) and the remaining samples form \(N_{j}\),

$$\begin{aligned} \begin{aligned} P_{j} = \{ x_{i} \vert (x_{i}, y_{ij}) \in D, y_{ij} = 1 \} \\ N_{j} = \{ x_{i} \vert (x_{i}, y_{ij}) \in D, y_{ij} \ne 1 \}. \end{aligned} \end{aligned}$$
(6)

A label belongs to the minority set \(Y_{min}\) when \(IR > 1\), where \(IR = \vert N_{j} \vert / \vert P_{j} \vert \). MLCSMOTE defines the number of label-specific clusters using \( \vert P_{j} \vert \) to fit the distribution of data under each label,

$$\begin{aligned} m_{j} = r \cdot \vert P_{j} \vert , \end{aligned}$$
(7)

where \(r \in [0,1] \) is a parameter controlling the number of clusters. Some methods use k-medoids [59] to mitigate the sensitivity of k-means to noisy samples. We instead apply k-means to better capture outlier instances: when an outlier is far from the distribution of \(P_{j}\), k-means tends to assign it to a separate cluster, driven by the change in mean value during expectation maximization.
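The sketch below builds \(P_{j}\) for each label and fits a per-label k-means model with \(m_{j}\) clusters as in Eqs. (6) and (7); rounding \(m_{j}\) to an integer, skipping labels with fewer than two positives, and the scikit-learn calls are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_specific_clusters(X, Y, r=0.1, random_state=0):
    """Fit one k-means model per label on its positive samples P_j (Eqs. 6-7)."""
    models = {}
    for j in range(Y.shape[1]):
        pos = np.flatnonzero(Y[:, j] == 1)               # P_j
        if len(pos) < 2:
            continue                                     # too few positives to cluster
        m_j = max(1, int(round(r * len(pos))))           # Eq. (7), rounded to an integer
        models[j] = KMeans(n_clusters=m_j, n_init=10,
                           random_state=random_state).fit(X[pos])
    return models

rng = np.random.default_rng(0)
X = rng.random((60, 4))
Y = (rng.random((60, 3)) < 0.2).astype(int)              # sparse, imbalanced toy labels
models = label_specific_clusters(X, Y, r=0.2)
print({j: m.n_clusters for j, m in models.items()})
```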

By clustering \(P_{j}\), we obtain two properties for each cluster: the number of samples from \(P_{j}\) and the average distance \(d_{j}\) of the positive instances to the center. In multi-label imbalance learning, we also need the cluster information corresponding to \(N_{j}\). The trained k-means model \(f_{j}\) determines the cluster to which each sample in \(N_{j}\) belongs, where

$$\begin{aligned} CL_{i,j} = f_{j}(x_{i} ), x_{i} \in N_{j} \end{aligned}$$
(8)

represents the cluster to which \(x_{i}\) belongs under the j-th label. Based on this result, MLCSMOTE counts the samples of \(N_{j}\) within cluster \(CL_{i,j}\) and then computes the score \(s_{j}^{m}\) for the m-th cluster of the j-th label,

$$\begin{aligned} s_{j}^{m} = d_{j}^{m} \cdot \frac{ \vert CL_{P_{j},j} \in m \vert }{ \vert CL_{N_{j},j} \in m \vert }, \end{aligned}$$
(9)

where \(d_{j}^{m}\) is the average distance in the j-th label and m-th cluster. Equation (9) captures two aspects. The first is cluster density: a larger value of \(d_{j}^{m}\) indicates a lower concentration of the minority and a sparser cluster. Compared with dense clusters, which are better suited to neighborhood-based classifiers [55], increasing the weight of sparse clusters helps the classifier focus on this type of sample. The second is the imbalance ratio within a cluster: MLCSMOTE uses the cluster predictions for \(N_{j}\) to describe the IR of each cluster. The larger the IR, the more the cluster tends to belong to a safe or boundary area. When the IR is extremely low, the generation weight of the cluster decreases; in clusters with only one positive instance, the weight is zero.

It is necessary to normalize \(s_{j}^{m}\). When \(x_{i}\) belongs to the m-th cluster of label j, \(s_{j}^{m}\) is assigned to \(x_{i}\). When label j does not belong to \(Y_{min}\), it is excluded from Eq. (9) and from the assignment to the corresponding x. In SMOTE operations, weights are required to fine-tune the generation weight of each minority instance, and MLCSMOTE aims to further emphasize the generation weights of extreme minority classes. A straightforward way to obtain the weight of instance \(x_{i}\) is to use the IR of the different labels,

$$\begin{aligned} w_{i} = \sum _{j=1}^{q} \left[ s_{j}^{m} \cdot \frac{IR_{j}}{ \sum _{j'=1}^{q} IR_{j'}} \right] . \end{aligned}$$
(10)

MLCSMOTE uses only the minority labels to build \(w_{i}\), reducing the interference of the IR from majority labels. Clearly, there is no need to assign generation weights to instances that carry only majority labels. Compared with oversampling methods such as MLSOL or EMLSOL, MLCSMOTE does not require specific parameters to define outlier, boundary, and safe instances.
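Putting Eqs. (8)-(10) together, a simplified sketch of the weight computation is given below: negatives are assigned to the label-specific clusters, each cluster receives the score \(s_{j}^{m}\), and the normalized scores are aggregated into per-instance weights \(w_{i}\) using each minority label's IR. The normalization of \(s_{j}^{m}\) and the guard against empty clusters are our assumptions, since the paper does not spell them out.

```python
import numpy as np

def generation_weights(X, Y, models, ir):
    """Per-instance generation weights w_i for minority labels (Eqs. 8-10).

    models : {label j: fitted KMeans on P_j},  ir : {label j: |N_j| / |P_j|}
    """
    w = np.zeros(X.shape[0])
    ir_min = {j: ir[j] for j in models if ir[j] > 1}          # minority labels only
    ir_sum = sum(ir_min.values()) or 1.0
    for j in ir_min:
        km = models[j]
        pos = np.flatnonzero(Y[:, j] == 1)
        neg = np.flatnonzero(Y[:, j] != 1)
        cl_pos = km.predict(X[pos])                           # clusters of P_j
        cl_neg = km.predict(X[neg])                           # Eq. (8) applied to N_j
        dist = np.linalg.norm(X[pos] - km.cluster_centers_[cl_pos], axis=1)
        s = np.zeros(km.n_clusters)
        for m in range(km.n_clusters):
            in_m = cl_pos == m
            d_jm = dist[in_m].mean() if in_m.any() else 0.0   # average distance d_j^m
            n_neg = max(1, int(np.sum(cl_neg == m)))          # guard against empty clusters
            s[m] = d_jm * in_m.sum() / n_neg                  # Eq. (9)
        if s.sum() > 0:
            s = s / s.sum()                                   # normalize s_j^m
        w[pos] += s[cl_pos] * ir_min[j] / ir_sum              # Eq. (10)
    return w

# Usage, continuing the label_specific_clusters sketch above:
# ir = {j: (Y[:, j] != 1).sum() / max(1, (Y[:, j] == 1).sum()) for j in range(Y.shape[1])}
# w = generation_weights(X, Y, models, ir)
```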

Attribute-level generation via SMOTE

In the previous stage, we obtained the weights for the generated samples. In binary classification, cluster-based SMOTE selects samples in the same cluster as relative instances to increase the number of synthetic samples; this also helps prevent the generation of noisy instances. For multi-label data, relative instance selection is an important step in the SMOTE operation, since majority instances can disturb the synthetic instances. We propose using samples in label-specific clusters in the SMOTE operation,

$$\begin{aligned} x_{sys} = x_{seed} + (1 - \lambda ) (x_{relative} - x_{seed}), \quad \forall j \; \exists (CL_{seed,j} = CL_{relative,j}), \end{aligned}$$
(11)

where \(x_{sys}\) is the synthetic feature vector, \(x_{seed}\) is the seed instance sampled according to Eq. (10), and \(x_{relative}\) in Eq. (11) is a neighboring, intra-cluster positive sample of \(x_{seed}\). We use kNN to obtain the set of neighbor instances of the minority sample \(x_{seed}\); neighbors that lie in the same label-specific cluster as \(x_{seed}\) are prioritized for the SMOTE operation. Instances that share minority-specific clusters are close to each other, so synthetic samples can be generated without unrelated labels. We illustrate this in Fig. 1.

Fig. 1  Attribute-level generation with label-specific cluster and distribution. First two dimensions behave as minority labels
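A sketch of the relative-sample selection in Eq. (11) is given below: neighbors that share the seed's label-specific cluster for every minority label of the seed are preferred, with a plain kNN neighbor as a fallback. The dictionary representation of cluster assignments and the fallback rule are our assumptions.

```python
import numpy as np

def intra_cluster_smote(x_seed, X_neighbors, seed_clusters, neighbor_clusters, seed=0):
    """Eq. (11): prefer relative samples sharing all of the seed's label-specific clusters.

    seed_clusters     : {minority label j: cluster id of the seed under label j}
    neighbor_clusters : list of such dicts, one per row of X_neighbors
    """
    rng = np.random.default_rng(seed)
    same = [i for i, cl in enumerate(neighbor_clusters)
            if all(cl.get(j) == m for j, m in seed_clusters.items())]
    pool = same if same else list(range(len(X_neighbors)))   # fallback: any kNN neighbor
    x_rel = X_neighbors[rng.choice(pool)]
    lam = rng.random()
    return x_seed + (1 - lam) * (x_rel - x_seed)

x_seed = np.array([0.1, 0.2])
X_neighbors = np.array([[0.2, 0.25], [0.9, 0.8]])
print(intra_cluster_smote(x_seed, X_neighbors,
                          seed_clusters={0: 1},
                          neighbor_clusters=[{0: 1}, {0: 3}]))
```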

Label revision via fuzzy rough set theory

We develop our label revision algorithm for synthetic samples using fuzzy rough set theory (FRST). FRST is applied to determine the relationship between synthetic instances and labels, and the labels of synthetic instances are revised according to quality measures.

R and \(R_{d}\) are the relations in Eq. (5). R establishes the relationship between synthetic samples and minority instances based on their attributes, while \(R_{d}\) uses label correlation to establish the correspondence between the minority and synthetic samples. We use the distribution \(p_{min}\) obtained from \(D_{min}\), the set of samples in minority classes, to build \(R_{d}\),

$$\begin{aligned} \alpha&= \frac{\sum _{j=1}^{ \vert D_{min} \vert } h(j, Y_{j})}{\sum _{j=1}^{ \vert D \vert } h(j, Y_{j})} \end{aligned}$$
(12)
$$\begin{aligned} a&= \sum _{j \in L_{1} \bigcap L_{2}}(1- (p_{j} - \alpha * p_{min,j})) \end{aligned}$$
(13)
$$\begin{aligned} b&= \sum _{j \in L \backslash (L_{1} \cup L_{2}) }(p_{j} - \alpha * p_{min,j}) \end{aligned}$$
(14)
$$\begin{aligned} R_{d}&= \frac{a + b}{ a + b + \frac{1}{2} \vert L_{1} \bigtriangleup L_{2} \vert }, \end{aligned}$$
(15)

where \(L_{1}\) and \(L_{2}\) are the label sets of two different samples, and j traverses all labels. \(R_{d}\) is the similarity relation over the label set. The \(\bigtriangleup \) operator constructs the symmetric difference between its two arguments, while \( \vert \cdot \vert \) measures the cardinality of the resulting set. \(p_{j}\) is the prior probability of the j-th label in the entire dataset, and \(p_{min,j}\) is calculated from \(D_{min}\).

The SMOTE operation in binary imbalance learning generates results that address the imbalanced distribution. FRST takes the minority label set into account to minimize the influence of the majority class on label assignment. Equation (12) describes the distribution of label occurrences in \(D_{min}\) relative to D. If a label is common to \(D_{min}\) and D, minority labels are present and the penalty for that label is reduced by \(\alpha \); labels with more occurrences in \(D_{min}\) are rewarded through the term \(-\alpha * p_{min,j}\).
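The following sketch computes the label-set similarity \(R_{d}\) of Eqs. (13)-(15) for two label sets, given the label priors p, the minority-restricted probabilities \(p_{min}\), and a precomputed \(\alpha \) from Eq. (12); treating \(\alpha \) as a scalar input and the toy values are assumptions for brevity.

```python
import numpy as np

def label_similarity(L1, L2, p, p_min, alpha, q):
    """R_d between two label sets (Eqs. 13-15).

    L1, L2 : sets of label indices present in the two samples
    p      : prior probability of each label in D
    p_min  : probability of each label within D_min
    alpha  : ratio from Eq. (12), passed in as a precomputed scalar
    """
    both = L1 & L2
    neither = set(range(q)) - (L1 | L2)
    sym_diff = L1 ^ L2                                       # symmetric difference
    a = sum(1 - (p[j] - alpha * p_min[j]) for j in both)     # Eq. (13)
    b = sum(p[j] - alpha * p_min[j] for j in neither)        # Eq. (14)
    return (a + b) / (a + b + 0.5 * len(sym_diff))           # Eq. (15)

p     = np.array([0.50, 0.10, 0.05, 0.40])
p_min = np.array([0.20, 0.40, 0.35, 0.05])
print(label_similarity({0, 1}, {1, 2}, p, p_min, alpha=0.3, q=4))
```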

Fig. 2  Flowchart of MLCSMOTE-FRST

It has been noted that noise is common in imbalanced learning [23] and especially affects the local information of samples belonging to extreme minority labels. We therefore apply a noise-tolerant computation in fuzzy rough set theory using R and \(R_{d}\). Ordered Weighted Average (OWA) [9] aggregation is used to calculate the upper and lower approximations of the membership degree between an instance and the labels. We leverage positive samples within clusters to compute the OWA and determine the consensus label set, as shown in Eq. (16). Querying the consensus label set of the samples within the cluster strengthens the relationship between the synthetic data and the minority, intra-cluster samples. The upper approximation in Eq. (17) represents the acceptance degree, while the lower approximation in Eq. (18) represents the rejection degree of the fuzzy set.

$$\begin{aligned} N_{seed}(x)&= \{ P_{j}(x) \vert j \in Y_{min}, CL_{seed,j} = CL_{neighbor,j}\} \end{aligned}$$
(16)
$$\begin{aligned} \overline{C}(x_{sys})&= OWA_{W_{U}}(\mathcal {T} (R(x_{3}, x_{sys}), R_{d}(x_{2}, x_{3})) \vert x_{3} \in N_{seed}(x) ) \end{aligned}$$
(17)
$$\begin{aligned} \underline{C}(x_{sys})&= OWA_{W_{L}}(\mathcal {I} (R(x_{3}, x_{sys}), R_{d}(x_{2}, x_{3})) \vert x_{3} \in N_{seed}(x) ), \end{aligned}$$
(18)

where \(N_{seed}(x)\) contains the k neighbors of the synthetic sample \(x_{sys}\) that carry minority labels. \(W_{U}\) and \(W_{L}\) are OWA weight vectors acting as soft maximum and soft minimum operators, respectively. For \(W_{L}\) we use exponential weights in the comparative experiment, and \(W_{U}\) is the reverse of \(W_{L}\). In FRST, the quality measure Q is computed from \(\overline{C}\) and \(\underline{C}\); labels whose scores exceed the average value are assigned to the synthetic sample. Synthetic samples also retain the labels of their interpolated seed instances to strengthen the borderline region. The process of MLCSMOTE-FRST is shown in Fig. 2.
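Below is a minimal sketch of the OWA aggregation in Eqs. (17) and (18), using the Łukasiewicz t-norm and implicator, exponentially decaying weights for the soft minimum, and their reverse for the soft maximum; averaging the upper and lower approximations into a single quality score is our assumption, since the paper only states that scores above the average determine the label.

```python
import numpy as np

def exp_weights(k, base=2.0):
    """Exponentially decaying OWA weight vector of length k, summing to 1."""
    w = base ** -np.arange(1, k + 1)
    return w / w.sum()

def owa(values, weights):
    """Ordered weighted average: weights applied to values sorted in descending order."""
    return float(np.dot(weights, np.sort(np.asarray(values))[::-1]))

def label_quality(R_to_sys, R_d_col):
    """Upper/lower approximations for one candidate label set (Eqs. 17-18).

    R_to_sys : R(x3, x_sys) for each intra-cluster sample x3 in N_seed(x)
    R_d_col  : R_d(x2, x3) between the candidate label set of x2 and each x3
    """
    t_norm = np.maximum(0.0, R_to_sys + R_d_col - 1.0)       # Lukasiewicz t-norm
    implic = np.minimum(1.0, 1.0 - R_to_sys + R_d_col)       # Lukasiewicz implicator
    k = len(R_to_sys)
    w_u = exp_weights(k)                                     # soft max: weight the largest values
    w_l = exp_weights(k)[::-1]                               # soft min: weight the smallest values
    upper = owa(t_norm, w_u)                                 # Eq. (17)
    lower = owa(implic, w_l)                                 # Eq. (18)
    return 0.5 * (upper + lower)                             # combined quality score Q (assumed)

print(label_quality(np.array([0.9, 0.6, 0.4]), np.array([0.8, 0.7, 0.2])))
```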

Computational complexity

The algorithm can be divided into four stages: calculating the distribution of samples for each label, performing k-means clustering on the minority classes, running the SMOTE operation to generate synthetic instances, and revising labels. Determining whether each label is a minority label, together with its corresponding IR, has complexity O(qn). The k-means algorithm has complexity O(knT), where T is the number of iterations and k is the number of clusters; with q labels, this becomes O(qknT). In the SMOTE step, finding the nearest neighbors of the minority instances has computational complexity \(O(n^{2}d+kn^{2})\). After obtaining the kNN results, the generation process has computational complexity \(O(pn(q+d))\), where p is the predetermined generation quantity.

The computational complexity of attribute generation is \(O(qn + qknT + n^{2}d+kn^{2} + pn(q+d))\). Usually, n is larger than q and d, so the main complexity of MLCSMOTE lies in kNN.

In the label revision process, the kNN results are already available from the previous step, and OWA has computational complexity \(O(k\log (k))\) due to the sorting step in its definition. With the OWA computed, FRST constructs the revised labels using \(N_{min}\) and the label set m, giving a computational complexity of \(O(N_{min}dmk\log (k))\) for FRST.

In terms of algorithmic complexity, an oversampling algorithm proceeds by first defining the minority samples, then generating samples with SMOTE, and finally applying the proposed label correction method. LP-ROS, ML-ROS, and REMEDIAL focus on defining minority samples: LP-ROS and REMEDIAL use the mean label-set size and SCUMBLE, respectively, to define the minority class and perform sample cloning, so these two steps have complexity of approximately \(O(qn+n)\). ML-ROS requires a generation proportion p for control, with complexity O(qnp). MLSMOTE and MLSOL additionally use kNN to find SMOTE samples; combined with the step of defining the minority samples, this gives \(O(qn+n^{2}d+kn^{2})\) for the former, while the latter additionally requires weight computation, for \(O(qn+n^{2}d+kn^{2}+nkq)\). In contrast to MLSMOTE, MLCSMOTE-FRST adds the computation of fuzzy rough sets.

The pseudocode is shown in Algorithm 1.

Algorithm 1  MLCSMOTE-FRST

Table 1 Details of the datasets

Experiments

Experimental framework

Twelve datasets provided by scikit-multilearn [44] were used to verify the effectiveness of MLCSMOTE-FRST. The split ratio between the training and testing sets ranges from 0.51 to 9, the feature dimension reaches 47,236, and the label dimension reaches 174. The datasets show significant differences in label distributions and dimensions, which allows us to validate our algorithm effectively.

Table 2 MLCSMOTE-FRST and reference methods on the RAkEL classifier

We also report MeanIR, CVIR [3] and MeanIMR, CVImR [25] for the training set to analyze the statistical changes after applying the data-generation methods. MeanIR and MeanIMR quantify the degree of imbalance: MeanIR is based on the ratio of the most frequent label's count to each label's count, while MeanIMR is the mean of the positive-to-negative ratios under each label. MeanIMR is more relevant to our study, as it accounts for both positive and negative samples. CVIR and CVImR are the coefficients of variation of the per-label ratios underlying MeanIR and MeanIMR, respectively, and describe the differences among label distributions. Smaller values of MeanIR and MeanIMR indicate a lower degree of imbalance, and smaller values of CVIR and CVImR indicate less variation in label counts. The dataset information is shown in Table 1.
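As an illustration, MeanIR and CVIR can be computed from a binary label matrix as below, following the standard definitions of [3]; MeanIMR and CVImR follow the same pattern with the per-label positive-to-negative ratio. The handling of labels without positive samples is our assumption.

```python
import numpy as np

def mean_ir_cvir(Y):
    """MeanIR and CVIR of a binary label matrix Y (samples x labels)."""
    counts = Y.sum(axis=0).astype(float)
    counts[counts == 0] = np.nan                 # ignore labels with no positive samples
    irlbl = np.nanmax(counts) / counts           # per-label IR w.r.t. the most frequent label
    mean_ir = np.nanmean(irlbl)
    cvir = np.nanstd(irlbl, ddof=1) / mean_ir    # coefficient of variation of the per-label IRs
    return mean_ir, cvir

rng = np.random.default_rng(0)
Y = (rng.random((200, 6)) < np.array([0.5, 0.3, 0.2, 0.1, 0.05, 0.02])).astype(int)
print(mean_ir_cvir(Y))
```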

Because MLCSMOTE-FRST focuses on minority labels, we use macro-F1 and micro-F1 to check minority-related performance. Macro-F1 is more sensitive to minorities because it treats all labels equally and averages the per-label F1-scores, while micro-F1 assesses the impact of synthetic samples by aggregating precision and recall over all samples. We also use example-based accuracy (ACC) and Hamming loss to measure the overall accuracy and the percentage of incorrectly predicted labels:

$$\begin{aligned} Macro-F1(\uparrow )&= \frac{1}{q}\sum _{j=1}^{q}(\mathrm {F1-score}_{j}) \end{aligned}$$
(19)
$$\begin{aligned} Micro-F1(\uparrow )&= \frac{TP}{TP + 0.5 * (FP + FN)} \end{aligned}$$
(20)
$$\begin{aligned} ACC(\uparrow )&= \frac{1}{n} \sum _{i=1}^{n} \frac{ \mid y_{i} \bigcap \hat{y}_{i} \mid }{\mid y_{i} \bigcup \hat{y}_{i} \mid } \end{aligned}$$
(21)
$$\begin{aligned} Hamming\;loss(\downarrow )&= \frac{1}{nq} \sum _{i=1}^{n} \sum _{j=1}^{q} h(y_{i,j} \ne \hat{y}_{i,j} ) \end{aligned}$$
(22)

where TP, FP, and FN are summed over all labels and samples, and y and \(\hat{y}\) are the true and predicted label values.
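The metrics of Eqs. (19)-(22) map directly onto scikit-learn calls, as sketched below with toy predictions; jaccard_score with samples averaging corresponds to the example-based accuracy of Eq. (21).

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

print("macro-F1    :", f1_score(y_true, y_pred, average="macro", zero_division=0))  # Eq. (19)
print("micro-F1    :", f1_score(y_true, y_pred, average="micro", zero_division=0))  # Eq. (20)
print("example ACC :", jaccard_score(y_true, y_pred, average="samples"))            # Eq. (21)
print("Hamming loss:", hamming_loss(y_true, y_pred))                                # Eq. (22)
```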

We demonstrate the impact of MLCSMOTE-FRST on various datasets and classifiers by comparison with other multi-label oversampling algorithms, divided into two groups: repetitive oversampling (LP-ROS, ML-ROS, REMEDIAL) and SMOTE-based oversampling (MLSMOTE, MLSOL), as described in section “Imbalanced multi-label learning”. We used four classifiers to validate the metrics on the generated data: the problem transformation methods binary relevance (BR) [56], random k-labelsets (RAkEL) [46], and classifier chains (CC) [37], and an algorithm adaptation method, multi-label k-nearest neighbor (MLkNN) [55]. Random forest was used as the underlying classifier where needed. The classifier parameters were the default settings provided by scikit-multilearn. All experiments were run five times, and the metrics were averaged to minimize the effect of randomness.

All the above oversampling methods used an oversampling ratio of 0.25. MLSMOTE, MLSOL, and MLCSMOTE are SMOTE-based, with the number of neighbors set to 10. To validate the accuracy of the attribute-level method, we employed a variant, referred to as MLCSMOTE, that assigns labels to the generated data in the same way as MLSMOTE (section “Attribute-level generation via SMOTE”). The cluster ratio r in MLCSMOTE-FRST was set to 0.1.
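A sketch of the classifier setup described above, using scikit-multilearn's RakelD over a random forest base on the Yeast benchmark; the label-subset size and the dataset loader shown here are illustrative defaults rather than the exact experimental configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from skmultilearn.dataset import load_dataset
from skmultilearn.ensemble import RakelD

# Yeast benchmark shipped with scikit-multilearn (downloaded on first use).
X_train, y_train, _, _ = load_dataset("yeast", "train")
X_test, y_test, _, _ = load_dataset("yeast", "test")

# RAkEL over a random forest base; labelset_size=3 is the library default, kept as an assumption.
clf = RakelD(base_classifier=RandomForestClassifier(n_estimators=100, random_state=0),
             base_classifier_require_dense=[True, True],
             labelset_size=3)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("macro-F1:", f1_score(y_test, pred, average="macro", zero_division=0))
```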

Results of sampling methods

Due to space limitations, the full experimental results are presented in the supplementary material (footnote 1), including the classification performance of MLCSMOTE-FRST against eight baselines on 12 benchmarks, four classifiers, and three evaluation metrics. Table 2 lists the metrics of the relevant algorithms under the RAkEL classifier. Without synthetic samples, RAkEL achieves the best average macro-F1 and micro-F1 among the classifiers; similar results are reported in [40]. The results of the Friedman test are listed in Table 3, and the Nemenyi post hoc analysis with mean rankings is given in Tables 4 and 5.

Table 3 p-value for Friedman test

Performance of macro-F1 and micro-F1. RAkEL, as the best base classifier, serves as the reference for the in-depth analysis. MLCSMOTE-FRST achieved the top macro-F1 and micro-F1 scores on nine datasets, demonstrating strong competitiveness; only on Scene and Medical do the metrics decrease relative to MLCSMOTE, which reflects the effectiveness of using fuzzy rough set theory. MLCSMOTE is generally inferior to MLCSMOTE-FRST but superior to the other methods, particularly in terms of macro-F1, showing that using label-specific clusters to locate the generation positions and proportions is effective. REMEDIAL has the worst performance on the minority-related metrics: directly cloning samples and assigning different labels has a negative impact on metrics that are sensitive to sample labels.

Performance of ACC and Hamming loss. MLCSMOTE-FRST markedly enhances ACC, with the highest average ranking in all four classifier experiments. After applying label revision, improvements were observed on all seven relevant datasets in Table 2. This demonstrates that our algorithm prioritizes the performance of the minority classes while reducing false positives on the majority labels. Hamming loss, as a global metric for multi-label problems, is typically only slightly affected by oversampling algorithms, so MLCSMOTE-FRST shows limited improvements there. Nevertheless, by revising labels with fuzzy rough set theory, MLCSMOTE-FRST improves the Hamming loss over MLCSMOTE on 10 datasets.

Table 4 Mean ranks of evaluated algorithms for Macro-F1 and Micro-F1
Table 5 Mean ranks of evaluated algorithms for ACC and hamming loss (HL)

To further verify the impact of base classifiers on different metrics, we present the p-value of the Friedman test in Table 3. There is a significant difference in the results between algorithms when the p-value is less than 0.05.

The p-value indicates significant differences across all classifiers. The mean ranks and post hoc test results are presented in Tables 4 and 5.

MLCSMOTE-FRST achieves the best average rankings in macro-F1, micro-F1, and ACC, reflecting its effectiveness under different base classifiers. The improvement in macro-F1 is larger than in micro-F1, showing a greater gain in the minority-sensitive metric. In Table 5, MLCSMOTE-FRST demonstrates superior example-based accuracy. MLCSMOTE also performs strongly under three of the base classifiers.

Parameter analysis

An additional study explored the effect of the parameters of MLCSMOTE-FRST, namely the oversampling ratio p and the number of neighbors k, on the Yeast dataset. The problem of imbalanced datasets has been extensively studied for both multi-label and binary problems, such as in the KEEL repository [1]. We used RAkEL as the classifier for this analysis because, on average, it outperformed the other classification algorithms across the datasets (Fig. 3).

Fig. 3  Results of different parameters

As p increases, the performance of MLCSMOTE-FRST improves, with a greater enhancement in micro-F1. By contrast, the change for MLSMOTE-FRST is more stable, and over multiple experimental rounds the corresponding metrics have a small standard deviation. Both achieve the best macro-F1 and micro-F1 at \(p=0.2\). Similar results are observed for ML-ROS and MLSOL. We also found that the performance of ML-ROS at \(p=0.05\) was higher than at the other tested values.

In contrast to the change in p, the average change with k is more consistent. When k exceeds 15, the metrics show a decreasing trend: as the neighborhood expands, the range of generated synthetic samples increases, which may produce additional noisy data because the neighbors are less correlated with the seed samples. The performance of MLCSMOTE-FRST at \(k=15\) is higher than at \(k=10\) and \(k=20\). SMOTE-based oversampling methods can set a smaller value of k, such as 5 or 10, to obtain the best metric. The standard deviation of MLCSMOTE-FRST is smaller than that of the other algorithms for all values except \(k=10\). Combined with the average values, our proposed algorithm at \(k=15\) outperforms the other models and has a more consistent impact on the classification model.

Influence on imbalance-related statistical indicators

Although IR is not directly associated with classifier improvement, it describes the overall multi-label distribution. Table 6 reports the average percentage change in the IR-related metrics relative to the original datasets.

Table 6 Average percentage changes in IR-related metrics

Because REMEDIAL clones samples and decouples labels, the overall number of labels does not change, and its MeanIR and CVIR statistics remain unchanged. A positive change was observed for MLSMOTE, indicating that an oversampling method does not guarantee a reduction in the imbalance ratio. Our proposed MLCSMOTE and MLCSMOTE-FRST algorithms reduce both MeanIR and MeanIMR considerably compared with MLSMOTE and MLSOL, while being slightly outperformed by ML-ROS. ML-ROS directly clones samples for minority labels, which more quickly decreases MeanIR and MeanIMR, metrics strongly related to the minority labels. However, as Tables 4 and 5 show, directly cloning samples is of limited help in improving classification performance. The values of CVIR and CVImR for MLCSMOTE and MLCSMOTE-FRST were also reduced, indicating less dispersion across individual labels. We also note that the IR-related metrics of MLCSMOTE and MLCSMOTE-FRST were similar, both showing a decreasing trend.

Ablation experiments

As in section “Parameter analysis”, we used the same classifier as the benchmark for the ablation experiment. The metrics for incorporating FRST into MLCSMOTE are given in Table 2. Table 7 reports the minority-sensitive classification metrics macro-F1 and micro-F1 for the three SMOTE-based methods with FRST added as a postprocessing step.

Table 7 Classification performance, sensitive to minority labels, after incorporating FRST in postprocessing
Table 8 Adding FRST as a postprocessing method for changes in label statistical values

In Table 7, MLSMOTE and MLCSMOTE show 11 datasets on which the metrics are not degraded. For MLCSMOTE, macro-F1 is enhanced on 10 datasets, indicating the effectiveness of our postprocessing approach. MLCSMOTE-FRST achieves better results than MLSMOTE-FRST and MLSOL-FRST on half of the datasets. Table 8 presents the changes in the label statistics MeanIR and MeanIMR.

It can be observed that the addition of FRST results in a decreasing trend in MeanIR and MeanIMR across most datasets. This demonstrates the effectiveness of the postprocessing approach in achieving a balanced distribution of labels. Although there is no obvious correlation between the label distribution and classification performance, Tables 7 and 8 demonstrate that our proposed postprocessing approach has clear advantages for both classification performance and label distribution.

Discussion

The proposed algorithm markedly improves minority-related performance, as measured by macro-F1 and micro-F1, across the classifiers in the experiment. The ablation results attribute this to the joint determination of feature and label assignments. Even when the feature-assignment process is disregarded and FRST is added to traditional methods such as MLSMOTE and MLSOL, both the classification results and the label-distribution statistics improve noticeably. Moreover, judging by the changes in the statistical values, our proposed method effectively optimizes the label distribution in terms of MeanIR.

From the perspective of model implementation, although FRST can be added to most SMOTE-based generative methods, MLSMOTE, MLSOL, and MLCSMOTE have high algorithmic complexity in feature assignment, mainly due to SMOTE, according to the analysis in section “Computational complexity”. From the perspective of algorithmic complexity, MLCSMOTE-FRST is the recommended generation algorithm. Regarding the classification results, MLkNN typically performs worse than the other classifiers, while RAkEL performed best in our experiments. Increasing p and k raises the complexity of the algorithm, so in practice these parameters must be tailored to the dataset to meet training-time requirements.

Conclusion

We presented a method for oversampling multi-label data that utilizes label-specific clustering and fuzzy rough set theory. Our objective was to develop a more efficient approach for generating synthetic data that establishes the correspondence between the attribute and label levels for minority-label samples. Compared with other weight-assignment methods, MLCSMOTE-FRST defines the samples using a clustering approach specific to the minority labels while introducing fewer manual parameters. MLCSMOTE-FRST clusters samples belonging to minority labels using k-means and predicts the remaining samples to obtain label-specific clustering information. With this clustering information, the IR of each minority label is used to weight the corresponding samples for generation, and the clustering results are used to select the intra-cluster samples for the SMOTE operation. The labels are supported by the intra-cluster samples and optimized using fuzzy rough set theory to capture the consensus label set.

In experiments on 12 datasets with four classifiers, MLCSMOTE-FRST achieved the highest average macro-F1 and micro-F1 compared with five other generative algorithms. Among the individual classifiers, RAkEL had better imbalance-related metrics than the others, and these metrics improved significantly after applying MLCSMOTE-FRST.

In future work, while highlighting the advantages of label-specific cluster-based SMOTE, we can introduce label correlation information to further explore how the relationships between labels influence the oversampling method.