1 Introduction

Multi-label classification is an important component of machine learning [53], The task is challenging due to the large output space of the label set [34]. Missing labels are highly likely to occur in multi-label datasets due to the high labeling cost or the carelessness of human labelers, and each sample may only receive a certain subset of labels. Multi-label learning with missing labels problem (MLML) has gained widespread attention and is crucial for addressing missing labels [44]. The reason for missing labels is widely discussed [34, 41] considers that incorrect labels are less likely to occur, and the abundance of high-frequency labels can hinder the observation or accurate labeling of low-frequency labels. Improving the performance of multi-label classifiers for minority labels in MLML has become a practical issue that needs to be addressed.

Recently, there have been several approaches to multi-label classification. For example, problem transformation [15, 38, 54] transforms multi-label classification into a binary classification form. Algorithm adaptation, such as MLknn [52] and Rank-SVM [24], optimally adapts machine learning to the algorithmic process. In MLML, the main assumption is that the labels are correlated with each other, which provides opportunities for matrix decomposition based on lasso regression [17]. For the imbalance MLML (imMLML), current research focuses on optimization at the algorithmic level. For example, [47] provides information about missing labels through the concept of label smoothing, which is similar to downsampling to control the number of positive instances for labels. This method has also been applied to facial action unit recognition [29]. Cheng et al. [11] utilizes inter-label information entropy to complement labels and prevent the effects of data imbalance. However, these optimizations are primarily developed at the algorithmic level, [29, 47] are mainly focus on specific LLEs, while [11] applied them to extreme learning machines. These approaches have limitations for imMLML in various classifiers, hindering their broader application.

The data level is an important research topic in imbalanced problems. Synthetic instances of the minority labels can be generated using methods such as SMOTE [9], GAN [16] or VAE [23], which can be applied in different classifiers. REMEDIAL [8] and ML-ROS [6] achieve an increase in the number of minority labels by replicating minority samples for different label assignments. MLSMOTE [7] and MLSOL [31] generate synthetic instances using SMOTE and assign synthetic labels using neighbor information. However, imMLML is a common phenomenon in the multi-label field, and the main over-sampling focuses on the definition of labels [6, 8] and the selection of the generation position [7, 31]. Although there are methods involving using fuzzy rough sets to define labels [33], there is still a significant challenge in effectively filling in the missing labels for synthetic samples.

To recover missing labels in imMLML and to increase the number of minorities, we propose the oversampling method \({\textbf{M}}\)ulti-\({\textbf{l}}\)abel with \({\textbf{A}}\)uxiliary \({\textbf{W}}\)eight \(\textbf{SMOTE}\) (MLAWSMOTE) based on two matrix decomposition weights, which learn from instance-label and label correlation, respectively. The former is derived from the variance between observed and retrieved labels, while the latter is acquired from the label distribution. The ratio for synthetic samples is generated based on the recovered and observed labels and label distributions to prioritize samples with missing labels and low frequency. The labels for synthetic samples are determined by the instance-label weight. We validate the effectiveness of our proposed algorithm by combining various datasets, missing label ratios, and classifiers collectively.

This paper makes the following contributions:

  • We propose an oversampling approach using label correlation weights to address label missing and imbalance;

  • We combine label distribution and label recovery differences to enhance samples with low frequency and high label missing probability;

  • We propose using weights to complement the labels of generated instances, aiming to optimize the performance of multi-label classifiers.

The remainder of this paper is divided into five sections. Section 2 is the related work including MLML and imMLML. Section 3 introduces our proposed method. Section 4 is the experimental setting and analysis. Section 5 is the conclusion.

2 Related Work

2.1 Multi-label Learning with Missing Labels

In real-world scenarios [1, 3, 4, 25, 26], accurate labeling of samples in multi-label problems is often challenging due to the complexity of the label output space. Consequently, only a subset of the correct labels can be obtained, leading to the phenomenon known as MLML. [34] suggests that MLML can be categorized into implicit and explicit types based on the causes of the problem. Implicit MLML considers that labeling part of the positive class as a negative class is due to label loss. Explicit MLML suggests that the unobservable labels are the main reason. Among these, the former is more likely to occur in the problem of imbalance, as minority labels are typically harder to detect. [41] also shows that the proportion of missing tail labels is larger than the proportion of missing head labels. Matrix decomposition and graph-based methods are the primary techniques employed to extract label correlations from the dataset and to impute missing labels.

Table 1 Detail of previous work

In multi-label learning, the label matrix is usually considered low-rank due to the presence of label correlations, and the low-rank matrix is also a general way to complete the missing labels. The optimization of algorithm convergence is also an important part of prediction application [1, 2, 36, 40]. For example, [21] uses the label correlation matrix to capture the influence of different labels and selects shared sparse feature structures between labels by applying the l-1 and l-2 penalties. LRMML [27] conducts a study on matrix decomposition in explicit machine learning. They utilized the graph Laplacian of the label association matrix to represent the idea that the more relevant the labels are, the more similar the label-specific weights. In terms of weight optimization, C2ML and C2MLE [44] apply a clustering approach. They use fuzzy c-means to obtain cluster centroid and granular discrimination-based regularization to optimize weights for inter-class and intra-class distance sensitivity. [12] considers that the previous label correlation matrix ignores the information of the instance and assists in establishing the relationship with the missing label by constructing the attention distribution matrix and the fusion matrix. Zhang et al. [50] extends the matrix decomposition-based method to semi-supervised MLML using both unlabeled and labeled data to construct a feature manifold regularizer for learning local label correlations. In addition to assuming a low rank for the label correlation matrix, ColEmbed [18] considers that both the recovered feature matrix and the classifier are low-rank. It also applies cost-sensitive matrix factorization to address the imbalance problem.

Graph-based methods utilize label correlation to construct a weighted graph for predicting missing labels. Wu et al. [47] establishes graph networks based on the similarity of samples and labels while [10] accomplishes it by correlation distances in features and labels. Hashemi et al. [20] combines labeled directed graph dependencies with image classification, while also incorporating word embedding. Beside these, methods such as evolutionary algorithm [30] and information entropy [19] are also solutions for MLML.

2.2 Imbalanced Multi-label Learning

Class imbalance distribution is a widespread phenomenon in multi-label datasets. Improvements are mainly made at the algorithm level and data level. At the algorithm level, methods like [46, 48] improve the classification performance in specific tasks.

Although imbalanced label distribution is a common problem in MLML, data distribution can be improved by downsampling and oversampling. For example, MLeNN [5] and MLTL [37] draw on the Tomek Link method to eliminate majority classes that overlap each other. Roseberry et al [39] incorporate the idea of downsampling into the identification of data streams, quickly adapting to changes in data streams by removing data.

The oversampling algorithm expands the minority samples by generating synthetic samples. Generating synthetic instances from the feature level is also the primary method in dichotomous and multi-classification. MLSMOTE and MLSOL involve interpolating weights from various minority groups. EMLSOL [32] oversamples the minorities and downsamples the majorities. LCOS [51] considers label correlations and assigns weights to synthetic instances based on their distance information. Due to the multi-label property, methods like LP-ROS and ML-ROS perform repetitive oversampling by defining the minority labels differently, while REMEDIAL clones the samples and decouples the labels. The detail of literature review is in Table 1, it is common to lose minority labels in the multi-labeling dataset. However, there is a lack of oversampling algorithms to address the class imbalance resulting from this issue.

3 Proposed Method

In our work, we assume that the data set D contains n instances \(\{(x_{1},y_{1}),...,(x_{n},y_{n})\}\), where \(x \in R^{d*n}\) and d is the data dimension, y in \(\{-1,1\}^{q*n}\) and q is the label dimension. As a general rule, it is more challenging to accurately identify the labels of minority groups. y in \(-1\) contains negative or missing labels. Our goal is to improve the prediction of minority class labels in imMLML. The architecture of our proposed algorithm is shown in Fig. 1, and we elaborate on it in this section.

Fig. 1
figure 1

The framework of MLAWSMOTE. Blue represents a label of 1, while green represents a complementary potential label. Assuming that the 2nd and 4th labels are minorities. \(\Omega \) is using to construct ideal recover labels. W is used to fit the ideal label and design the generation ratio and recovery label for the synthetic instances

3.1 Matrix Factorization Weights Construction

In previous works [27, 35, 42], the correspondence between samples and labels is assumed to be a linear function, ie., \(Y = WX\) and \(W=[w_{1},w_{2},...,w_{q}]^T\) where W in \(R^{q*d}\). Each column in W is responsible for mapping the data dimensions to a specific label, where \(w_{ij}\) indicates the contribution of the j-th dimensional feature to the i-th label. To determine the feature-label correspondence in the MLML, W typically utilizes l1 or l2-norm penalty for constraint. Such a sparsification operation restricts the W corresponding to certain labels to a smaller value to achieve feature selection. Assuming Y represents the observed labels and \(\Omega Y\) represents the recovered labels. By doing so, WX can map the features to the ideal labels \(\Omega Y\):

$$\begin{aligned} \min \limits _{W} \frac{1}{2} \Vert \Omega Y - WX \Vert ^{2}_{F} + \lambda _{1} \Vert W \Vert ^{2}_{F} \end{aligned}$$
(1)

where the former term implements the fit \(\Omega Y\) by W, while the latter term controls the complexity of W. \(\lambda _{1} \ge 0 \) is its tradeoff parameter, \(\Vert \cdot \Vert _{F}\) represents the Frobenius norm.

Kumar and Rastogi [27], Sun et al. [42] point out that a linear self-recovery matrix \(\Omega \) can be learned from the observed label Y, and \(\Omega \) is based on three assumptions. The first observation is that the observed variable Y should be identifiable in the recovered variable \(\Omega Y\) because Y is true when it equals one. Also, \(\Omega \) should be a low-rank matrix because labels are correlated with each other in multi-label learning. \(\Omega \) being low-rank is advantageous for filling in missing labels. The last one is \(\Omega \) should be greater than zero. We assume that labels are positively influenced by each other and do not intend for labels that are already positive to be corrupted by \(\Omega \).

Since the issue of missing labels is prevalent in imbalanced distributions, it is essential to enhance W with guidance from the label correlation matrix \(\Omega \). If two labels \(l_{i}\) and \(l_{j}\) have a high correlation, then \(w_{i}\) and \(w_{j}\) are also more correlated, indicating a similar Euclidean distance. For the label correlation matrix \(\Omega \), the feature-specific weights can be constrained by the Laplacian matrix \(L \in R^{q*q}\), it will help W learn from second-order label correlation. Accordingly, the way to build W and \(\Omega \) expressed as follows:

$$\begin{aligned} \begin{aligned}&\min \limits _{W,\Omega } \quad \frac{1}{2} \Vert \Omega Y - WX \Vert ^{2}_{F} + \lambda _{1} \Vert W \Vert ^{2}_{F} + \lambda _{2} \Vert \Omega Y - Y \Vert ^{2}_{F} \\&\quad + \lambda _{3} \Vert \Omega \Vert _{1} + \lambda _{4} tr(WLW_{T}) \\&\quad \; s.t \; \Omega \ge 0 \\ \end{aligned} \end{aligned}$$
(2)

where \(\lambda _{1}\) controls label-specific feature and the larger the more sparse feature is. \(\lambda _{2}\) controls loss for the observed and recovered labels. \(\Vert \cdot \Vert _{1}\) represents \(l_{1}\) norm regularizer to learn sparse label dependencies. \(\lambda _{3}\) controls the sparsity of correlation between different labels. \(\lambda _{4}\) controls the distance between model coefficients for any two labels. The optimization process is in the Sect. 3.2.

3.2 Matrix Factorization Weights Optimization

Equation 2 is convex and non-smooth due to it has \(l_{1}\)-norm terms. It can be solved by accelerated proximal gradient method [28] in an alternating way.

\(\mathbf {Step\;1.}\) Fixing \(\Omega \), the derivative for the first and last term involved W in Eq. (2) is:

$$\begin{aligned} f_{w} = W X X^{T} - \Omega Y X^{T} + \lambda _{4}(L + L^{T})W \end{aligned}$$
(3)

The term of \(l_{1}\)-norm for W can be solved by the element-wise soft-threshold operator as:

$$\begin{aligned} prox_{\epsilon }(w) = \max (\vert w \vert - \epsilon _{1}, 0){\text {sign}}(w) \end{aligned}$$
(4)

where \(\epsilon _{1}\) is a small value to activate w. In t-th epoch, W can be updated by:

$$\begin{aligned}{} & {} W^{t} = W_{t} + \theta (W_{t} - W_{t-1}) \end{aligned}$$
(5)
$$\begin{aligned}{} & {} W_{t+1} = {\text {prox}}_{\epsilon _{1}} (W^{t} - \theta f_{w}(W^{t}) ) \end{aligned}$$
(6)

where \(\theta \) is the step size. The update of W is realized by Eqs. (3)–(6) when \(\Omega \) is fixed.

\(\mathbf {Step\;2.}\) With W fixed, the derivative in Eq. (2) for \(\Omega \) in the first and third term is:

$$\begin{aligned} f_{\Omega } = (1 + \lambda _{1}) \Omega YY^{T} - WXY^{T} - \lambda _{1} YY^{T} \end{aligned}$$
(7)

For the \(l_{1}\)-norm and nonnegative constraint for \(\Omega \), it can be solved by the element-wise soft-threshold operator as:

$$\begin{aligned} {\text {prox}}_{\epsilon }(\Omega ) = \max (\vert \Omega \vert - \epsilon _{2}, 0) \end{aligned}$$
(8)

From literature [22], \(\epsilon _{1}\) and \(\epsilon _{2}\) are related to \(\lambda _{1}\) and \(\lambda _{3}\). To reduce the impact of parameter settings, we can set \(\epsilon \) equal to the corresponding \(\lambda \) respectively. Combine with Eq. (7) and (8), the update method for \(\Omega \) in the accelerated proximal gradient is:

$$\begin{aligned}{} & {} \Omega ^{t} = \Omega _{t} + \theta (\Omega _{t} - \Omega _{t-1}) \end{aligned}$$
(9)
$$\begin{aligned}{} & {} \Omega _{t+1} = {\text {prox}}_{\epsilon _{2}} (\Omega ^{t} - \theta f_{\Omega }(\Omega ^{t}) ) \end{aligned}$$
(10)

By alternately training steps 1 and 2 until the change in weights reaches a threshold. When we acquire W and \(\Omega \), they can help determine the missing label degree for samples, which is valuable for creating synthetic data.

3.3 Synthetic Data Generation Ratio

When the auxiliary weights W and \(\Omega \) are determined, data distribution and label inconsistency impact the confidence scores of the samples. The data distribution focuses on the generation proportion for the minorities, while prediction error underscores the significance of generation for samples with high probability of missing labels. Based on these considerations, we weighted the data generation for each minority sample based on their label distribution and prediction loss.

$$\begin{aligned} g_{i} = \sum _{j=1}^{j=q} \frac{ 1 - \beta }{ 1 - \beta ^ {n_{j}}} \Vert Y - WX \Vert ^ {2} \end{aligned}$$
(11)

where \(n_{j}\) is the effective number of labels and \(\beta \) is smaller than 1. The minority samples are weighted by a factor of \((1-\beta )/(1-\beta ^{n_{j}})\) based on the inverse label frequency. We set the hyperparameter \(\beta \) to 0.99. Due to W being learned from the recovered label \(\Omega Y\), when the value of \(\Vert Y - WX \Vert ^ {2}\) is larger, it indicates that the probability of mislabeling this sample is higher. Based on the two parts, Eq. (11) can address labels with minority and missing values. Similar to MLSOL, MLAWSMOTE defines minority labels as those that constitute less than half of n. When the sample contains minority labels, the operation of Eq. (11) is performed, and the proportion g is generated after normalization.

3.4 SMOTE Operation with Auxiliary Weights

In multi-label learning, there are varies methods to assign values to synthetic data labels, including copy labels [8], knn statistical assignment [7], SMOTE dependent sample assignment [31]. However, by assigning values that depend on the original observed label, no additional information can be provided while the dataset is imMLML. Leveraging neighbor information for label assignment has a restricted capability to supplement absent labels and influences the capacity to strengthen boundaries.

We consider WX capable of complementing labels based on label correlation. When dealing with synthetic data created by SMOTE, we calculate predicted labels by multiplying WX when the distance is near the seed instance, and then combine them with the corresponding seed sample. It is ensured that the synthetic instance is complemented with any possible missing labels based on the obtained seed sample and its observed labels. The synthetic instances are featured and labeled as shown in Eqs. (12) and (13).

$$\begin{aligned}{} & {} xs = x + r \cdot (xn - x) \end{aligned}$$
(12)
$$\begin{aligned}{} & {} ys = y \cup (W \cdot xs) \end{aligned}$$
(13)

where x is the seed sample of SMOTE, xn is its neighbour. xs and ys represent the feature and label for synthetic instances, while r denotes a random value. It is worth noting that the random value r of SMOTE should preferably be in the range of [0, 0.5] to avoid assigning different labels in the region of [0.5, 1], as the generated SMOTE samples may also belong to the minority class. Overall, the pseudocode is presented in Algorithm 1 and the flow chart is in Fig. 2. From Fig. 2, MLAWSMOTE is mainly divided into three parts, defining minoirty labels via IR; training W and \(\Omega \) in an iterative way; use the weight W to define generate ratio and SMOTE operation.

Fig. 2
figure 2

The flow chart of MLAWSMOTE

3.5 Computational Complexity and Convergence Analysis

MLAWSMOTE can be divided into the following stages: acquiring the auxiliary weights W and \(\Omega \), calculating the generation ratio g and generating the synthetic samples. The most time-consuming complexity during the iteration is computing W and \(\Omega \). The complexity of computing the gradient of W is \(O(nd^{2} + qd^{2} + dnq + dq^{2})\), and it requires \(O(nq^{3} + q^{3} + dq^{2} + dnq)\) to obtain the gradient of \(\Omega \). The complexity to calculate the generation ratio is O(n) and SMOTE requires \(O(n^{2}d + kn^{2} + pn(q+d))\), where the first two represent the time complexity of computing knn. The former refers to the complexity of the algorithm, while the latter pertains to the complexity of generating synthetic data based on the overrall generated ratio p.

In terms of algorithmic complexity, the oversampling algorithm can proceed by first defining the minority samples, generating samples using SMOTE, and implementing the proposed label correction method. LP-ROS, ML-ROS, and REMEDIAL focus on defining minority samples. LP-ROS and REMEDIAL utilize mean size and SCUMBLE to define the minority class and perform sample cloning, respectively. Following these two steps, the algorithmic complexity can be approximated as \(O(qn+n)\). ML-ROS requires the generation of a proportion p for control, with algorithmic complexity O(qnp). MLSMOTE and MLSOL require the additional use of kNN to find SMOTE samples. This, in combination with the step of defining the minority samples, has computational complexity \(O(qn+n^{2}d+kn^{2})\) for the former step. The latter additionally requires the computation of weights, with complexity \(O(qn+n^{2}d+kn^{2}+nkq)\). The computational complexity for MLCSMOTE-FRST is \(O(qn+n^{2}d+kn^{2})\) plus \(O(qn+n^{2}d+kn^{2}+nkq)\) due to it has two steps, which are focus define the generation location and revise the labels. Compared with other algorithms, MLAWSMOTE mainly lies in the time spent on the optimization of W and \(\Omega \). Fortunately, the optimization method such as [55], which uses the alternating direction method of multipliers to reduce the time of parameter selection, can also be used in the process of MLAWSMOTE parameter optimization.

According to the definition in [42], the parameters controlled by \(\lambda _{1}\) and \(\lambda _{3}\) are convex but nonsmmoth, while the rest can be regarded as convex. For Linear inverse problems, the convergence analysis of iterative shrinkage-thresholding algorithms has been proved in previous literature [13, 14]. To further illustrate the convergence of MLAWSMOTE, we plotted the difference curves of WX and \(\Omega Y\) for different datasets under a \(30\%\) minority label missing rate to show that W and \(\Omega \) are effective. In Fig. 3, it can be observed that, except for the medical dataset which started to oscillate, the other datasets maintained excellent convergence.

Fig. 3
figure 3

The convergence curves of MLAWSMOTE on six data sets under \(30\%\) minority label missing rate

Algorithm 1
figure a

MLAWSMOTE

4 Experiments

To demonstrate the superiority and stability of our proposed algorithm, we proceed from the following three research questions.

\(\mathbf {Research\;question\;1}\): Compare MLAWSMOTE and other data generation algorithms in the absence of minority class labels in the training set, and imbalance related metric changes between classifiers.

\(\mathbf {Research\;question\;2}\): Based on the above classifiers and the compared generation algorithms, the nonparametric tests is validated for difference minority labels missing ratio.

\(\mathbf {Research\;question\;3}\): The parameter sensitivity of MLAWSMOTE is demonstrated by analyzing the classification results.

We analyze the above research questions in the Sects. 4.24.4 and before that we present the datasets and the compared generation algorithms and the different classifiers.

4.1 Experimental Framework

To verify the effectiveness of the algorithm, we conducted six datasets with varying ratios of missing minority labels. The domains in datasets include music, audio, text, biology, and images. The split for training and testing was supported by scikit-multilearn [43]. Similar to the matrix decomposition in related papers [49, 56], for a dataset where \(n > q\), we use PCA to retain the top \(95\%\) of the features; otherwise, we apply standard normalization. For the minority labels, the missing rate ranges from 0 to \(30\%\), with a sampling interval of \(5\%\). Overall, the number of datasets we evaluated is 42, as each benchmark has seven missing rates. The details of the datasets are shown in Table 2.

Table 2 Detail of dataset

The comparison algorithm utilizes the oversampling technique to address the imbalance in the multi-label dataset. We also used the fuzzy rough set oversampling algorithm MLCSMOTE-FRST [33] for comparison, which mined the label correlation with minority sample neighbors and used OWA to complete the label imputation. The detail of the compared oversampling methods are shown in Table 1. In the experiments of classification metrics, the generation proportion is set to \(25\%\) of the total dataset, and the number of nearest neighbors in MLSMOTE, MLSOL, and MLAWSMOTE is set to 10.

The iteration rounds of MLAWSMOTE are set to 100 epochs. After each round of the training process, the difference between WX and \(\Omega Y\) is recorded. The epoch with the smallest difference is selected to obtain the relevant W and \(\Omega \). The parameter \(\lambda _{1}\) to \( \lambda _{4}\) is set consistently from \(1e^{-3}\)to \(1e^{-6}\) to find the maximum convertible parameter.

The classification algorithm used in the experiment consists of two types: problem transformation and matrix decomposition. Problem transformation techniques include binary relevance (BR), classifier chains (CC) [38], and random k-labelsets (RAkEL) [45]. These three methods transform the multi-label problem into binary classification, chain structure, and label subsets, respectively. Multi-label classification with group-based mapping (MC-GM) [35] is a type of matrix decomposition that assumes similar features and labels correspond to instances in the same group. Label extraction is carried out based on the relationship between groups and labels.

There are multiple ways to evaluate performance in imbalanced multi-label datasets. We would prefer to evaluate the overall classification effectiveness while also paying attention to minority groups. Based on these two considerations, we choose example-based accuracy (ACC) and Hamming loss (HL) to describe the overall metrics of different data generation methods corresponding to classification. We use Macro F1 and Micro F1 to reflect the metrics related to the minority. The different metrics are as follows:

$$\begin{aligned}{} & {} {\text {ACC}}(\uparrow ) = \frac{1}{n} \sum _{i=1}^{n} \frac{ \mid y_{i} \bigcap {\hat{y}}_{i} \mid }{ \mid y_{i} \bigcup {\hat{y}}_{i} \mid } \end{aligned}$$
(14)
$$\begin{aligned}{} & {} {\text {HL}}(\downarrow ) = \frac{1}{nq} \sum _{i=1}^{n} \sum _{j=1}^{q} I(y_{j}^{(i)} \ne {\hat{y}}_{j}^{(i)} ) \end{aligned}$$
(15)
$$\begin{aligned}{} & {} {\text {Macro}}F1(\uparrow ) = \frac{\sum _{1}^{j}(\mathrm {F1-score})}{j} \end{aligned}$$
(16)
$$\begin{aligned}{} & {} {\text {Micro}}F1(\uparrow ) = \frac{{\text {TP}}}{{\text {TP}} + 0.5 * ({\text {FP}} + {\text {FN}})} \end{aligned}$$
(17)

where HL measures the number of incorrectly predicted labels as a percentage of the overall number of labeled labels. The smaller indicated the better. In ACC, MacroF1, MicroF1, the higher means the better performance, TP, FP and FN are the sum of the corresponding values of each class. ACC is the accuracy of all data in the test set, which can reflect the impact on the accuracy of all data after adding synthetic samples. MacroF1 takes the mean value of F1-Score of all labels and MicroF1 calculates the accuracy and recall of all labels to obtain the F1-Score value according to the calculated results. In the case of an extremely imbalanced case, MacroF1 can better reflect the classification of the minority because it considers all classes in a balanced way to get the F1-score.

Table 3 Comparative results on emotions, birds, and enron under BR
Table 4 Comparative results on yeast, scene, and medical under BR

4.2 Results of Sampling Methods

Due to space constraints, we present the model performance corresponding to the BR classifier (with \(10\%\) as the missing label interval) in Tables 3 and 4, where BR might have a better adaptability in missing label by removing invisible samples of labels [54]. Also, in Tables 5 and 6, we present the average ranking of for the four classifiers, the missing label interval is same with Tables 3 and 4. The complete average and standard deviation performance is presented in the supplementary material.Footnote 1 We note that the oversampling methods and MLAWSMOTE contain SMOTE random seed and other random parameters. To verify the stability of each method, we conducted each experiment five times to calculate the average values.

Table 5 Average ranking under RAkEL and BR
Table 6 Average ranking under MC-GM and CC

We perform the following analysis using various metrics, including missing label rates and applied classifiers.

4.2.1 Comparison of Oversampling Algorithms

Compared to other oversampling methods, our proposed method performs well with all four classifiers. In three problem-transfer classifiers, our proposed method achieves the optimal average values when minority labels are missing. It shows that our proposed method can help the classifiers maintain good performance for both global and minority-related outcomes. Other oversampling methods, on the other hand, have inconsistent effects on the classifiers. For example, LP-ROS achieves suboptimal MicroF1 for Rakel at a \(30\%\) missing rate, while the CC classifier is weaker than ML-ROS and MLSOL.

4.2.2 Impact from Minority Labels Missing Rate

Typically, the metrics exhibit a declining trend with an increase in the label missing rate. In the case of MacroF1, the metrics experienced reductions of \(24.78\%\), \(17.81\%\), \(3.8\%\), and \(19.56\%\) for the four classifiers, respectively, in the absence of synthetic instances. The reductions in performance for MLAWSMOTE are \(15.79\%\), \(13.63\%\), \(3.78\%\), and \(15.21\%\), respectively, demonstrating its superior performance compared to other oversampling techniques. In comparison to MacroF1, MicroF1 demonstrates a smaller decrease when using MLAWSMOTE, with reductions of \(9.79\%\), \(8.1\%\), \(2.8\%\), and \(10.31\%\). The decrease in MicroF1 was observed to be less pronounced compared to MacroF1 as the percentage of missing labels increased in the other oversampling methods. We posit that the sensitivity of MacroF1 to the identification of minority classes is the underlying reason.

4.2.3 Impact from Different Classifiers

In comparison to transfer-based methods, the MC-GM correlation metrics exhibit less variation as the missing rate increases. MC-GM also incorporates label-related information by utilizing global group and instance-group relations, thereby mitigating the impact of label missing on classification outcomes. Leveraging the synthetic instances generated by MLAWSMOTE, MC-GM attained the highest average metrics in terms of ACC, HL, and MicroF1. In three assessed transfer-based methodologies for problem-solving, the performance evaluation revealed a pattern of declining metrics with an increase in the missing label rate. In the case of the Emotions dataset, for example, MacroF1 decreased by \(36.99\%\), \(25.78\%\), and \(33.42\%\) without the addition of synthetic data. In contrast, when using the MLAWSMOTE method, MacroF1 decreased by \(16.45\%\), \(14.2\%\), and \(19.69\%\) respectively, demonstrating the stability of the corresponding metrics of our method. In addition, various datasets and prediction targets necessitate the selection of suitable classifiers. For example, the MicroF1 of Rakel with MLAWSMOTE is \(44.28\%\), whereas it is \(11.82\%\) in MC-GM with MLAWSMOTE. However, this already represents the best performance for MC-GM with the addition of synthetic methods.

To further inspect the significance of our proposed algorithm, we perform parameter tests in Sect. 4.3.

4.3 Significance Test

To compare oversampling algorithm metric changes with varying label missing rates, we categorize missing rates as low (5–15\(\%\)) and high (20–30\(\%\)), then compare the Friedman test results for each classifier. A p-value less than 0.05 indicates a significant difference between the oversampling algorithms. A smaller p-value indicates a more significant difference (Table 7).

Table 7 p-value for Friedman test in MLAWSMOTE

The MC-GM classifier shows no significant difference between the oversampling methods for MacroF1 at low missing levels, whereas the p-value at high levels is less than 0.05. Significant differences were observed in other problem transfer-related classifiers. The p-value of BR in MacroF1 and MicroF1 is significantly smaller than that of Rakel and CC, indicating a substantial difference.

We also present the Nemenyi test in the BR classifier in Figs. 4 and 5, CD equals 1.9389, corresponding to 18 datasets (six datasets with three missing rates for low and high missing levels). MLAWSMOTE stands out from other oversampling methods in terms of ACC, MacroF1, and MicroF1 at the high label missing rates. Among the validated metrics, MLAWSMOTE tends to group similarly to MLCSMOTE-FRST, LP-ROS and ML-ROS. The original dataset for the HL, showing less variability compared to the other three metrics. According to Table 5, LP-ROS improves average rankings for certain minority-sensitive metrics as the label missing rate increases, but it is not significantly different from not adding simulated data. Our proposed oversampling method stands out for its higher significance compared to other methods.

Fig. 4
figure 4

Comparison with competing algorithms using Nemenyi test in low level missing rate, CD = 1.9389

Fig. 5
figure 5

Comparison with competing algorithms using Nemenyi test in high level missing rate, CD = 1.9389

4.4 Parameter Sensitivity

The parameters of MLAWSMOTE can be divided into two parts: the parameters \(\lambda _{1},\lambda _{2},\lambda _{3},\lambda _{4}\) related to MLAWSMOTE convergence and the parameters r and p related to oversampling. For convergence-related parameters \(\lambda _{1}\) to \(\lambda _{4}\), we recommend grid search for parameter selection. The reason is that we find that the magnitude of \(\lambda \)-related parameters has different effects on the results. Taking Yeast and BR as an example, when we change one parameter and fix the other three parameters, the impact on classification results is shown in Fig. 6. By applying grid search, when \(\lambda _{1} = 0.001\), \(\lambda _{2} = 0.0001\), \(\lambda _{3} = 1\) and \(\lambda _{4} = 0.001\), Yeast achieves the best classification level.

Fig. 6
figure 6

Performance on Yeast with varying values in \(\lambda _{1}\) to \(\lambda _{4}\) on indicators under BR

Similar to most oversampling methods for multi-label imbalance learning, MLAWSMOTE is concerned with the number of neighbors k and the generation ratio p. k is concerned with the expansion of the space of possible generation of minority samples and is related to the ability of the algorithm to construct decision boundaries. p focuses on the number of samples generated and is related to the degree of imbalance reduction. In this section, we use Yeast as the evaluated dataset, which is also the primary binary and multi class dataset. The classifier uses BR which achieves good performance as a classifier in both MacroF1 and MicroF1. Missing rate equals to \(30\%\) was applied as a control ratio to verify the parameter variation in the extreme minority of missing labels.

Fig. 7
figure 7

The impact of changing sampling neighborhood k on indicators under BR

Fig. 8
figure 8

The impact of changing oversampling ratio p on indicators under BR

In Fig. 7, changes in k have minimal impact on performance, particularly when there is an extreme missing rate of minority labels. In such cases, limited labeling is available for generating synthetic samples over an extended range. Interestingly, in Fig. 8, all the metrics of MLAWSMOTE show an increasing trend as the sampling rate increases. The metrics of the other oversampling algorithms show little change. The synthetic instances generated by MLAWSMOTE can enhance the classifier’s focus on minorities and complement labeling information, improving metrics like ACC and HL. In practical applications, it is essential to comprehensively design the oversampling ratio based on the computational performance of the classifier.

4.5 Discussion

MLML is often formulated as a matrix completion problem that uses low-rank assumptions about the matrix to fill in the missing entries based on the observed items, which is also the role of \(\Omega \) completion. In addition, Eq. (2) incorporates the second-order label correlation by calculating the Euclidean distance. This is based on the manifold hypothesis of label space. That is, if the label correlation is high, they will have similar label-specific features, and the corresponding W has a similar Euclidean distance. MLAWSMOTE, designed on the basis of these assumptions, has the following advantages and limitations.

MLAWSMOTE can be used directly in multi-label classifiers and performs well in multiple classifiers. We found that both the global indicators and minority label related indicators have improved. As can be seen from Fig. 4c, for MicroF1, label imputation method MLCSMOTE-FRST still has some advantages when the label missing rate is low, but with the increase of label missing rate, both MacroF1 and microF1 show excellent performance.

However, compared with other oversampling algorithms, MLAWSMOTE requires multiple iterations to learn W and \(\Omega \), and needs to choose on parameters of \(\lambda \), which requires more computation time for other methods. In terms of the performance of classifiers, although the advantage of problem-transfer classifiers is obvious, the MC-GM improvement effect on labels that can be supplemented is modest.

5 Conclusion

We introduce MLAWSMOTE, an oversampling method combined with matrix decomposition, to tackle minority label missing in multi-label issues. MLAWSMOTE differs from traditional oversampling by focusing on label correlations. MLAWSMOTE learn prediction and label association weights based on the assumption of label association. It adjusting data generation ratios based on recovered labels versus original ones. For synthetic instances in the feature space, auxiliary weights are used to complement their label space information. Our proposed algorithm shows strong performance across diverse datasets with varying missing rates, the classification performance of problem-transfer classifiers is improved. Concretely, MLAWSMOTE achieving high scores in global metrics like ACC and HL, as well as minority correlation metrics such as MacroF1 and MicroF1.

In future research, further study is needed to effectively define the generation ratio in imMLML. MLAWSMOTE stands out from other oversampling methods as the metrics improve with an increase in the generation ratio. The integration of training efficiency of the classifier and desired metrics to determine the number of synthetic instances generated needs further refinement.