1 Introduction

Multi-label learning (MLL) trains a model that can assign all possible labels to unseen instances and has been applied in domains such as text classification [1], image annotation [2], protein function detection [3], and personalized recommendation [4]. For example, in image annotation, a landscape image often carries labels such as sky, sea, and beach. With the rapid development of the Internet, data are generally characterized by high dimensionality [5]. Most existing label-specific feature (LSF) learning methods [6] are embedding-based and project high-dimensional data into a low-dimensional latent space, which effectively mitigates the curse of dimensionality and improves the classification performance of MLL.

In MLL, logical labels easily introduce non-equilibrium in the instance distribution and degrade multi-label classification performance. Converting logical labels into numerical labels through label enhancement is therefore beginning to emerge; this approach also has promising applications in intelligent fault detection within artificial intelligence [7,8,9]. However, label enhancement preprocesses the data and thus incurs high time complexity. Non-equilibrium-based approaches, in contrast, not only reduce the time complexity of the algorithm but also achieve excellent classification performance.

Data non-equilibrium [10] can be divided into two types: inter-class non-equilibrium and intra-class non-equilibrium. Although methods based on inter-class non-equilibrium can initially relieve the distribution non-equilibrium of multi-label data, they only divide all labels into positive and negative categories and ignore the correlation between labels. In contrast, methods based on intra-class non-equilibrium calculate the density of each label within each instance to replace the original label. This not only alleviates the label classification non-equilibrium and widens the classification margin between labels but also takes the correlation between different labels into account.

Label correlation (LC) [11] has been widely applied in LSF learning. In studying LC, many scholars have found that multi-label datasets generally exhibit non-equilibrium label distributions. It has gradually been realized that labels are not simply symmetrically related to each other; they can be asymmetrically related, or even causally related with a specific direction. Asymmetry and causality are both relationships between labels, but causality [12] carries a clearer directional relationship than asymmetry.

For example, Fig. 1 shows a possible causal graph of depression, with heartbreak and abnormal neurotransmitter secretion as causes and depression as the effect. However, depression does not necessarily lead to abnormal neurotransmitter secretion, so the causal relationship between them is asymmetric. In contrast, cosine similarity yields a label correlation that is always symmetric, which leads to inaccurate LSF learning. The PC algorithm can learn true causal relationships among a small number of labels, which corrects spurious label information and improves the efficiency of LSF learning.

Fig. 1 Causal relationship of depression

Because label distribution non-equilibrium is prevalent in multi-label data, this paper alleviates the problem by replacing the original label matrix with the label density matrix derived from the intra-class non-equilibrium method. To obtain more realistic label-specific features while keeping the running time of the algorithm low, we construct a neighbor matrix that encodes the causal relationships among the original labels and combine it with the correlation of the density matrix computed by cosine similarity. This resolves the spurious correlations mixed into the label correlation obtained by traditional methods and, in turn, guides the learning of label-specific features. Based on the above analysis, this paper proposes a causality-driven intra-class non-equilibrium label-specific feature learning algorithm (CNSF), whose contributions are as follows:

  1. Using the PC algorithm to compute causal relationships between labels and combining them with the density label correlation to construct a causal density label correlation, used in place of traditional LC, guides the learning of a more accurate LSF.

  2. Using the intra-class non-equilibrium method to calculate the label density of all instances in place of the original matrix effectively alleviates the non-equilibrium of the label distribution and further widens the classification margin between labels.

  3. The algorithm is decomposed in an ablation analysis and compared with the CNSF-IC variant, which considers inter-class density and causality. The results demonstrate that considering intra-class density and causality effectively improves the classification performance of MLL.

The remaining sections of this paper are organized as follows: Sect. 2 presents related work. Section 3 introduces the construction and optimization of the CNSF model. Section 4 presents the datasets, evaluation metrics, compared algorithms, and parameter settings used in the experiments. Section 5 reports the ablation analysis and statistical hypothesis tests that demonstrate the effectiveness of the proposed method. Section 6 provides a summary.

2 Related Work

Traditional MLL algorithms assume that all class labels are distinguished using the same features. However, this assumption is not reasonable, and the resulting classification is often sub-optimal. The LIFT algorithm proposed by Zhang et al. [13] assumes that each class label is classified based on its own unique features, which significantly improves MLL classification performance compared with traditional problem transformation and algorithm adaptation methods; however, it does not take LC into account. The LLSF algorithm proposed by Huang et al. [14] effectively improves LSF classification performance based on the assumption that strongly correlated labels share more features than weakly correlated or uncorrelated ones; however, considering LC alone is not enough to achieve good classification performance. The FF-MLLA algorithm proposed by Cheng et al. [15] measures inter-sample similarity using the Minkowski distance based on LC and classifies multiple labels using singular value decomposition and an extreme learning machine. Han et al. [16] proposed the LSF-CI algorithm, which further improves MLL performance by considering not only LC but also the correlation between instances, computed from the feature space with a probabilistic graphical model. The LSML algorithm proposed by Huang et al. [17] learns to recover LSF by obtaining a label complement matrix from the missing labels through high-order LC, which effectively addresses the missing-label problem.

With the rapid development of the Internet, the dimensionality of multi-label data keeps increasing, and the sparsity and non-equilibrium of labels are becoming more and more severe, which seriously degrades multi-label classification performance. The GroPLE algorithm proposed by Kumar et al. [18] keeps the sparsity of each group invariant by embedding the label vectors in a low-dimensional space; features and labels are then mapped separately to that space through linear mappings, yielding an efficient MLL method. Wang et al. [19] proposed the GLSFL-LDCM algorithm, which uses spectral clustering to reduce the computational effort over class labels and alleviates the non-equilibrium label distribution with intra-class label density, effectively reducing the high time consumption and low classification accuracy on high-dimensional data. Liu et al. [20] proposed a new method for assessing local label non-equilibrium in datasets, using a local label over-sampling method (MLSOS) and an under-sampling method (MLUL) to address the label distribution non-equilibrium problem. The ACML algorithm proposed by Wang et al. [21] builds on LLSF and uses a conditional independence test to compute the asymmetric relationships between labels. Zhang et al. [22] used a label propagation method to convert the logical label matrix into a numerical matrix and applied a conditional independence test to capture the asymmetric relationships between the logical labels; the resulting CCSRMC algorithm mines richer semantic information. Relying on label asymmetry, Zhao et al. [23] proposed the LSGL algorithm, which classifies multiple labels under the assumption that global and local correlations exist simultaneously. Wu et al. [24] explained, by introducing the concept of PCMasking, why the conditional independence tests commonly used for Markov blankets (MB) may lead to a low MB discovery rate, and proposed the CCMB algorithm, which improves the conditional independence test through cross-checking and complementary MB. The ELCS algorithm proposed by Yang et al. [25] introduces the concept of the N-structure and, for the first time, designs an efficient Markov blanket learning procedure; combining the Markov blanket with the N-structure not only learns the Markov structure of the target variable but also finds both direct and indirect causes that distinguish it. Yu et al. [26] used the local label causal structure to learn the causal relationships of each class label and select features carrying causal information accordingly; the proposed ML2C algorithm achieves better classification performance and can correct false discoveries caused by LC.

3 Model Construction and Optimization

3.1 Model Construction

In multi-label learning, \({\varvec{X}}\) is the feature matrix and \({\varvec{Y}}\) is the label matrix, with \({\varvec{X}}\in {\mathbb{R}}^{n\times d}\) and \({\varvec{Y}}\in {\mathbb{R}}^{n\times {l}}\), where \({l}\), \(n\), and \(d\) are the numbers of labels, instances, and features, respectively. The dataset is \({\varvec{D}}=\left\{\left({{\varvec{x}}}_{1},{{\varvec{y}}}_{1}\right),\left({{\varvec{x}}}_{2},{{\varvec{y}}}_{2}\right),\dots ,({{\varvec{x}}}_{n},{{\varvec{y}}}_{n})\right\}\), where \({{\varvec{x}}}_{n}=\left\{{x}_{n}^{1},{x}_{n}^{2},\dots ,{x}_{n}^{j}\right\}\) and \({{\varvec{y}}}_{n}=\left\{{y}_{n}^{1},{y}_{n}^{2},\dots ,{y}_{n}^{i}\right\}\) \(\left(j=1,\dots ,d,\ i=1,\dots ,{l}\right)\) denote the feature and label vectors. Classifying all labels with the same features is less accurate; more accurate classification can be achieved through LSF. Following the LLSF [14] algorithm proposed by Huang et al., the base model of CNSF can be written as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{Y}}\Vert }_{F}^{2}+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(1)

where \(\beta \) is the feature sparsity parameter and \({\varvec{W}}\) is the weight matrix with \({\varvec{W}}=\left[{{\varvec{w}}}_{1},{{\varvec{w}}}_{2},{{\varvec{w}}}_{3},\dots ,{{\varvec{w}}}_{{l}}\right]\in {\mathbb{R}}^{d\times {l}}\), where each \({{\varvec{w}}}_{{l}}\in {\mathbb{R}}^{d}\) denotes the LSF of one label.

For the non-equilibrium in label distribution caused by the sparse label space, the GLSFL-LDCM algorithm [19] initially alleviates the problem by calculating the inter-class density of labels. However, the resulting expansion of the classification margin is not significant on large datasets, and the inter-class calculation treats every positive (or negative) label within an instance as equally important, ignoring the differing importance of labels across instances and ultimately preventing an accurate calculation of the correlation between labels. The intra-class non-equilibrium method not only accounts for the differences between labels but also further widens the classification margin between them, relieving the label distribution non-equilibrium problem. The details are shown in Table 1, and Eq. (2) gives the intra-class non-equilibrium matrix:

Table 1 Labels change with inter-class and intra-class non-equilibrium
$$P=\left\{\begin{array}{c}\sum_{i=1}^{{l}}\frac{{I(y}_{i}^{{l}}=1)}{{\text{n}}}+{y}_{i}^{{l}}\\ -\sum_{i=1}^{{l}}\frac{{I(y}_{i}^{{l}}=0)}{{\text{n}}}-{y}_{i}^{{l}}\end{array}\right.$$
(2)

Here \({\varvec{P}}\) represents the label density within each instance, \({y}_{i}^{{l}}\) denotes the \(i\)th label of the \({l}\)th instance in the dataset, and \(I(\bullet )\) is an indicator function that returns 1 when \({y}_{i}^{{l}}=1\) and 0 otherwise. The distributional non-equilibrium of the original label matrix \({\varvec{Y}}\) can be effectively mitigated by this intra-class non-equilibrium procedure. Substituting Eq. (2), Eq. (1) can be rewritten as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(3)
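For concreteness, the following minimal numpy sketch shows one reading of the intra-class density transformation in Eq. (2): positive labels are mapped to the per-instance positive-label density plus 1, and negative labels to the negated per-instance negative-label density. The function name and the choice of the per-instance label count as the normalization constant are our assumptions, not part of the original formulation.

```python
import numpy as np

def intra_class_density(Y):
    """One reading of Eq. (2): replace the logical label matrix Y (n x l, entries in {0, 1})
    with a density matrix P in which positive labels become (positive density + 1) and
    negative labels become the negated negative density, widening the margin between them."""
    Y = np.asarray(Y, dtype=float)
    n, l = Y.shape
    pos_density = Y.sum(axis=1, keepdims=True) / l          # fraction of positive labels per instance
    neg_density = (1.0 - Y).sum(axis=1, keepdims=True) / l  # fraction of negative labels per instance
    P = np.where(Y == 1, pos_density + 1.0, -neg_density)   # piecewise mapping of Eq. (2)
    return P
```

Under this reading, a toy instance with labels [1, 0, 0, 1] yields 1.5 for its positive labels and −0.5 for its negative labels, which matches the margin-widening behaviour described above.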

Although this method can initially alleviate the excessive reliance of traditional LC on the original labels, it does not distinguish spurious correlations between them. Causal algorithms can infer the causal relationships between labels from limited data, thereby yielding more realistic LC and extracting more realistic LSF.

Currently, extensive research leverages causal relationships as prior knowledge to facilitate model learning and enhance model interpretability. In multi-label learning, the causal relationships between labels can significantly improve classification performance. For example, if two labels have a strong causal relationship, then the correlation computed between them by cosine similarity will also be strong. In causal learning, researchers have proposed many constraint-based causal structure learning methods built on conditional independence tests, the most classical of which is the PC [27] algorithm. Its main idea is to use the chi-square test to measure the difference between the joint and marginal probability distributions and thereby test whether the corresponding variables are independent. According to the Markov assumption, the conditional independence relationships implied by a Bayesian network factorize the joint probability distribution as:

$$Y\left(V\right)={\prod }_{i=1}^{{l}}Y\left({V}_{i}|Yc\left({V}_{i}\right)\right)$$
(4)

where \(Yc\left({V}_{i}\right)\) denotes the set of parents of node \({V}_{i}\). Equation (4) indicates that the conditional independence assumptions implied by the Bayesian network factorize \(Y(V)\) into a series of local conditional probability distributions, each representing the conditional probability of a variable given its parents in the network. This provides a low-dimensional representation of the complex high-dimensional probability distribution \(Y\left({V}_{1},{V}_{2},\dots ,{V}_{{l}}\right)\), which is otherwise difficult to compute directly. Assuming \(({V}_{1},{V}_{2},\dots ,{V}_{{l}})\) is a topological ordering of the directed acyclic graph over \(V\), the distribution \(Y(V)\) can also be decomposed using the chain rule of conditional probabilities:

$$Y\left({V}_{1},{V}_{2},\dots ,{V}_{{l}}\right)=Y\left({V}_{1}\right){\prod }_{i=2}^{{l}}Y\left({V}_{i}\right|{V}_{1},\dots ,{V}_{i-1})$$
(5)

We can use the chi-square test to compute the p-value \(\rho \) and obtain the causal relationship matrix \({\varvec{M}}\). This approach has certain drawbacks, however. First, multi-label causal algorithms depend strongly on assumptions about the data, and the absence of manually annotated causal graphs that could be learned from in multi-label learning limits research on causal multi-label learning to some extent. Second, what the PC algorithm actually obtains is a directed acyclic graph, which does not guarantee that it captures the true causal relationships between the labels.
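As an illustration of the conditional-independence-test idea, the sketch below runs only the zero-order (pairwise) chi-square tests that form the first step of the PC skeleton search over binary label columns; a full PC implementation would additionally condition on growing subsets and orient the edges. The function name and the default significance level are our assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def pairwise_ci_adjacency(Y, alpha=0.05):
    """Zero-order chi-square independence tests between binary label pairs.

    Returns a symmetric 0/1 adjacency matrix M whose (i, j) entry is 1 when the
    test rejects independence of labels i and j at level `alpha`. This is only
    the first step of the PC skeleton search, not the full PC algorithm."""
    n, l = Y.shape
    M = np.zeros((l, l), dtype=int)
    for i in range(l):
        for j in range(i + 1, l):
            # 2x2 contingency table of label i against label j
            table = np.array([
                [np.sum((Y[:, i] == 1) & (Y[:, j] == 1)),
                 np.sum((Y[:, i] == 1) & (Y[:, j] == 0))],
                [np.sum((Y[:, i] == 0) & (Y[:, j] == 1)),
                 np.sum((Y[:, i] == 0) & (Y[:, j] == 0))],
            ])
            if table.min() == 0:            # degenerate table: keep no edge
                continue
            _, p, _, _ = chi2_contingency(table)
            if p < alpha:                   # dependence detected -> add an edge
                M[i, j] = M[j, i] = 1
    return M
```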

Next, we use cosine similarity to measure the correlation \({\varvec{C}}\) between the density labels and set \({\varvec{R}}=1-{\varvec{C}}\), where \(\alpha \) is the hyperparameter of the causal label density correlation term and \(\alpha >0\). The causal label density correlation matrix \({\varvec{D}}\) is constructed by combining causality with the label density correlation so as to mine richer semantic information about the labels, where \({\varvec{D}}={\varvec{M}}\odot {\varvec{R}}\) and \(\odot \) denotes the Hadamard product. Combining Eq. (3), the final model can be written as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\alpha tr\left({\varvec{D}}{{\varvec{W}}}^{{\text{T}}}{\varvec{W}}\right)+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(6)
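A minimal sketch of how the causal label density correlation matrix \({\varvec{D}}\) in Eq. (6) can be assembled from the pieces above is given below; it simply combines the density matrix P, the causal adjacency M, and the cosine-similarity-based R = 1 − C through a Hadamard product. The helper name is ours.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def causal_density_correlation(P, M):
    """Build D = M ⊙ R with R = 1 - C, where C is the cosine similarity
    between the columns (labels) of the density matrix P."""
    C = cosine_similarity(P.T)   # (l, l) cosine similarity between density-label columns
    R = 1.0 - C                  # correlation term used by the regularizer tr(D W^T W)
    return M * R                 # Hadamard product with the causal adjacency matrix
```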

3.2 Model Optimization

The CNSF model is a convex optimization problem. Owing to the non-smoothness of the \({{l}}_{1}\)-norm, we adopt the accelerated proximal gradient descent method (APGD) [28] to iteratively solve the non-smooth problem for the weight matrix \({\varvec{W}}\). The objective function is:

$$\underset{W\in \mathcal{H}}{{\text{min}}}F\left({\varvec{W}}\right)=f\left({\varvec{W}}\right)+g\left({\varvec{W}}\right)$$
(7)

where \(\mathcal{H}\) is a Hilbert space and the expressions for \(f\left({\varvec{W}}\right)\) and \(g\left({\varvec{W}}\right)\) are given in Eqs. (8) and (9). Both are convex functions, and the gradient of \(f\left({\varvec{W}}\right)\) satisfies the Lipschitz condition.

$$f\left({\varvec{W}}\right)=\frac{1}{2}{\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\alpha tr\left({\varvec{D}}{{\varvec{W}}}^{{\text{T}}}{\varvec{W}}\right)$$
(8)
$$g\left({\varvec{W}}\right)=\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(9)
$$\nabla f\left({\varvec{W}}\right)={{\varvec{X}}}^{{\text{T}}}{\varvec{X}}{\varvec{W}}-{{\varvec{X}}}^{{\text{T}}}{\varvec{P}}+2\alpha {\varvec{W}}{\varvec{D}}$$
(10)

For any matrices \({{\varvec{W}}}_{1}\) and \({{\varvec{W}}}_{2}\), we have:

$$\Vert \nabla f\left({{\varvec{W}}}_{1}\right)-\nabla f\left({{\varvec{W}}}_{2}\right)\Vert \le {L}_{g}\Vert \Delta {\varvec{W}}\Vert $$
(11)

where \({L}_{g}\) is the Lipschitz constant and \(\Delta {\varvec{W}}={{\varvec{W}}}_{1}-{{\varvec{W}}}_{2}\). Introducing the quadratic approximation \({\varvec{Q}}\left({\varvec{W}},{{\varvec{W}}}^{(t)}\right)\) of \(F\left({\varvec{W}}\right)\), we have:

$$Q\left({\varvec{W}},{{\varvec{W}}}^{\left(t\right)}\right)=f\left({{\varvec{W}}}^{\left(t\right)}\right)+\left(\nabla f\left({{\varvec{W}}}^{\left(t\right)}\right),{\varvec{W}}-{{\varvec{W}}}^{\left(t\right)}\right)+\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{W}}}^{\left(t\right)}\Vert }_{F}^{2}+g\left({\varvec{W}}\right)$$
(12)

Let \({{\varvec{q}}}_{t}={{\varvec{W}}}^{\left(t\right)}-\frac{1}{{L}_{g}}\nabla f\left({{\varvec{W}}}^{\left(t\right)}\right)\), then:

$${{\varvec{W}}}_{{\varvec{t}}}={\text{arg}}\underset{W}{{\text{min}}}Q\left({\varvec{W}},{{\varvec{W}}}^{\left(t\right)}\right)={\text{arg}}\underset{W}{{\text{min}}}\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{q}}}^{\left(t\right)}\Vert }_{F}^{2}+\frac{\beta }{{L}_{g}}{\Vert {\varvec{W}}\Vert }_{1}$$
(13)

The optimization algorithm proposed by Lin et al. [29] indicated that:

$${{\varvec{W}}}^{\left(t\right)}={{\varvec{W}}}_{{\varvec{t}}}+\frac{{\theta }_{t-1}-1}{{\theta }_{t}}\left({{\varvec{W}}}_{t}-{{\varvec{W}}}_{t-1}\right)$$
(14)

In Eq. (14), \({\theta }_{t}\) satisfies \({\theta }_{t+1}^{2}-{\theta }_{t+1}\le {\theta }_{t}^{2}\), which improves the convergence rate to \({\varvec{O}}\left({t}^{-2}\right)\). \({{\varvec{W}}}_{t}\) is the result of the \(t\)th iteration of \({\varvec{W}}\). The soft-threshold function used to perform the iterative update is shown in Eq. (15):

$${{\varvec{W}}}_{t+1}={{\varvec{S}}}_{\varepsilon }\left[{{\varvec{q}}}^{\left(t\right)}\right]={\text{arg}}\underset{W}{{\text{min}}}\varepsilon {\Vert {\varvec{W}}\Vert }_{1}+\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{q}}}^{\left(t\right)}\Vert }_{F}^{2}$$
(15)

where \({{\varvec{S}}}_{\varepsilon }\left[\bullet \right]\) is the soft-threshold operator. For any element \({x}_{ij}\) and \(\varepsilon =\frac{\beta }{{L}_{g}}\), we have:

$${{\varvec{S}}}_{\varepsilon }\left({x}_{ij}\right)=\left\{\begin{array}{ll}{x}_{ij}-\varepsilon & {x}_{ij}>\varepsilon \\ {x}_{ij}+\varepsilon & {x}_{ij}<-\varepsilon \\ 0& \text{otherwise}\end{array}\right.$$
(16)

The Lipschitz constant is derived from \(\nabla f\left({\varvec{W}}\right)\):

$${\Vert \nabla f\left({{\varvec{W}}}_{1}\right)-\nabla f\left({{\varvec{W}}}_{2}\right)\Vert }_{F}^{2}={\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Delta {\varvec{W}}\Vert }_{F}^{2}+{\Vert 2\alpha \Delta {\varvec{W}}{\varvec{D}}\Vert }_{F}^{2}\le 2{\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Vert }_{2}^{2}{\Vert \Delta {\varvec{W}}\Vert }_{F}^{2}+4\alpha {\Vert {\varvec{D}}\Vert }_{2}^{2}{\Vert \Delta {\varvec{W}}\Vert }_{F}^{2}$$
(17)

therefore, the Lipschitz constant for the CNSF model is:

$${L}_{g}=\sqrt{2\left({\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Vert }_{2}^{2}+2\alpha {\Vert {\varvec{D}}\Vert }_{2}^{2}\right)}$$
(18)

Algorithm 1 describes the APGD method for solving the objective function \(F\left({\varvec{W}}\right)\) and outputting the weights \({\varvec{W}}\). In Algorithm 1, Steps 2 and 4 obtain the causal label density matrix, Step 3 calculates the density labels using Eq. (2), and in Step 8, \({{\varvec{q}}}_{t}\left({\varvec{W}}\right)\) is an intermediate variable and \(\nabla f(\bullet )\) denotes the gradient.

Algorithm 1 Causality-driven intra-class non-equilibrium label-specific features learning. Input: training dataset {X, Y}, parameters α, β. Output: W
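The following is a minimal numpy sketch of the APGD loop summarized in Algorithm 1, using the gradient of Eq. (10), the extrapolation step of Eq. (14), the soft-threshold operator of Eq. (16), and the Lipschitz constant of Eq. (18); the zero initialization, stopping rule, and function names are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def soft_threshold(Q, eps):
    """Element-wise soft-threshold operator S_eps of Eq. (16)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - eps, 0.0)

def cnsf_apgd(X, P, D, alpha, beta, max_iter=200, tol=1e-6):
    """Minimal APGD sketch for Eq. (6); variable names follow the paper."""
    n, d = X.shape
    l = P.shape[1]
    XtX, XtP = X.T @ X, X.T @ P
    # Lipschitz constant of Eq. (18)
    Lg = np.sqrt(2 * (np.linalg.norm(XtX, 2) ** 2 +
                      2 * alpha * np.linalg.norm(D, 2) ** 2))
    W = W_prev = np.zeros((d, l))
    theta = theta_prev = 1.0
    for _ in range(max_iter):
        # extrapolation step of Eq. (14)
        V = W + (theta_prev - 1.0) / theta * (W - W_prev)
        grad = XtX @ V - XtP + 2 * alpha * V @ D                 # gradient of Eq. (10)
        W_prev, W = W, soft_threshold(V - grad / Lg, beta / Lg)  # Eqs. (15)-(16)
        theta_prev, theta = theta, (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        if np.linalg.norm(W - W_prev) < tol:                     # our stopping rule
            break
    return W
```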

3.3 Complexity Analysis

The complexity of CNSF mainly consists of three parts. First, the complexity of calculating the label density matrix is \({\varvec{O}}\left(1/2{{l}}^{2}\right)\). Second, the complexity of using the PC algorithm to compute causal relationships between labels is \({\varvec{O}}\left(3/2{{l}}^{2}\right)\). Third, the complexity of the accelerated gradient descent is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d\right){{l}}^{2}\right)\). Therefore, the overall complexity of CNSF is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d+2\right){{l}}^{2}\right)\). To compare the comprehensive performance of our algorithm with others, we compare it with LSF-CI [16], LSML [17], ACML [21], LLSF [14], CLML [6] and CCSRMC [22]. The algorithms closest to ours are CCSRMC and ACML. The complexity of CCSRMC is \({\varvec{O}}\left(nd{l}\left(nd+n+d\right)+n{l}\left({{l}}^{2}+{n}^{3}{l}+n\right)+d{{l}}^{2}\right)\), while that of ACML is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d+3/2\right){{l}}^{2}\right)\). Our complexity is lower than that of CCSRMC and slightly higher than that of ACML. The FF-MLLA and GLSFL-LDCM algorithms do not provide a corresponding complexity analysis. Table 2 summarizes the complexities of our algorithm and the compared algorithms.

Table 2 The complexities of different algorithms

4 Experiment

4.1 Datasets

This article conducts experiments on 13 multi-label benchmark datasets from the Yahoo and Mulan websites. Among them, the Birds dataset was first used in the MLSP competition to acoustically classify multiple bird species simultaneously in noisy environments. The Medical dataset is a text dataset originally used for clinical medical diagnosis, while the Genbase dataset describes multiple proteins and their structural categories. Table 3 shows detailed information about these 13 datasets; the download links for the datasets are given at the bottom of the table.

Table 3 Multi-label datasets

4.2 Comparison Algorithms and Parameter Settings

In this paper, Hamming Loss (HL), Average Precision (AP), One Error (OE), Ranking Loss (RL) and Coverage (CV) are selected to evaluate the performance of the CNSF algorithm and the eight comparison algorithms. The literature [30, 31] provides definitions and formulas for these five metrics: the larger the AP value, the better the algorithm performance, while for the other four metrics smaller values are better (a sketch of these metrics is given after the parameter list below). The parameter settings are as follows:

  1. LSF-CI [16]: the parameters are set to \(\alpha ={2}^{10}, \beta ={2}^{8}, \gamma =1, \theta ={2}^{-8}\).

  2. LLSF [14]: the parameters are set to \(\alpha ={2}^{-4}, \beta ={2}^{-6}, \gamma =1\).

  3. LSML [17]: the parameters are set to \({\lambda }_{1}={10}^{1}, {\lambda }_{2}={10}^{-5}, {\lambda }_{3}={10}^{-3}, {\lambda }_{4}={10}^{-5}\).

  4. FF-MLLA [15]: the parameters are set to \({\text{k}}=15\), \(\beta =1\), kernel parameter \({\text{RBF}}=100\).

  5. ACML [21]: the parameters are set to \(\alpha \in \left[{2}^{-10},{2}^{10}\right], \beta \in \left[{2}^{-10},{2}^{10}\right]\).

  6. CCSRMC [22]: the parameters are set to \(\alpha \in \left[{2}^{-10},{2}^{-1}\right], \beta \in \left[{2}^{-10},{2}^{10}\right], \gamma \in \left[{2}^{-10},{2}^{6}\right]\).

  7. CLML [6]: the parameters are set to \(\alpha , \beta , {\lambda }_{1}, {\lambda }_{2}\in \left[{2}^{-10},{2}^{10}\right]\).

  8. GLSFL-LDCM [19]: the parameters are set to \(\alpha =1, \mu ,\gamma \in \left\{0.1, 1, 10\right\}, K=\left[1:1:10\right], \varepsilon =0.01\).

  9. CNSF: the parameters are set to \(\alpha \in \left[{10}^{-10},{10}^{-1}\right], \beta \in \left[{10}^{-10},{10}^{4}\right]\), \(\Phi =0.05\).
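As referenced above, the sketch below computes the five evaluation metrics from a binary ground-truth matrix and real-valued prediction scores; we rely on scikit-learn where a metric exists and implement One Error by hand. The thresholding rule and the −1 convention for Coverage are our assumptions, since papers differ in these details.

```python
import numpy as np
from sklearn.metrics import (hamming_loss, label_ranking_loss,
                             coverage_error,
                             label_ranking_average_precision_score)

def multilabel_metrics(Y_true, Y_score, threshold=0.5):
    """HL, AP, OE, RL and CV for a binary label matrix Y_true and real-valued scores."""
    Y_pred = (Y_score >= threshold).astype(int)       # hard predictions for Hamming Loss
    hl = hamming_loss(Y_true, Y_pred)
    ap = label_ranking_average_precision_score(Y_true, Y_score)
    rl = label_ranking_loss(Y_true, Y_score)
    cv = coverage_error(Y_true, Y_score) - 1          # some papers report coverage without the -1
    top = np.argmax(Y_score, axis=1)                  # one-error: top-ranked label is irrelevant
    oe = np.mean(Y_true[np.arange(len(top)), top] == 0)
    return {"HL": hl, "AP": ap, "OE": oe, "RL": rl, "CV": cv}
```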

Table 4 presents the experimental results of the CNSF algorithm on 13 datasets in comparison with eight state-of-the-art algorithms using five different metrics. The symbol "↑" ("↓") indicates that a higher (lower) metric value is better, and "–" indicates that the algorithm cannot output experimental results on the corresponding dataset. The bold font highlights the algorithm with superior performance. Further analysis is as follows:

  1. As can be seen from Table 4, the CNSF algorithm achieves the best result in 49 of the 65 groups of experimental results, a dominance rate of 75.4%. CNSF significantly outperforms the other comparison algorithms on 11 of the 13 datasets, and its variance is smaller, showing that the CNSF algorithm is more stable.

  2. The CNSF algorithm is significantly better than CCSRMC and ACML, both of which use distance covariance to perform conditional independence tests and obtain asymmetric relationships between labels. CCSRMC outperforms ACML because it uses label propagation to convert logical labels into numerical labels with richer latent semantics; however, this conversion inevitably introduces loss and misclassification, and CCSRMC does not account for the non-equilibrium of the label distribution, directly using cosine similarity to calculate label correlation. This is why the CNSF algorithm outperforms CCSRMC.

  3. CNSF outperforms both the state-of-the-art CLML algorithm and the GLSFL-LDCM algorithm on several metrics, indicating that considering causality together with the intra-class non-equilibrium method can effectively improve the learning of label-specific features.

  4. On the Genbase dataset, the results of the CNSF algorithm are slightly lower than those of CLML because CNSF, while accounting for the asymmetric relationships among labels, does not consider the features shared among them. On the Reference dataset, the results of CNSF are second only to CCSRMC, which sacrifices accuracy on the AP metric while performing far better than the other algorithms on the remaining four metrics.

  5. Table 5 gives the average ranking results of the CNSF algorithm and the eight comparison algorithms on the five metrics. CNSF ranks first on all metrics, which again demonstrates that using intra-class density relations and causal correlations can effectively improve multi-label classification performance.

Table 4 Test results of each algorithm on five metrics
Table 5 Average rank of each algorithm on five metrics

5 Algorithm Analysis

5.1 Ablation Analysis

To demonstrate that the causal LC and intra-class non-equilibrium proposed in this paper can effectively improve multi-label classification performance, we decompose the CNSF algorithm into CSF, a causal-learning variant that does not consider non-equilibrium, and NSF, a variant that considers intra-class non-equilibrium without causality. To demonstrate the superiority of intra-class over inter-class non-equilibrium, we also construct a causality-driven inter-class non-equilibrium label-specific feature learning variant (CNSF-IC); the performance of each algorithm is shown in Fig. 2. The CNSF algorithm significantly outperforms the other variants on all five metrics. Intra-class non-equilibrium clearly outperforms inter-class non-equilibrium and can learn richer LSF. CNSF also outperforms NSF, suggesting that causal relationships can further improve multi-label classification performance on top of intra-class non-equilibrium.

Fig. 2 Comparison of CNSF and degenerate algorithms on five metrics

5.2 Parameter Sensitivity Analysis

The CNSF algorithm involves two important parameters, \(\alpha \) and \(\beta \). Parameter \(\alpha \) adjusts the causal co-occurrence matrix, and parameter \(\beta \) adjusts the sparsity of the features. To analyze the sensitivity of these parameters, we conducted visualization experiments on the Birds dataset using the Bar3 function, with both parameters ranging over \(\left({10}^{-10},{10}^{4}\right)\). Figure 3 shows that the five metrics differ in their sensitivity to \(\alpha \) and \(\beta \), but overall, the results are better when \(\alpha \in \left({10}^{-10},{10}^{-2}\right)\) and \(\beta \in \left({10}^{-10},{10}^{4}\right)\). When \(\alpha \ge {10}^{0}\), the results deteriorate because the selected features become too sparse. Moreover, when the algorithm was tested with \(\alpha ={10}^{0}\) and \(\beta \in \left({10}^{2},{10}^{4}\right)\), some results deteriorated sharply because the strong causal density correlation combined with the sparsity of the selected labels caused the model to overfit. We therefore recommend setting the hyperparameters to \(\alpha \in \left({10}^{-10},{10}^{-2}\right)\) and \(\beta \in \left({10}^{-10},{10}^{4}\right)\).

Fig. 3 Parameter sensitivity analysis on Birds dataset

5.3 Statistical Hypothesis Test

The statistical hypothesis tests in this paper use a significance level of \(\varphi =0.05\). The Friedman test [32] is used to assess the comprehensive performance of the CNSF algorithm over all datasets. The resulting statistic \({F}_{F}\) is compared with the critical value \(\left(CV\right)\) of the Friedman test: if \({F}_{F}\) is greater than the critical value, the null hypothesis is rejected, and otherwise it is accepted. The results are shown in Table 6. Since \({F}_{F}\) exceeds \(CV\) on every metric, the null hypothesis is rejected in all cases.
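For reference, the statistic \({F}_{F}\) reported in Table 6 is commonly the Iman–Davenport variant of the Friedman statistic; the short sketch below shows this standard computation from a per-dataset rank matrix. Whether the paper uses exactly this variant is our assumption.

```python
import numpy as np

def friedman_FF(ranks):
    """Friedman chi-square statistic and its F-distributed variant F_F.

    ranks : (N, K) matrix of per-dataset ranks of the K compared algorithms."""
    N, K = ranks.shape
    Rj = ranks.mean(axis=0)                                   # average rank of each algorithm
    chi2 = 12.0 * N / (K * (K + 1)) * (np.sum(Rj ** 2) - K * (K + 1) ** 2 / 4.0)
    FF = (N - 1) * chi2 / (N * (K - 1) - chi2)                # Iman–Davenport correction
    return chi2, FF
```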

Table 6 The Friedman statistics \({F}_{F}\) of the critical value and each evaluation metric

The CNSF algorithm is then compared with the other eight algorithms on all datasets using the Nemenyi test [32]. A significant difference exists when the difference in the mean rankings of two algorithms over all datasets is greater than the critical difference (CD); otherwise there is no significant difference. The CD value is calculated as follows:

$$CD={q}_{\varphi }\sqrt{\frac{K\left(K+1\right)}{6N}}$$
(19)

where \(K=9\), \(N=13\), \({q}_{\varphi }=3.1020\), and thus \(CD=3.3321\). Figure 4 compares the CNSF algorithm with the other algorithms on the five metrics, with algorithm performance decreasing from left to right. On the HL and AP metrics, there are no significant differences among CNSF, CCSRMC, and CLML. On the OE metric, there are no significant differences among CNSF, CCSRMC, CLML, and GLSFL-LDCM. On the RL metric, there are no significant differences among CNSF, CCSRMC, and GLSFL-LDCM. On the CV metric, there are no significant differences among CNSF, CCSRMC, ACML, FF-MLLA, CLML, and GLSFL-LDCM. In all other cases, CNSF differs significantly from the remaining algorithms. The results of the statistical hypothesis tests confirm the validity of the algorithm proposed in this paper.
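The CD value quoted above follows directly from Eq. (19); the short check below reproduces it.

```python
import math

K, N, q_phi = 9, 13, 3.1020                  # algorithms, datasets, Nemenyi critical value
CD = q_phi * math.sqrt(K * (K + 1) / (6 * N))
print(round(CD, 4))                          # 3.3321, the value used in Fig. 4
```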

Fig. 4 Nemenyi test

6 Conclusion

LC can effectively improve the learning of LSF. However, the non-equilibrium distribution of the original labels leads to spurious correlations in the computed LC. To obtain a more accurate LC, we first adopt the intra-class non-equilibrium method to address the imbalanced distribution of the original labels. Then, the PC algorithm is used to calculate the adjacency matrix of the labels, which expresses the causal relationships between them. Finally, a causal density label correlation is constructed to guide the learning of more accurate LSF. The algorithm also has some limitations. First, there are no manually annotated causal graphs available for learning in multi-label learning, and the algorithm relies heavily on the independent and identically distributed assumption, which limits research on causal multi-label learning to a certain extent. Second, the calculation of LC is overly dependent on the distribution of the original labels and does not correct false causality during iterative optimization. How to calculate more accurate LC and causal relationships will be the focus of future work.