1 Introduction

Multi-label learning (MLL) trains a model that can assign all possible labels to unseen instances and has been applied in domains such as text classification [1], image annotation [2], protein function detection [3], and personalized recommendation [4]. For example, in image annotation, a landscape image often carries labels such as sky, sea, and beach. With the rapid development of the Internet, data are generally characterized by high dimensionality [5]. Most existing label-specific feature (LSF) learning methods [6] are embedding-based and project high-dimensional data into a low-dimensional latent space, which effectively mitigates the curse of dimensionality and improves the classification performance of MLL.

In MLL, logical labels easily introduce non-equilibrium in the instance distribution and degrade multi-label classification performance. Converting logical labels into numerical labels through label enhancement is therefore beginning to emerge; this approach also has promising applications in intelligent fault detection within artificial intelligence [7,8,9]. However, label enhancement preprocesses the data and thus incurs high time complexity. Non-equilibrium-based approaches, in contrast, not only reduce the time complexity of the algorithm but also achieve excellent classification performance.

Data non-equilibrium [10] can be divided into two types: inter-class non-equilibrium and intra-class non-equilibrium. Although methods based on inter-class non-equilibrium can initially relieve the distribution non-equilibrium of multi-label data, they only divide all labels into positive and negative categories and ignore the correlation between labels. In contrast, methods based on intra-class non-equilibrium calculate the density of each label within each instance to replace the original label. This not only alleviates the label classification non-equilibrium and widens the classification margin between labels but also takes the correlation between different labels into account.

Label correlation (LC) [11] has been widely applied in LSF learning. In studying LC, many scholars have found that multi-label datasets generally exhibit non-equilibrium label distributions. It has gradually been realized that labels are not simply symmetrically related to each other; they can be asymmetrically related, or even causally related with a specific direction. Asymmetry and causality are both relationships between labels, but causality [12] carries a clearer directional relationship than asymmetry.

For example, Fig. 1 shows a possible causal graph of depression, with heartbreak and abnormal neurotransmitter secretion as causes and depression as the effect. However, depression does not necessarily lead to abnormal neurotransmitter secretion, so the causal relationship between them is asymmetric. In contrast, cosine similarity yields a label correlation that is always symmetric, which leads to inaccurate LSF learning. The PC algorithm can learn true causal relationships among a small number of labels, which corrects spurious label information and improves the efficiency of LSF learning.

Fig. 1 Causal relationship of depression

Because label distribution non-equilibrium is prevalent in multi-label data, this paper alleviates the problem by replacing the original label matrix with the label density matrix derived from the intra-class non-equilibrium method. To obtain more realistic label-specific features while keeping the running time of the algorithm low, we construct a neighbor matrix that encodes the causal relationships among the original labels and combine it with the correlation of the density matrix computed by cosine similarity. This resolves the spurious correlations mixed into the label correlation obtained by traditional methods and, in turn, guides the learning of label-specific features. Based on the above analysis, this paper proposes a causality-driven intra-class non-equilibrium label-specific feature learning algorithm (CNSF), whose contributions are as follows:

  1. Using the PC algorithm to compute causal relationships between labels and combining them with the density label correlation to construct a causal density label correlation, used in place of traditional LC, guides the learning of a more accurate LSF.

  2. Using the intra-class non-equilibrium method to calculate the label density of all instances in place of the original matrix effectively alleviates the non-equilibrium of the label distribution and further widens the classification margin between labels.

  3. The algorithm is decomposed in an ablation analysis and compared with the CNSF-IC variant, which considers inter-class density and causality. The results demonstrate that considering intra-class density and causality effectively improves the classification performance of MLL.

The remaining sections of this paper are organized as follows: Sect. 2 presents related work. Section 3 introduces the construction and optimization of the CNSF model. Section 4 presents the datasets, evaluation metrics, compared algorithms, and parameter settings used in the experiments. Section 5 reports the ablation analysis and statistical hypothesis tests that demonstrate the effectiveness of the proposed method. Section 6 provides a summary.

2 Related Work

Traditional MLL algorithms assume that all class labels are distinguished using the same features. However, this assumption is not reasonable, and the resulting classification is often sub-optimal. The LIFT algorithm proposed by Zhang et al. [13] assumes that each class label is classified based on its own unique features, which significantly improves MLL classification performance compared with traditional problem transformation and algorithm adaptation methods; however, it does not take LC into account. The LLSF algorithm proposed by Huang et al. [14] effectively improves LSF classification performance based on the assumption that strongly correlated labels share more features than weakly correlated or uncorrelated ones; however, considering LC alone is not enough to achieve good classification performance. The FF-MLLA algorithm proposed by Cheng et al. [15] measures inter-sample similarity using the Minkowski distance based on LC and classifies multiple labels using singular value decomposition and an extreme learning machine. Han et al. [16] proposed the LSF-CI algorithm, which further improves MLL performance by considering not only LC but also the correlation between instances, computed from the feature space with a probabilistic graphical model. The LSML algorithm proposed by Huang et al. [17] learns to recover LSF by obtaining a label complement matrix from the missing labels through high-order LC, which effectively addresses the missing-label problem.

With the rapid development of the Internet, the dimensionality of multi-label data keeps increasing, and the sparsity and non-equilibrium of labels are becoming more and more severe, which seriously degrades multi-label classification performance. The GroPLE algorithm proposed by Kumar et al. [18] keeps the sparsity of each group invariant by embedding the label vectors in a low-dimensional space; features and labels are then mapped separately to that space through linear mappings, yielding an efficient MLL method. Wang et al. [19] proposed the GLSFL-LDCM algorithm, which uses spectral clustering to reduce the computational effort over class labels and alleviates the non-equilibrium label distribution with intra-class label density, effectively reducing the high time consumption and low classification accuracy on high-dimensional data. Liu et al. [20] proposed a new method for assessing local label non-equilibrium in datasets, using a local label over-sampling method (MLSOS) and an under-sampling method (MLUL) to address the label distribution non-equilibrium problem. The ACML algorithm proposed by Wang et al. [21] builds on LLSF and uses a conditional independence test to compute the asymmetric relationships between labels. Zhang et al. [22] used a label propagation method to convert the logical label matrix into a numerical matrix and applied a conditional independence test to capture the asymmetric relationships between the logical labels; the resulting CCSRMC algorithm mines richer semantic information. Relying on label asymmetry, Zhao et al. [23] proposed the LSGL algorithm, which classifies multiple labels under the assumption that global and local correlations exist simultaneously. Wu et al. [24] explained, by introducing the concept of PCMasking, why the conditional independence tests commonly used for Markov blankets (MB) may lead to a low MB discovery rate, and proposed the CCMB algorithm, which improves the conditional independence test through cross-checking and complementary MB. The ELCS algorithm proposed by Yang et al. [25] introduces the concept of the N-structure and, for the first time, designs an efficient Markov blanket learning procedure; combining the Markov blanket with the N-structure not only learns the Markov structure of the target variable but also finds both direct and indirect causes that distinguish it. Yu et al. [26] used the local label causal structure to learn the causal relationships of each class label and select features carrying causal information accordingly; the proposed ML2C algorithm achieves better classification performance and can correct false discoveries caused by LC.

3 Model Construction and Optimization

3.1 Model Construction

In multi-label learning, \({\varvec{X}}\) is the feature matrix and \({\varvec{Y}}\) is the label matrix, with \({\varvec{X}}\in {\mathbb{R}}^{n\times d}\) and \({\varvec{Y}}\in {\mathbb{R}}^{n\times {l}}\), where \({l}\), \(n\), and \(d\) are the numbers of labels, instances, and features, respectively. The dataset is \({\varvec{D}}=\left\{\left({{\varvec{x}}}_{1},{{\varvec{y}}}_{1}\right),\left({{\varvec{x}}}_{2},{{\varvec{y}}}_{2}\right),\dots ,({{\varvec{x}}}_{n},{{\varvec{y}}}_{n})\right\}\), where \({{\varvec{x}}}_{n}=\left\{{x}_{n}^{1},{x}_{n}^{2},\dots ,{x}_{n}^{j}\right\}\) and \({{\varvec{y}}}_{n}=\left\{{y}_{n}^{1},{y}_{n}^{2},\dots ,{y}_{n}^{i}\right\}\) \(\left(j=1,\dots ,d,\ i=1,\dots ,{l}\right)\) denote the feature and label vectors. Classifying all labels with the same features is less accurate; more accurate classification can be achieved through LSF. Following the LLSF [14] algorithm proposed by Huang et al., the base model of CNSF can be written as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{Y}}\Vert }_{F}^{2}+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(1)

where \(\beta \) is the feature sparsity parameter and \({\varvec{W}}\) is the weight matrix with \({\varvec{W}}=\left[{{\varvec{w}}}_{1},{{\varvec{w}}}_{2},{{\varvec{w}}}_{3},\dots ,{{\varvec{w}}}_{{l}}\right]\in {\mathbb{R}}^{d\times {l}}\), where each \({{\varvec{w}}}_{{l}}\in {\mathbb{R}}^{d}\) denotes the LSF of one label.

For the non-equilibrium in label distribution caused by the sparse label space, the GLSFL-LDCM algorithm [19] initially alleviates the problem by calculating the inter-class density of labels. However, the resulting expansion of the classification margin is not significant on large datasets, and the inter-class calculation treats every positive (or negative) label within an instance as equally important, ignoring the differing importance of labels across instances and ultimately preventing an accurate calculation of the correlation between labels. The intra-class non-equilibrium method not only accounts for the differences between labels but also further widens the classification margin between them, relieving the label distribution non-equilibrium problem. The details are shown in Table 1, and Eq. (2) gives the intra-class non-equilibrium matrix:

Table 1 Labels change with inter-class and intra-class non-equilibrium
$$P=\left\{\begin{array}{c}\sum_{i=1}^{{l}}\frac{{I(y}_{i}^{{l}}=1)}{{\text{n}}}+{y}_{i}^{{l}}\\ -\sum_{i=1}^{{l}}\frac{{I(y}_{i}^{{l}}=0)}{{\text{n}}}-{y}_{i}^{{l}}\end{array}\right.$$
(2)

Here \({\varvec{P}}\) represents the label density within each instance, \({y}_{i}^{{l}}\) denotes the \(i\)th label of the \({l}\)th instance in the dataset, and \(I(\bullet )\) is an indicator function that returns 1 when \({y}_{i}^{{l}}=1\) and 0 otherwise. The distributional non-equilibrium of the original label matrix \({\varvec{Y}}\) can be effectively mitigated by this intra-class non-equilibrium procedure. Substituting Eq. (2), Eq. (1) can be rewritten as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(3)
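For concreteness, the following minimal numpy sketch shows one reading of the intra-class density transformation in Eq. (2): positive labels are mapped to the per-instance positive-label density plus 1, and negative labels to the negated per-instance negative-label density. The function name and the choice of the per-instance label count as the normalization constant are our assumptions, not part of the original formulation.

```python
import numpy as np

def intra_class_density(Y):
    """One reading of Eq. (2): replace the logical label matrix Y (n x l, entries in {0, 1})
    with a density matrix P in which positive labels become (positive density + 1) and
    negative labels become the negated negative density, widening the margin between them."""
    Y = np.asarray(Y, dtype=float)
    n, l = Y.shape
    pos_density = Y.sum(axis=1, keepdims=True) / l          # fraction of positive labels per instance
    neg_density = (1.0 - Y).sum(axis=1, keepdims=True) / l  # fraction of negative labels per instance
    P = np.where(Y == 1, pos_density + 1.0, -neg_density)   # piecewise mapping of Eq. (2)
    return P
```

Under this reading, a toy instance with labels [1, 0, 0, 1] yields 1.5 for its positive labels and −0.5 for its negative labels, which matches the margin-widening behaviour described above.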

Although this method can initially alleviate the excessive reliance of traditional LC on the original labels, it does not distinguish spurious correlations between them. Causal algorithms can infer the causal relationships between labels from limited data, thereby yielding more realistic LC and extracting more realistic LSF.

Currently, extensive research leverages causal relationships as prior knowledge to facilitate model learning and enhance model interpretability. In multi-label learning, the causal relationships between labels can significantly improve classification performance. For example, if two labels have a strong causal relationship, then the correlation computed between them by cosine similarity will also be strong. In causal learning, researchers have proposed many constraint-based causal structure learning methods built on conditional independence tests, the most classical of which is the PC [27] algorithm. Its main idea is to use the chi-square test to measure the difference between the joint and marginal probability distributions and thereby test whether the corresponding variables are independent. According to the Markov assumption, the conditional independence relationships implied by a Bayesian network factorize the joint probability distribution as:

$$Y\left(V\right)={\prod }_{i=1}^{{l}}Y\left({V}_{i}|Yc\left({V}_{i}\right)\right)$$
(4)

where \(Yc\left({V}_{i}\right)\) denotes the set of parents of node \({V}_{i}\). Equation (4) indicates that the conditional independence assumptions implied by the Bayesian network factorize \(Y(V)\) into a series of local conditional probability distributions, each representing the conditional probability of a variable given its parents in the network. This provides a low-dimensional representation of the complex high-dimensional probability distribution \(Y\left({V}_{1},{V}_{2},\dots ,{V}_{{l}}\right)\), which is otherwise difficult to compute directly. Assuming \(({V}_{1},{V}_{2},\dots ,{V}_{{l}})\) is a topological ordering of the directed acyclic graph over \(V\), the distribution \(Y(V)\) can also be decomposed using the chain rule of conditional probabilities:

$$Y\left({V}_{1},{V}_{2},\dots ,{V}_{{l}}\right)=Y\left({V}_{1}\right){\prod }_{i=2}^{{l}}Y\left({V}_{i}\right|{V}_{1},\dots ,{V}_{i-1})$$
(5)

We can use the chi-square test to compute the p-value \(\rho \) and obtain the causal relationship matrix \({\varvec{M}}\). This approach has certain drawbacks, however. First, multi-label causal algorithms depend strongly on assumptions about the data, and the absence of manually annotated causal graphs that could be learned from in multi-label learning limits research on causal multi-label learning to some extent. Second, what the PC algorithm actually obtains is a directed acyclic graph, which does not guarantee that it captures the true causal relationships between the labels.
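As an illustration of the conditional-independence-test idea, the sketch below runs only the zero-order (pairwise) chi-square tests that form the first step of the PC skeleton search over binary label columns; a full PC implementation would additionally condition on growing subsets and orient the edges. The function name and the default significance level are our assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def pairwise_ci_adjacency(Y, alpha=0.05):
    """Zero-order chi-square independence tests between binary label pairs.

    Returns a symmetric 0/1 adjacency matrix M whose (i, j) entry is 1 when the
    test rejects independence of labels i and j at level `alpha`. This is only
    the first step of the PC skeleton search, not the full PC algorithm."""
    n, l = Y.shape
    M = np.zeros((l, l), dtype=int)
    for i in range(l):
        for j in range(i + 1, l):
            # 2x2 contingency table of label i against label j
            table = np.array([
                [np.sum((Y[:, i] == 1) & (Y[:, j] == 1)),
                 np.sum((Y[:, i] == 1) & (Y[:, j] == 0))],
                [np.sum((Y[:, i] == 0) & (Y[:, j] == 1)),
                 np.sum((Y[:, i] == 0) & (Y[:, j] == 0))],
            ])
            if table.min() == 0:            # degenerate table: keep no edge
                continue
            _, p, _, _ = chi2_contingency(table)
            if p < alpha:                   # dependence detected -> add an edge
                M[i, j] = M[j, i] = 1
    return M
```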

Next, we use cosine similarity to measure the correlation \({\varvec{C}}\) between the density labels and set \({\varvec{R}}=1-{\varvec{C}}\), where \(\alpha \) is the hyperparameter of the causal label density correlation term and \(\alpha >0\). The causal label density correlation matrix \({\varvec{D}}\) is constructed by combining causality with the label density correlation so as to mine richer semantic information about the labels, where \({\varvec{D}}={\varvec{M}}\odot {\varvec{R}}\) and \(\odot \) denotes the Hadamard product. Combining Eq. (3), the final model can be written as:

$${\underset{{\varvec{W}}}{{\text{min}}}\frac{1}{2}\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\alpha tr\left({\varvec{D}}{{\varvec{W}}}^{{\text{T}}}{\varvec{W}}\right)+\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(6)
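A minimal sketch of how the causal label density correlation matrix \({\varvec{D}}\) in Eq. (6) can be assembled from the pieces above is given below; it simply combines the density matrix P, the causal adjacency M, and the cosine-similarity-based R = 1 − C through a Hadamard product. The helper name is ours.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def causal_density_correlation(P, M):
    """Build D = M ⊙ R with R = 1 - C, where C is the cosine similarity
    between the columns (labels) of the density matrix P."""
    C = cosine_similarity(P.T)   # (l, l) cosine similarity between density-label columns
    R = 1.0 - C                  # correlation term used by the regularizer tr(D W^T W)
    return M * R                 # Hadamard product with the causal adjacency matrix
```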

3.2 Model Optimization

The CNSF model is a convex optimization problem. Owing to the non-smoothness of the \({{l}}_{1}\)-norm, we adopt the accelerated proximal gradient descent method (APGD) [28] to iteratively solve the non-smooth problem for the weight matrix \({\varvec{W}}\). The objective function is:

$$\underset{W\in \mathcal{H}}{{\text{min}}}F\left({\varvec{W}}\right)=f\left({\varvec{W}}\right)+g\left({\varvec{W}}\right)$$
(7)

where \(\mathcal{H}\) is a Hilbert space and the expressions for \(f\left({\varvec{W}}\right)\) and \(g\left({\varvec{W}}\right)\) are given in Eqs. (8) and (9). Both are convex functions, and the gradient of \(f\left({\varvec{W}}\right)\) satisfies the Lipschitz condition.

$$f\left({\varvec{W}}\right)=\frac{1}{2}{\Vert {\varvec{X}}{\varvec{W}}-{\varvec{P}}\Vert }_{F}^{2}+\alpha tr\left({\varvec{D}}{{\varvec{W}}}^{{\text{T}}}{\varvec{W}}\right)$$
(8)
$$g\left({\varvec{W}}\right)=\beta {\Vert {\varvec{W}}\Vert }_{1}$$
(9)
$$\nabla f\left({\varvec{W}}\right)={{\varvec{X}}}^{{\text{T}}}{\varvec{X}}{\varvec{W}}-{{\varvec{X}}}^{{\text{T}}}{\varvec{P}}+2\alpha {\varvec{W}}{\varvec{D}}$$
(10)

For any matrices \({{\varvec{W}}}_{1}\) and \({{\varvec{W}}}_{2}\), we have:

$$\Vert \nabla f\left({{\varvec{W}}}_{1}\right)-\nabla f\left({{\varvec{W}}}_{2}\right)\Vert \le {L}_{g}\Vert \Delta {\varvec{W}}\Vert $$
(11)

where \({L}_{g}\) is the Lipschitz constant and \(\Delta {\varvec{W}}={{\varvec{W}}}_{1}-{{\varvec{W}}}_{2}\). Introducing the quadratic approximation \({\varvec{Q}}\left({\varvec{W}},{{\varvec{W}}}^{(t)}\right)\) of \(F\left({\varvec{W}}\right)\), we have:

$$Q\left({\varvec{W}},{{\varvec{W}}}^{\left(t\right)}\right)=f\left({{\varvec{W}}}^{\left(t\right)}\right)+\left(\nabla f\left({{\varvec{W}}}^{\left(t\right)}\right),{\varvec{W}}-{{\varvec{W}}}^{\left(t\right)}\right)+\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{W}}}^{\left(t\right)}\Vert }_{F}^{2}+g\left({\varvec{W}}\right)$$
(12)

Let \({{\varvec{q}}}_{t}={{\varvec{W}}}^{\left(t\right)}-\frac{1}{{L}_{g}}\nabla f\left({{\varvec{W}}}^{\left(t\right)}\right)\), then:

$${{\varvec{W}}}_{{\varvec{t}}}={\text{arg}}\underset{W}{{\text{min}}}Q\left({\varvec{W}},{{\varvec{W}}}^{\left(t\right)}\right)={\text{arg}}\underset{W}{{\text{min}}}\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{q}}}^{\left(t\right)}\Vert }_{F}^{2}+\frac{\beta }{{L}_{g}}{\Vert {\varvec{W}}\Vert }_{1}$$
(13)

The optimization algorithm proposed by Lin et al. [29] indicated that:

$${{\varvec{W}}}^{\left(t\right)}={{\varvec{W}}}_{{\varvec{t}}}+\frac{{\theta }_{t-1}-1}{{\theta }_{t}}\left({{\varvec{W}}}_{t}-{{\varvec{W}}}_{t-1}\right)$$
(14)

In Eq. (14), \({\theta }_{t}\) satisfies \({\theta }_{t+1}^{2}-{\theta }_{t+1}\le {\theta }_{t}^{2}\), which improves the convergence rate to \({\varvec{O}}\left({t}^{-2}\right)\). \({{\varvec{W}}}_{t}\) is the result of the \(t\)th iteration of \({\varvec{W}}\). The soft-threshold function used to perform the iterative update is shown in Eq. (15):

$${{\varvec{W}}}_{t+1}={{\varvec{S}}}_{\varepsilon }\left[{{\varvec{q}}}^{\left(t\right)}\right]={\text{arg}}\underset{W}{{\text{min}}}\varepsilon {\Vert {\varvec{W}}\Vert }_{1}+\frac{{L}_{g}}{2}{\Vert {\varvec{W}}-{{\varvec{q}}}^{\left(t\right)}\Vert }_{F}^{2}$$
(15)

where \({{\varvec{S}}}_{\varepsilon }\left[\bullet \right]\) is the soft-threshold operator. For any element \({x}_{ij}\) and \(\varepsilon =\frac{\beta }{{L}_{g}}\), we have:

$${{\varvec{S}}}_{\varepsilon }\left({x}_{ij}\right)=\left\{\begin{array}{ll}{x}_{ij}-\varepsilon & {x}_{ij}>\varepsilon \\ {x}_{ij}+\varepsilon & {x}_{ij}<-\varepsilon \\ 0& \text{otherwise}\end{array}\right.$$
(16)

The Lipschitz constant is derived from \(\nabla f\left({\varvec{W}}\right)\):

$${\Vert \nabla f\left({{\varvec{W}}}_{1}\right)-\nabla f\left({{\varvec{W}}}_{2}\right)\Vert }_{F}^{2}={\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Delta {\varvec{W}}\Vert }_{F}^{2}+{\Vert 2\alpha \Delta {\varvec{W}}{\varvec{D}}\Vert }_{F}^{2}\le 2{\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Vert }_{2}^{2}{\Vert \Delta {\varvec{W}}\Vert }_{F}^{2}+4\alpha {\Vert {\varvec{D}}\Vert }_{2}^{2}{\Vert \Delta {\varvec{W}}\Vert }_{F}^{2}$$
(17)

therefore, the Lipschitz constant for the CNSF model is:

$${L}_{g}=\sqrt{2\left({\Vert {{\varvec{X}}}^{{\text{T}}}{\varvec{X}}\Vert }_{2}^{2}+2\alpha {\Vert {\varvec{D}}\Vert }_{2}^{2}\right)}$$
(18)

Algorithm 1 describes the APGD method for solving the objective function \(F\left({\varvec{W}}\right)\) and outputting the weights \({\varvec{W}}\). In Algorithm 1, Steps 2 and 4 obtain the causal label density matrix, Step 3 calculates the density labels using Eq. (2), and in Step 8, \({{\varvec{q}}}_{t}\left({\varvec{W}}\right)\) is an intermediate variable and \(\nabla f(\bullet )\) denotes the gradient.

Algorithm 1 Causality-driven intra-class non-equilibrium label-specific features learning. Input: training dataset {X, Y}, parameters α, β. Output: W
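The following is a minimal numpy sketch of the APGD loop summarized in Algorithm 1, using the gradient of Eq. (10), the extrapolation step of Eq. (14), the soft-threshold operator of Eq. (16), and the Lipschitz constant of Eq. (18); the zero initialization, stopping rule, and function names are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def soft_threshold(Q, eps):
    """Element-wise soft-threshold operator S_eps of Eq. (16)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - eps, 0.0)

def cnsf_apgd(X, P, D, alpha, beta, max_iter=200, tol=1e-6):
    """Minimal APGD sketch for Eq. (6); variable names follow the paper."""
    n, d = X.shape
    l = P.shape[1]
    XtX, XtP = X.T @ X, X.T @ P
    # Lipschitz constant of Eq. (18)
    Lg = np.sqrt(2 * (np.linalg.norm(XtX, 2) ** 2 +
                      2 * alpha * np.linalg.norm(D, 2) ** 2))
    W = W_prev = np.zeros((d, l))
    theta = theta_prev = 1.0
    for _ in range(max_iter):
        # extrapolation step of Eq. (14)
        V = W + (theta_prev - 1.0) / theta * (W - W_prev)
        grad = XtX @ V - XtP + 2 * alpha * V @ D                 # gradient of Eq. (10)
        W_prev, W = W, soft_threshold(V - grad / Lg, beta / Lg)  # Eqs. (15)-(16)
        theta_prev, theta = theta, (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        if np.linalg.norm(W - W_prev) < tol:                     # our stopping rule
            break
    return W
```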

3.3 Complexity Analysis

The complexity of CNSF mainly consists of three parts. First, the complexity of calculating the label density matrix is \({\varvec{O}}\left(1/2{{l}}^{2}\right)\). Second, the complexity of using the PC algorithm to compute causal relationships between labels is \({\varvec{O}}\left(3/2{{l}}^{2}\right)\). Third, the complexity of the accelerated gradient descent is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d\right){{l}}^{2}\right)\). Therefore, the overall complexity of CNSF is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d+2\right){{l}}^{2}\right)\). To compare the comprehensive performance of our algorithm with others, we compare it with LSF-CI [16], LSML [17], ACML [21], LLSF [14], CLML [6] and CCSRMC [22]. The algorithms closest to ours are CCSRMC and ACML. The complexity of CCSRMC is \({\varvec{O}}\left(nd{l}\left(nd+n+d\right)+n{l}\left({{l}}^{2}+{n}^{3}{l}+n\right)+d{{l}}^{2}\right)\), while that of ACML is \({\varvec{O}}\left(\left(n+{l}+n{l}\right){d}^{2}+\left(n+d+3/2\right){{l}}^{2}\right)\). Our complexity is lower than that of CCSRMC and slightly higher than that of ACML. The FF-MLLA and GLSFL-LDCM algorithms do not provide a corresponding complexity analysis. Table 2 summarizes the complexities of our algorithm and the compared algorithms.

Table 2 The complexities of different algorithms

4 Experiment

4.1 Datasets

This article conducts experiments on 13 multi-label benchmark datasets from the Yahoo and Mulan websites. Among them, the Birds dataset was first used in the MLSP competition to acoustically classify multiple bird species simultaneously in noisy environments. The Medical dataset is a text dataset originally used for clinical medical diagnosis, while the Genbase dataset describes multiple proteins and their structural categories. Table 3 shows detailed information about these 13 datasets; the download links for the datasets are given at the bottom of the table.

Table 3 Multi-label datasets

4.2 Comparison Algorithms and Parameter Settings

In this paper, Hamming Loss (HL), Average Precision (AP), One Error (OE), Ranking Loss (RL) and Coverage (CV) are selected to evaluate the performance of the CNSF algorithm and the eight comparison algorithms. The literature [30, 31] provides definitions and formulas for these five metrics: the larger the AP value, the better the algorithm performance, while for the other four metrics smaller values are better (a sketch of these metrics is given after the parameter list below). The parameter settings are as follows:

  1. LSF-CI [16]: the parameters are set to \(\alpha ={2}^{10}, \beta ={2}^{8}, \gamma =1, \theta ={2}^{-8}\).

  2. LLSF [14]: the parameters are set to \(\alpha ={2}^{-4}, \beta ={2}^{-6}, \gamma =1\).

  3. LSML [17]: the parameters are set to \({\lambda }_{1}={10}^{1}, {\lambda }_{2}={10}^{-5}, {\lambda }_{3}={10}^{-3}, {\lambda }_{4}={10}^{-5}\).

  4. FF-MLLA [15]: the parameters are set to \({\text{k}}=15\), \(\beta =1\), kernel parameter \({\text{RBF}}=100\).

  5. ACML [21]: the parameters are set to \(\alpha \in \left[{2}^{-10},{2}^{10}\right], \beta \in \left[{2}^{-10},{2}^{10}\right]\).

  6. CCSRMC [22]: the parameters are set to \(\alpha \in \left[{2}^{-10},{2}^{-1}\right], \beta \in \left[{2}^{-10},{2}^{10}\right], \gamma \in \left[{2}^{-10},{2}^{6}\right]\).

  7. CLML [6]: the parameters are set to \(\alpha , \beta , {\lambda }_{1}, {\lambda }_{2}\in \left[{2}^{-10},{2}^{10}\right]\).

  8. GLSFL-LDCM [19]: the parameters are set to \(\alpha =1, \mu ,\gamma \in \left\{0.1, 1, 10\right\}, K=\left[1:1:10\right], \varepsilon =0.01\).

  9. CNSF: the parameters are set to \(\alpha \in \left[{10}^{-10},{10}^{-1}\right], \beta \in \left[{10}^{-10},{10}^{4}\right]\), \(\Phi =0.05\).
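As referenced above, the sketch below computes the five evaluation metrics from a binary ground-truth matrix and real-valued prediction scores; we rely on scikit-learn where a metric exists and implement One Error by hand. The thresholding rule and the −1 convention for Coverage are our assumptions, since papers differ in these details.

```python
import numpy as np
from sklearn.metrics import (hamming_loss, label_ranking_loss,
                             coverage_error,
                             label_ranking_average_precision_score)

def multilabel_metrics(Y_true, Y_score, threshold=0.5):
    """HL, AP, OE, RL and CV for a binary label matrix Y_true and real-valued scores."""
    Y_pred = (Y_score >= threshold).astype(int)       # hard predictions for Hamming Loss
    hl = hamming_loss(Y_true, Y_pred)
    ap = label_ranking_average_precision_score(Y_true, Y_score)
    rl = label_ranking_loss(Y_true, Y_score)
    cv = coverage_error(Y_true, Y_score) - 1          # some papers report coverage without the -1
    top = np.argmax(Y_score, axis=1)                  # one-error: top-ranked label is irrelevant
    oe = np.mean(Y_true[np.arange(len(top)), top] == 0)
    return {"HL": hl, "AP": ap, "OE": oe, "RL": rl, "CV": cv}
```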

Table 4 presents the experimental results of the CNSF algorithm on 13 datasets in comparison with eight state-of-the-art algorithms using five different metrics. The symbol "↑" ("↓") indicates that a higher (lower) metric value is better, and "–" indicates that the algorithm cannot output experimental results on the corresponding dataset. The bold font highlights the algorithm with superior performance. Further analysis is as follows:

  1. As can be seen from Table 4, the CNSF algorithm achieves the best result in 49 of the 65 groups of experimental results, a dominance rate of 75.4%. CNSF significantly outperforms the other comparison algorithms on 11 of the 13 datasets, and its variance is smaller, showing that the CNSF algorithm is more stable.

  2. The CNSF algorithm is significantly better than CCSRMC and ACML, both of which use distance covariance to perform conditional independence tests and obtain asymmetric relationships between labels. CCSRMC outperforms ACML because it uses label propagation to convert logical labels into numerical labels with richer latent semantics; however, this conversion inevitably introduces loss and misclassification, and CCSRMC does not account for the non-equilibrium of the label distribution, directly using cosine similarity to calculate label correlation. This is why the CNSF algorithm outperforms CCSRMC.

  3. CNSF outperforms both the state-of-the-art CLML algorithm and the GLSFL-LDCM algorithm on several metrics, indicating that considering causality together with the intra-class non-equilibrium method can effectively improve the learning of label-specific features.

  4. On the Genbase dataset, the results of the CNSF algorithm are slightly lower than those of CLML because CNSF, while accounting for the asymmetric relationships among labels, does not consider the features shared among them. On the Reference dataset, the results of CNSF are second only to CCSRMC, which sacrifices accuracy on the AP metric while performing far better than the other algorithms on the remaining four metrics.

  5. Table 5 gives the average ranking results of the CNSF algorithm and the eight comparison algorithms on the five metrics. CNSF ranks first on all metrics, which again demonstrates that using intra-class density relations and causal correlations can effectively improve multi-label classification performance.

Table 4 Test results of each algorithm on five metrics
Table 5 Average rank of each algorithm on five metrics

5 Algorithm Analysis

5.1 Ablation Analysis

To demonstrate that the causal LC and intra-class non-equilibrium proposed in this paper can effectively improve multi-label classification performance, we decompose the CNSF algorithm into CSF, a causal-learning variant that does not consider non-equilibrium, and NSF, a variant that considers intra-class non-equilibrium without causality. To demonstrate the superiority of intra-class over inter-class non-equilibrium, we also construct a causality-driven inter-class non-equilibrium label-specific feature learning variant (CNSF-IC); the performance of each algorithm is shown in Fig. 2. The CNSF algorithm significantly outperforms the other variants on all five metrics. Intra-class non-equilibrium clearly outperforms inter-class non-equilibrium and can learn richer LSF. CNSF also outperforms NSF, suggesting that causal relationships can further improve multi-label classification performance on top of intra-class non-equilibrium.

Fig. 2 Comparison of CNSF and degenerate algorithms on five metrics

5.2 Parameter Sensitivity Analysis

The CNSF algorithm involves two important parameters, \(\alpha \) and \(\beta \). Parameter \(\alpha \) adjusts the causal co-occurrence matrix, and parameter \(\beta \) adjusts the sparsity of the features. To analyze the sensitivity of these parameters, we conducted visualization experiments on the Birds dataset using the Bar3 function, with both parameters ranging over \(\left({10}^{-10},{10}^{4}\right)\). Figure 3 shows that the five metrics differ in their sensitivity to \(\alpha \) and \(\beta \), but overall, the results are better when \(\alpha \in \left({10}^{-10},{10}^{-2}\right)\) and \(\beta \in \left({10}^{-10},{10}^{4}\right)\). When \(\alpha \ge {10}^{0}\), the results deteriorate because the selected features become too sparse. Moreover, when the algorithm was tested with \(\alpha ={10}^{0}\) and \(\beta \in \left({10}^{2},{10}^{4}\right)\), some results deteriorated sharply because the strong causal density correlation combined with the sparsity of the selected labels caused the model to overfit. We therefore recommend setting the hyperparameters to \(\alpha \in \left({10}^{-10},{10}^{-2}\right)\) and \(\beta \in \left({10}^{-10},{10}^{4}\right)\).

Fig. 3 Parameter sensitivity analysis on Birds dataset

5.3 Statistical Hypothesis Test

The statistical hypothesis tests in this paper use a significance level of \(\varphi =0.05\). The Friedman test [32] is used to assess the comprehensive performance of the CNSF algorithm over all datasets. The resulting statistic \({F}_{F}\) is compared with the critical value \(\left(CV\right)\) of the Friedman test: if \({F}_{F}\) is greater than the critical value, the null hypothesis is rejected, and otherwise it is accepted. The results are shown in Table 6. Since \({F}_{F}\) exceeds \(CV\) on every metric, the null hypothesis is rejected in all cases.
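For reference, the statistic \({F}_{F}\) reported in Table 6 is commonly the Iman–Davenport variant of the Friedman statistic; the short sketch below shows this standard computation from a per-dataset rank matrix. Whether the paper uses exactly this variant is our assumption.

```python
import numpy as np

def friedman_FF(ranks):
    """Friedman chi-square statistic and its F-distributed variant F_F.

    ranks : (N, K) matrix of per-dataset ranks of the K compared algorithms."""
    N, K = ranks.shape
    Rj = ranks.mean(axis=0)                                   # average rank of each algorithm
    chi2 = 12.0 * N / (K * (K + 1)) * (np.sum(Rj ** 2) - K * (K + 1) ** 2 / 4.0)
    FF = (N - 1) * chi2 / (N * (K - 1) - chi2)                # Iman–Davenport correction
    return chi2, FF
```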

Table 6 The Friedman statistics \({F}_{F}\) of the critical value and each evaluation metric

The CNSF algorithm is then compared with the other eight algorithms on all datasets using the Nemenyi test [32]. A significant difference exists when the difference in the mean rankings of two algorithms over all datasets is greater than the critical difference (CD); otherwise there is no significant difference. The CD value is calculated as follows:

$$CD={q}_{\varphi }\sqrt{\frac{K\left(K+1\right)}{6N}}$$
(19)

where \(K=9\), \(N=13\), \({q}_{\varphi }=3.1020\), and thus \(CD=3.3321\). Figure 4 compares the CNSF algorithm with the other algorithms on the five metrics, with algorithm performance decreasing from left to right. On the HL and AP metrics, there are no significant differences among CNSF, CCSRMC, and CLML. On the OE metric, there are no significant differences among CNSF, CCSRMC, CLML, and GLSFL-LDCM. On the RL metric, there are no significant differences among CNSF, CCSRMC, and GLSFL-LDCM. On the CV metric, there are no significant differences among CNSF, CCSRMC, ACML, FF-MLLA, CLML, and GLSFL-LDCM. In all other cases, CNSF differs significantly from the remaining algorithms. The results of the statistical hypothesis tests confirm the validity of the algorithm proposed in this paper.
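The CD value quoted above follows directly from Eq. (19); the short check below reproduces it.

```python
import math

K, N, q_phi = 9, 13, 3.1020                  # algorithms, datasets, Nemenyi critical value
CD = q_phi * math.sqrt(K * (K + 1) / (6 * N))
print(round(CD, 4))                          # 3.3321, the value used in Fig. 4
```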

Fig. 4 Nemenyi test

6 Conclusion

LC can effectively improve the learning of LSF. However, the non-equilibrium distribution of the original labels leads to spurious correlations in the computed LC. To obtain a more accurate LC, we first adopt the intra-class non-equilibrium method to address the imbalanced distribution of the original labels. Then, the PC algorithm is used to calculate the adjacency matrix of the labels, which expresses the causal relationships between them. Finally, a causal density label correlation is constructed to guide the learning of more accurate LSF. The algorithm also has some limitations. First, there are no manually annotated causal graphs available for learning in multi-label learning, and the algorithm relies heavily on the independent and identically distributed assumption, which limits research on causal multi-label learning to a certain extent. Second, the calculation of LC is overly dependent on the distribution of the original labels and does not correct false causality during iterative optimization. How to calculate more accurate LC and causal relationships will be the focus of future work.