Abstract
Multi-view multi-label (MVML) learning is a framework for associating a single instance with a set of class labels when multiple types of data features are available. Extracting features shared among the views for label prediction is a common MVML approach. However, previous approaches assume that the number of associations and the degree of association of shared features are the same across views. In fact, both differ between views, and ignoring this leads to poor communicability among the views. Therefore, this paper proposes an MVML learning method based on inconsistent shared features extracted by a graph attention model. First, the shared and private features of multiple views are extracted. Next, the graph attention mechanism is adopted to learn the association degree of the shared features of different views, and the adjacency matrix and attention coefficients are calculated. The number of associations is determined by taking the obtained adjacency matrix as a mask matrix, while the association degree of shared features is measured by the attention weight matrix. Finally, the new shared features are obtained for multi-label prediction. We conducted experiments on seven MVML datasets to compare the proposed algorithm with seven advanced algorithms. The experimental results demonstrate the advantages of our algorithm.
1 Introduction
Multi-label learning [1] is a basic framework for solving real-life problems in which an instance is associated with multiple labels. As data grow more complex and feature extraction methods more diverse, traditional multi-label learning becomes less effective, because an instance associated with a set of class labels may also be described by multiple types of features [2,3,4]. For example, a picture of a landscape can be labeled as “sky”, “cloud”, and “sunrise”. Meanwhile, its features can be extracted from both the intuitive color characteristics of the picture and the color space of the three primary colors. Such a problem is called multi-view multi-label (MVML) learning [5].
A common approach to MVML classification is to extract shared features and private features. To extract the shared features of multiple views, Zhao et al. [6] proposed a two-step approach. First, the Hilbert–Schmidt Independence Criterion (HSIC) [7] was used to extract the view consistency information in the mapping process. Then, MVML learning was explored under missing labels. Tan et al. [8] developed an explicit model that explores the shared subspace using non-negative matrix factorization [9] to obtain the shared features. The method also explores the private features of different views and assigns weights to the views to enhance classification performance. To explore the consistency and complementarity among different views, an MVML algorithm with matrix completion [10] was proposed. This algorithm projects the information from different views into a low-dimensional space to learn a common representation, and then learns the reconstruction weights of each view to obtain complementary information. Many existing methods map the feature vectors of each view, which have different dimensions, to a shared subspace to obtain the shared features. However, each view has its own mapping matrix, so it is difficult to ensure that truly shared features are mined. To address this issue, Wu et al. [11] obfuscated the mapping of features from different views to the shared subspace by minimizing an adversarial loss. In addition, orthogonal constraints were applied to the obtained shared features to remove them from the original features and thereby obtain the private features.
The above methods of extracting shared features implicitly assume that shared features have the same number and degree of association in different views. The degree of association measures the correlation of features shared by different views. When the number and degree of association are in fact inconsistent, this assumption leads to the extraction of suboptimal shared features. Suboptimal shared features make the information interaction between views inaccurate, which indicates a deterioration of view communicability. We call this phenomenon the inconsistency of shared features. For instance, the Corel5k dataset contains information from six views: DenseHue, DenseSift, Gist, HSV, Lab, and RGB. The shared features were extracted using the method of Wu et al. [11]. The degree of association and the number of associations of the shared features of view 1 (DenseHue) and view 2 (DenseSift) are not the same as those of view 1 (DenseHue) and view 3 (Gist). As shown in Fig. 1a and b, the shared features of 16 randomly selected samples in these three views are taken, and the degree of association of the shared features is measured by cosine similarity. Two shared features from different views are considered associated when their cosine similarity is greater than 0.5. Figure 1c shows that the shared features of the 1st sample in view 1 are associated with the shared features of the 4th and 11th samples in view 2, and Fig. 1d shows that the shared features of the 1st sample in view 1 are associated with the shared features of the 4th, 7th, 9th, 13th, and 14th samples in view 3. The distance between them then indicates the degree of association.
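The association check described above can be sketched in a few lines. This is only an illustration of the cosine-similarity rule with the 0.5 threshold; the function names and the toy feature vectors are hypothetical and are not taken from the Corel5k experiment.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unassociated
    return dot / (norm_u * norm_v)

def association_matrix(view_a, view_b, threshold=0.5):
    """Return (similarities, mask) between the samples of two views.

    mask[i][j] is 1 when sample i of view_a and sample j of view_b are
    considered associated, i.e. their similarity is strictly above threshold.
    """
    sims = [[cosine_similarity(a, b) for b in view_b] for a in view_a]
    mask = [[1 if s > threshold else 0 for s in row] for row in sims]
    return sims, mask

# Toy shared features (hypothetical): 2 samples in view 1, 3 in view 2, k = 2.
view1 = [[1.0, 0.0], [0.0, 1.0]]
view2 = [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
sims, mask = association_matrix(view1, view2)

# Number of associations per sample of view 1 (row sums of the mask).
counts = [sum(row) for row in mask]
```

In this toy case the two samples of view 1 end up with different numbers of associated samples in view 2, which is the kind of inconsistency the text describes; the similarity values themselves play the role of the association degree.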
To solve this problem, this paper proposes the multi-view multi-label algorithm with shared features inconsistency (MMSFI-C). The framework of the algorithm is shown in Fig. 2. In the first part, the shared features and private features of the multiple views are extracted following [11]. The extraction of shared features here still rests on the previously mentioned assumption, which produces the results in Fig. 1. In this paper, this problem is addressed by the graph attention mechanism. In the second part, the obtained shared features are input into the graph attention network (GAT). GAT [12] is a representative graph convolutional network. By learning the weights of neighboring nodes, GAT achieves weighted aggregation of neighboring nodes. MMSFI-C learns the association degree of features shared between different views through the graph attention mechanism. First, the adjacency matrix and the attention coefficients are calculated. Then, the number of associations is determined by taking the obtained adjacency matrix as a mask matrix. Subsequently, the attention weight matrix is used to measure the degree of association of the shared features and to obtain new shared features. In the third part, the new shared features are concatenated with the previously extracted private features along the feature dimension. In the last part, the final multi-label prediction is performed through a fully connected layer.
The main contributions of this paper are summarized as follows:
(1) An MVML learning method based on GAT is designed, which addresses the problem of extracting shared features in the presence of inconsistency.
(2) The GAT is used to learn the association degree of features shared among different views, and the adjacency matrix and attention coefficients are calculated. The number of associations is determined by taking the adjacency matrix as a mask matrix. The attention weights are used to measure the degree of association between the shared features, and then new shared features are obtained.
(3) Comparative experiments are performed on seven MVML datasets, and the experimental results show that MMSFI-C achieves significantly better performance.
The remainder of this paper is organized as follows. Section 2 describes the recent work of multi-view multi-label learning and graph attention mechanism; Sect. 3 describes the concrete model structure and realization method; Sect. 4 reports the detailed experimental results with the experimental analysis; Sect. 5 concludes the paper, and Sect. 6 discusses the application of multi-view multi-label classification and research directions.
2 Related Work
2.1 Multi-view Multi-label Learning
The existing MVML learning algorithms are mainly divided into two categories. The first class of algorithms performs multi-label prediction for each view and then fuses the prediction results of all views. Zhu et al. [13] developed a label embedding method based on view relevance. They used HSIC to explore the consistency information of the views and finally combined the prediction results for each view. To explore the contribution of distinct views, Huang et al. [14] constructed classifiers for each view based on label relevance and view consistency, and they assigned corresponding weights to the views according to their contributions. To further fuse the information between multi-view features and labels, Zhao et al. [15] proposed a two-stage algorithm. They extracted view-specific labels and then maximized the dependencies. Because the two-step algorithm model is suboptimal, an end-to-end single hidden layer feedforward neural network framework was proposed by Zhao et al. [16]. The performance of the multi-label classifier was improved by fully exploiting the consistency and diversity information of the views.
Another category of algorithms converts multi-view multi-label problems into single-view multi-label problems: a common representation of the multiple views is learned first, followed by multi-label prediction using traditional methods. Zhu et al. [5] learned both global and local label correlations; then, they used a low-rank matrix to complete incomplete views; finally, a consistent representation of views was learned to encode the complementary information between views. To avoid the noise and redundancy generated by transforming a single view, a multi-view multi-label algorithm was proposed for image classification [17], which maps information from different views to a shared space while ensuring data sparsity. Zhang et al. [18] proposed an algorithm based on matrix decomposition to encode the consistency and complementarity of different views, which in turn leads to a consensus multi-view representation. However, the method cannot extract shared features when the number of views is large. Therefore, Zhang et al. [19] introduced tensor decomposition to obtain a shared space of higher-order relations. The enhancement of labels under multiple views was proposed for the first time to learn the shared features among views.
The majority of established multi-view multi-label methods use the obtained shared features for label prediction, and the importance of the shared features of the different views obtained by each of these methods is assumed to be the same. In fact, the results of Fig. 1 show that they are different. As the number and degree of association of different view features may vary, ignoring this issue can cause poor communication of views and thus adversely affect the label prediction results. Therefore, we propose an MVML framework based on GAT to solve this problem.
2.2 Graph Attention Network
The graph attention network is a type of graph convolutional neural network [20]. It was proposed to achieve better neighborhood aggregation with the attention mechanism [21], learning neighborhood weights with which the features of neighboring nodes are summed. The graph attention mechanism has also been used to solve NP-hard problems in graph theory, where NP-hard (nondeterministic polynomial-time hard) denotes a class of problems in computational complexity theory. Lei et al. [22] proposed a deep learning framework that combines residual networks with graph attention mechanisms, which is highly generalizable and applicable to a wide range of problems. Sun et al. [23] combined self-supervised learning with graph convolutional neural networks to learn the minimum set of parses for the graph. Inspired by these works, we employ a graph attention mechanism to address the NP-hard problem that may arise in different datasets.
Graph attention networks are now widely used in multi-label learning. Hu et al. [24] proposed a multi-label image learning method based on the graph attention mechanism that reduces the misconnection of objects in images and avoids the influence of noise when obtaining the dependencies of input objects. Pal et al. [25] used a graph attention mechanism for multi-label text classification, selecting important features by assigning different weights to the labels based on the correlations learned between the labels.
Motivated by the existing studies, this paper attempts to apply the graph structure to multi-view multi-label data. As a special data structure, the graph is a non-Euclidean space. It has two characteristics: (1) variable local input dimension, which is manifested by different numbers of neighbor nodes for each node; (2) disorderly ranking, which shows that there is only a connection between nodes without sequential order. Multi-view multi-label data has these two characteristics, as expressed by the fact that the number of neighboring features of each view feature is different and there is some degree of correlation between them. Therefore, this paper applies the graph attention mechanism to solve the problem of poor view communication caused by the existing methods in extracting shared features.
3 Proposed Approach
3.1 Problem Statement and Notations
\(F={\left\{{{\varvec{X}}}^{v}\right\}}_{v=1}^{h}\) denotes the original feature space data, which includes \(h\) views. \({{\varvec{X}}}^{v}={\left[{{\varvec{x}}}_{1},{{\varvec{x}}}_{2},\dots ,{{\varvec{x}}}_{n}\right]}^{{\text{T}}}\in {\mathbb{R}}^{n\times d}\) represents the feature space under the \(v\)-th view with \(n\) samples. \({\varvec{Y}}\in {\left\{0,1\right\}}^{n\times q}\) is the label space with \(q\) class labels. \({{\varvec{y}}}_{ij}=1\) indicates that the \(i\)-th sample has the \(j\)-th label; \({{\varvec{y}}}_{ij}=0\) indicates that the \(i\)-th sample does not have the \(j\)-th label. \({\widetilde{{\varvec{y}}}}_{ij}\) denotes the predicted label output. Our task is to build a model that makes multi-label predictions for unseen examples that contain multiple views.
3.2 Extraction of Shared and Private Features
Current MVML algorithms usually map the feature vectors from different views into a shared subspace to acquire the features shared by multiple views [3, 10, 18, 26]. However, each view's feature vector has its own mapping matrix. Because the mapping processes of different views are independent of each other, it is uncertain whether truly shared information is obtained. In addition, the private contributions of individual views should be considered when making multi-label predictions. We therefore confuse the mapping of view features to the shared space by minimizing an adversarial loss, which prevents the discriminator from determining which view the input shared features belong to [11, 27]. The features obtained in this way do not contain private features, which achieves the purpose of extracting shared features.
First, the \(k\)-dimensional shared features \({{\varvec{c}}}^{v}\) of the feature space \({{\varvec{X}}}^{v}\) under the \(v\)-th view are extracted by a shared subspace extraction layer \(f\left(\cdot \right)\) that uses the \(ReLu\) activation function. The extraction is given by \({{\varvec{c}}}^{v}=f\left({p}^{v}\left({{\varvec{X}}}^{v}\right)\right)\), where \({p}^{v}\left(\cdot \right)\) projects the feature vectors of different dimensions to \(k\) dimensions.
Next, the training set \(N=\left\{\left({{\varvec{c}}}_{i}^{v},{{\varvec{g}}}_{i}\right)|1\le v\le h,1\le i\le n\right\}\) is constructed, where \({{\varvec{g}}}_{i}\) represents the \(h\)-dimensional view label vector of \({{\varvec{c}}}_{i}^{v}\). The discriminator \(D\left(\cdot \right)\) is used for discrimination; its target value is 1 if \({{\varvec{c}}}_{i}^{v}\) belongs to the \(v\)-th view and 0 otherwise. The output is denoted as \({\widetilde{{\varvec{g}}}}_{i}=D\left({{\varvec{c}}}_{i}^{v}\right)\). The loss \({l}_{adv}\) for this part can be written as:
where \(\ell \left(x\right)={e}^{-x}\). However, the extraction of shared features following the above method may encounter a problem since noise that does not include semantic information may also confuse the discriminator. This paper adds the shared subspace multi-label loss \({l}_{sml}\) in order to ensure that the input information contains semantics. The training set \(M = \left\{ {\left( {{\varvec{c}}_{i}^{v} ,{\varvec{y}}_{i} } \right)\left| {1 \le v \le h,1 \le i \le n} \right.} \right\}\) is constructed. The output value after the prediction layer \(P\) is denoted as \({\widetilde{{\varvec{y}}}}_{i}^{\prime}\), \({\widetilde{{\varvec{y}}}}_{i}^{\prime}=P\left({{\varvec{c}}}_{i}^{v}\right)\). The loss term \({l}_{sml}\) for this part can be expressed as:
The loss term \({l}_{shared}\) for shared features is expressed in Eq. (3):
MMSFI-C removes the shared features from the original features, and the remaining features are the private features. This is accomplished by imposing orthogonal constraints on the extracted shared features. The \(k\)-dimensional private features \({{\varvec{q}}}^{v}\) of the feature space \({{\varvec{X}}}^{v}\) under the \(v\)-th view are extracted by a private space extraction layer \(w\left(\cdot \right)\) consisting of a fully connected layer and a \(ReLu\) activation function, \({{\varvec{q}}}^{v}=w\left({{\varvec{X}}}^{v}\right)\). \({\varvec{C}}\) is a \(k\)-dimensional feature vector that includes features shared by all views, i.e., \({\varvec{C}}=\sum_{v=1}^{h}{{\varvec{c}}}^{v}\). The loss term \({l}_{private}\) for private features is expressed as:
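The effect of an orthogonal constraint can be illustrated numerically. The sketch below is not the paper's \({l}_{private}\); it assumes one common form of such a penalty, the squared Frobenius norm of \({\varvec{C}}^{\text{T}}{\varvec{Q}}\) for batch matrices of shared and private features, and the function name and toy matrices are hypothetical.

```python
def orthogonality_penalty(C, Q):
    """Squared Frobenius norm of C^T Q for two n x k feature matrices.

    The value is zero exactly when every column of C is orthogonal to every
    column of Q, so minimizing it pushes the private features away from the
    directions already covered by the shared features.
    (Assumed form of the constraint, for illustration only.)
    """
    n = len(C)
    total = 0.0
    for a in range(len(C[0])):
        for b in range(len(Q[0])):
            entry = sum(C[i][a] * Q[i][b] for i in range(n))
            total += entry * entry
    return total

# Two samples (rows), k = 2 feature dimensions (columns); toy values.
shared = [[1.0, 0.0], [0.0, 0.0]]
private_overlapping = [[0.0, 1.0], [0.0, 0.0]]  # reuses the first sample's direction
private_orthogonal = [[0.0, 0.0], [0.0, 1.0]]   # lives on the second sample only

penalty_overlap = orthogonality_penalty(shared, private_overlapping)
penalty_orthogonal = orthogonality_penalty(shared, private_orthogonal)
```

A penalty of this kind is typically added to the training objective with a weight, so that private features are driven toward the orthogonal case.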
3.3 Shared Features Based on Graph Attention Mechanism
For the shared features obtained in Sect. 3.2, the degree of association between the shared features of different views is not available, i.e., each associated view is treated equally. However, different associated views play different roles in label prediction. To solve this problem, the graph attention mechanism is adopted in this paper.
The graph attention mechanism builds the contextual representation vector of the current node by aggregating the feature vector of the current node with those of its related nodes in a weighted manner. That is, the neighborhood nodes are aggregated through a self-attention mechanism so that weights are adaptively matched to different neighborhood nodes.
Inspired by this, our study characterizes the feature vectors of different views in the same way to cope with the high dimensionality of multi-view multi-label data. As mentioned before, the number of feature associations is not the same between different views. In graph theory, the degree of a node is the number of its adjacent nodes, which is analogous to the number of neighboring view features of a view feature among the multi-view shared features. Thus, this paper adopts the graph attention mechanism to aggregate the associated views and adaptively match the weights of different associated views, thus enhancing the communication between views.
The shared features \({\varvec{C}}\) are taken as the input of the graph attention layer, and a weight matrix \({\varvec{W}}\in {\mathbb{R}}^{k\times k}\) is used to act on each feature. Then, the attention coefficients from the shared features of different views are calculated using self-attention. \(a\left(\cdot \right)\) denotes the self-attention mechanism. \({e}_{ij}\) represents the importance of view \(j\) to view \(i\), and it can be expressed as:
Based on \({\varvec{W}}\) and \(a\left(\cdot \right)\), the model can learn the attention coefficients between view \(i\) and view \(j\). Then, the obtained attention coefficients are non-linearized as follows:
Finally, the attention coefficients are normalized by \(softmax\) with the following expression:
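The three steps above (scoring, non-linearization, and normalization) match the standard formulation of GAT [12]. Under that assumption, they can be written as follows; the use of LeakyReLU as the non-linearity, the intermediate symbol \({\widehat{e}}_{ij}\), and the neighbor set \({\mathcal{N}}_{i}\) (the views associated with view \(i\)) are taken from the standard GAT formulation rather than from this text.

```latex
\[
e_{ij} = a\left(\boldsymbol{W}\boldsymbol{c}^{i},\, \boldsymbol{W}\boldsymbol{c}^{j}\right), \qquad
\widehat{e}_{ij} = \mathrm{LeakyReLU}\left(e_{ij}\right), \qquad
a_{ij} = \mathrm{softmax}_{j}\left(\widehat{e}_{ij}\right)
       = \frac{\exp\left(\widehat{e}_{ij}\right)}{\sum_{m \in \mathcal{N}_{i}} \exp\left(\widehat{e}_{im}\right)}
\]
```

In the standard GAT, \(a\left(\cdot \right)\) is a single-layer feedforward network applied to the concatenation of the two transformed feature vectors, and the softmax makes the coefficients of each view sum to one over its neighbors.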
To improve the generalization ability of the attention mechanism, this paper uses multi-head attention layers. \(H\) groups of self-attention layers that are independent of each other are used, and then their results are combined. The shared features \({{\varvec{s}}}^{i}\) of the \(i\)-th view of the output are denoted as:
where Eq. (9) is used to calculate the adjacency matrix \({\varvec{A}}\) of the original shared features, and \({\varvec{A}}\) is used as a mask matrix. Eqs. (10) and (11) describe the calculation of the diagonal matrix \({\varvec{G}}\), and the diagonal matrix of the shared features is used in Eq. (10). Through this mask matrix, the activation function \(\sigma \left(\cdot \right)\) determines whether the attention coefficient \({a}_{ij}\) of the current view should be retained. Here, \(H\) is set to the number of views \(h\) so that more information about the views can be aggregated. A dropout layer is used to improve the generalization ability of the model, especially on small datasets; adding a dropout layer is essentially a random sampling of the associated views. The new shared features of the output are denoted as \({\varvec{S}}=\sum_{i=1}^{h}{{\varvec{s}}}^{i}\). These features are obtained by fusing the number of associations and the degree of association of the view features.
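To make the interplay of the mask matrix and the attention weights concrete, the following is a minimal single-head sketch in plain Python. It follows the standard GAT recipe (LeakyReLU scores, softmax restricted to unmasked neighbors) and is not the paper's exact Eqs. (9)–(11): multi-head combination, dropout, and the output activation are omitted, the mask \({\varvec{A}}\) is simply passed in, and all names and toy values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def masked_attention(C, W, a_src, a_dst, A):
    """Single-head graph attention over view-level shared features.

    C: list of h feature vectors (one per view); W: k x k weight matrix;
    a_src, a_dst: the two halves of the attention vector;
    A: h x h 0/1 mask (adjacency) matrix. A self-loop is always kept.
    Returns (weights, S): h x h attention weights and h new feature vectors.
    """
    Z = [matvec(W, c) for c in C]  # transformed features W c^i
    h, k = len(Z), len(Z[0])
    weights, S = [], []
    for i in range(h):
        # Raw scores e_ij, computed only where the mask allows an association.
        scores = {}
        for j in range(h):
            if A[i][j] or i == j:
                e = (sum(p * z for p, z in zip(a_src, Z[i]))
                     + sum(p * z for p, z in zip(a_dst, Z[j])))
                scores[j] = leaky_relu(e)
        # Softmax over the unmasked neighbors (shifted by the max for stability).
        top = max(scores.values())
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        row = [exps.get(j, 0.0) / total for j in range(h)]
        weights.append(row)
        # Weighted aggregation of the neighbors' transformed features.
        S.append([sum(row[j] * Z[j][d] for j in range(h)) for d in range(k)])
    return weights, S

# Toy example: h = 3 views, k = 2 dimensions, identity transform.
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1, 0, 1],   # view 1 is associated with itself and view 3
     [0, 1, 1],
     [1, 1, 1]]
weights, S = masked_attention(C, W, a_src=[0.0, 0.0], a_dst=[0.0, 1.0], A=A)
```

Here the mask fixes how many views each view attends to (the number of associations), while the softmax weights among the retained views express how strongly each one contributes (the degree of association).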
3.4 Label Prediction
\(P\left(\cdot \right)\) is the fully connected layer, which is used as the final label prediction layer in this paper. The dimension of the input matrix is \(n\times \left(h+1\right)k\), and the dimension of the output matrix is \(n\times q\). As shown in Eq. (12), \({{\varvec{P}}}_{final}\) indicates the feature acquired by merging the shared features with the private features along the feature dimension. \({{\varvec{T}}}_{out}\) denotes the output of the prediction layer \(P\left(\cdot \right)\) applied to \({{\varvec{P}}}_{final}\). \({{\varvec{T}}}_{pre}\) in Eq. (14) indicates the label decision for a sample made by the sign function, and the meaning of the values 1 and 0 output by the sign function in Eq. (15) is the same as in Sect. 3.1.
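The dimensions stated above can be checked with a small sketch for a single sample. The sigmoid output and the 0.5 decision threshold are assumptions made for illustration, since Eqs. (12)–(15) are not reproduced here; the function names, weights, and feature values are hypothetical.

```python
import math

def concat_features(shared, privates):
    """Concatenate the new shared features with each view's private features."""
    merged = list(shared)
    for q in privates:
        merged.extend(q)
    return merged

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_labels(p_final, weights, bias, threshold=0.5):
    """Fully connected layer, sigmoid, then a 0/1 decision per label.

    weights has one row per label; each row has len(p_final) entries.
    The sigmoid and the threshold are assumed, not taken from the paper.
    """
    scores = [sigmoid(sum(w * x for w, x in zip(row, p_final)) + b)
              for row, b in zip(weights, bias)]
    labels = [1 if s >= threshold else 0 for s in scores]
    return scores, labels

# k = 2, h = 2 views, q = 2 labels, so the input length is (h + 1) * k = 6.
shared = [0.5, 1.0]
privates = [[1.0, 0.0], [0.0, 2.0]]
p_final = concat_features(shared, privates)
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0, -1.0]]
bias = [0.0, 0.0]
scores, labels = predict_labels(p_final, weights, bias)
```

For a batch of \(n\) samples the same operation maps an \(n\times \left(h+1\right)k\) matrix to an \(n\times q\) matrix, as stated in the text.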
3.5 MMSFI-C
In this paper, a graph attention-based neural network framework called MMSFI-C is designed. The framework consists of two main parts: extracting the shared and private features of the multiple views, and using the graph attention mechanism to obtain new shared features that account for the inconsistency of the shared features. Finally, the obtained shared and private features are combined along the feature dimension for multi-label prediction.
The total loss function of MMSFI-C can be expressed as:
where \(\lambda \),\(\gamma \) are the weight parameters that control the loss term; \({l}_{shared}\) denotes the loss value when shared features are extracted; \({l}_{private}\) denotes the loss value when private features are extracted; \({l}_{ml}\) handles the multi-label loss of the model, and it is calculated as follows:
The Adam optimization method [28] is used for the MMSFI-C algorithm. In the following, an overview of the proposed algorithm forward propagation process is presented.
3.6 Complexity Analysis
The complexity of the algorithm MMSFI-C is divided into two main parts: feature extraction, and measuring the association degree of features shared by various views through the graph attention mechanism. The complexity is \(\mathcal{O}\left(hn{d}^{2}k+n{k}^{3}\right)\) when extracting the shared features and \(\mathcal{O}\left(hn{d}^{2}k+h{n}^{2}k\right)\) when extracting the private features. The complexity of measuring the association degree of features shared by various views through the graph attention mechanism is \(\mathcal{O}\left(h{n}^{4}{k}^{2}+h{n}^{4}k\right)\). Hence, the final complexity is \(\mathcal{O}\left(nk\left(2h{d}^{2}+{k}^{2}+hn+h{n}^{3}k+h{n}^{3}\right)\right)\), which is mainly affected by the sample size \(n\). The time cost of MMSFI-C on the seven datasets mentioned in Sect. 4.1 is shown in Table 1. As can be seen from the table, the algorithm takes less than 5 s on Emotions, the dataset with the fewest samples, and about 30 min on Mirflickr, the dataset with the most samples.
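The final complexity expression is the sum of the three stated terms with \(nk\) factored out, which can be verified term by term:

```latex
\[
\underbrace{hnd^{2}k + nk^{3}}_{\text{shared features}}
+ \underbrace{hnd^{2}k + hn^{2}k}_{\text{private features}}
+ \underbrace{hn^{4}k^{2} + hn^{4}k}_{\text{graph attention}}
= nk\left(2hd^{2} + k^{2} + hn + hn^{3}k + hn^{3}\right)
\]
```

For large \(n\), the graph attention term \(h{n}^{4}{k}^{2}\) dominates the sum, which is why the sample size has the strongest influence on the running time.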
4 Experiments
4.1 Datasets
The experiments have been performed on seven MVML datasets in order to verify the performance of the algorithms. These datasets were sourced from MULAN1 and LEAR2 and the specific information is shown in Table 2. Figure 3 shows the example images from the Pascal07 dataset.
4.2 Comparing Algorithms
To verify the performance of MMSFI-C, seven other algorithms are adopted for comparison. The algorithms, their parameter ranges, and their code sources are listed below.
1. SIMM [11]: A neural network framework that learns shared features and private features among multiple views for multi-label prediction. Here, \(\alpha \) is set to 1, and \(\beta \in \left[{10}^{-4},{ 10}^{-1}\right]\). Code: http://palm.seu.edu.cn/zhangml/
2. MMFA [29]: This algorithm considers the importance of each view feature to label prediction in the MVML problem. \(\lambda , \gamma \in \left[{10}^{-4},{10}^{4}\right]\). Code: https://github.com/chengyshaq/MMFA
3. ICM2L [8]: It explores individual and shared features of MVML data and explores rare labels. In this algorithm, \(\alpha \in \left[ 0.1, 1\right]\), \(\beta \in \left[0.1, 2\right]\). Code: http://mlda.swu.edu.cn/codes.php?name=ICM2L
4. iMvWL [30]: It learns shared representations with weak labels, local label relevance, and incomplete views. In this algorithm, \(\alpha \), \(\beta \in \left[{10}^{-5},{ 10}^{0}\right]\). Code: http://mlda.swu.edu.cn/codes.php?name=iMvWL
5. VLSF [14]: It learns label-specific view features and combines these features according to view contributions for the final prediction. In this algorithm, \({\lambda }_{1}\in \left[{10}^{-7},{ 10}^{-1}\right]\), \({\lambda }_{2}\in \left[{10}^{-1},{ 10}^{5}\right]\), \({\lambda }_{3}\in \left[{10}^{4}, { 10}^{6}\right]\), and \({\lambda }_{4}\in \left[{10}^{3},{ 10}^{8}\right]\). Code: http://www.escience.cn/people/huangjun/index.html
6. CDMM [16]: A non-iterative neural network framework for MVML learning that explores the consistency and diversity of view data. In this algorithm, \(\alpha \in \left[{10}^{-10},{ 10}^{-5}\right]\), \(\lambda \in \left[{10}^{1},{ 10}^{ 7}\right]\), \(C\in \left[{10}^{-5},{ 10}^{5}\right]\), \(\sigma \in \left[{10}^{-2},{ 10}^{2}\right]\), and \(\eta \in \left[0.5, 0.8\right]\). Code: https://github.com/chengyshaq/CDMM
7. TM3L [6]: A two-step algorithm based on subspace learning that deals with the missing label problem in multi-view multi-label learning. In this algorithm, \(\alpha ,C,\lambda \in \left[{10}^{-5},{ 10}^{5}\right]\), and \(\beta \) is set to 1. Code: https://github.com/chengyshaq/TM3L
8. MMSFI-C: In our algorithm, \(\lambda , \gamma \in \left[{10}^{-4},{ 10}^{4}\right]\). Code: https://github.com/chengyshaq/MMSFI-C
4.3 Evaluation Metrics
This paper uses six common evaluation metrics to evaluate the performance of the above algorithms, namely Hamming Loss (HL), Average Precision (AP), One Error (OE), Ranking Loss (RL), Coverage (CV), and Subset Accuracy (SA). HL examines the misclassification of samples on a single label; OE indicates the case where the top-ranked label is not part of the set of relevant labels; CV indicates the search depth required to cover all relevant labels; and RL examines the case where there is an ordering error in the ranked sequence of the category labels of the samples. The smaller the values of these four metrics (HL, OE, CV, and RL), the better the performance, whereas higher values of AP and SA indicate better performance. Details of the calculation of these metrics can be found in [31, 32].
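As a rough illustration of what these metrics measure, five of them can be sketched in plain Python using their commonly used definitions; the exact variants in [31, 32] may differ in details such as tie handling or normalization. The function names and toy data are hypothetical, and AP is omitted for brevity.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of sample-label pairs that are predicted incorrectly."""
    n, q = len(Y_true), len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (n * q)

def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    return sum(list(yt) == list(yp) for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)

def one_error(Y_true, scores):
    """Fraction of samples whose top-scored label is not relevant."""
    errors = 0
    for yt, s in zip(Y_true, scores):
        top = max(range(len(s)), key=lambda j: s[j])
        errors += int(yt[top] != 1)
    return errors / len(Y_true)

def coverage(Y_true, scores):
    """Average depth in the ranking needed to cover all relevant labels."""
    total = 0
    for yt, s in zip(Y_true, scores):
        relevant = [s[j] for j in range(len(s)) if yt[j] == 1]
        if relevant:
            lowest = min(relevant)
            # rank of the lowest-scored relevant label (1 = top), minus 1
            total += sum(v >= lowest for v in s) - 1
    return total / len(Y_true)

def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) pairs ranked in the wrong order.

    Samples with no relevant or no irrelevant labels are skipped (one of
    several conventions in use).
    """
    total, counted = 0.0, 0
    for yt, s in zip(Y_true, scores):
        pos = [s[j] for j in range(len(s)) if yt[j] == 1]
        neg = [s[j] for j in range(len(s)) if yt[j] == 0]
        if not pos or not neg:
            continue
        bad = sum(p <= m for p in pos for m in neg)
        total += bad / (len(pos) * len(neg))
        counted += 1
    return total / counted if counted else 0.0

# Toy data: 2 samples, 3 labels; predictions are the scores thresholded at 0.5.
Y_true = [[1, 0, 1], [0, 1, 0]]
scores = [[0.9, 0.6, 0.4], [0.2, 0.3, 0.7]]
Y_pred = [[1 if v >= 0.5 else 0 for v in row] for row in scores]
```

HL and SA are computed from the thresholded predictions, while OE, CV, and RL depend only on the ranking induced by the scores.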
4.4 Experimental Results and Analysis
The experiments are performed on seven MVML datasets. For each algorithm, five-fold cross-validation is performed, and the six evaluation metric values are recorded. In Table 3, the experimental results are presented as mean ± standard deviation, and all optimal values are marked in bold.
It can be seen from Table 3 that MMSFI-C achieves the best performance on most of the datasets. It achieves the second-best performance on the Yeast dataset because this dataset is biological and contains more diversity information. CDMM targets datasets containing diversity information, while MMSFI-C targets datasets containing more consistency information. As shown in Table 2, all datasets except Yeast are image or audio datasets, which contain more consistency information. Therefore, MMSFI-C is superior to the other algorithms on these six datasets. This superiority reflects that MMSFI-C extracts shared features, considers the communicability between views, and assigns attention weights to the views.
To further compare the performance of the algorithms, a statistical hypothesis test is performed. First, the Friedman test [33] is conducted to determine whether the performance of the algorithms differs. Table 4 lists the results for each evaluation metric at a significance level of 0.05. It can be seen from this table that the null hypothesis of equal performance is rejected, i.e., the performance of the eight algorithms differs significantly.
Next, the Nemenyi test is conducted to compare the algorithms pairwise. At the significance level \(\alpha =0.05\), with \(k=8\) algorithms and \(N=7\) datasets, \({q}_{\alpha }=3.031\) and the critical difference \(\left({\text{CD}}\right)\) is 3.9685. The test results are shown in Fig. 4, where each subplot presents the results under one of the six evaluation metrics. Two algorithms are considered significantly different if their average ranks differ by more than the \({\text{CD}}\).
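The CD value quoted above is consistent with the usual Nemenyi formula \({\text{CD}}={q}_{\alpha }\sqrt{k\left(k+1\right)/\left(6N\right)}\). The short check below recomputes it from the stated \({q}_{\alpha }\), \(k\), and \(N\); the function name is hypothetical.

```python
import math

def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference of the Nemenyi test for k algorithms on N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Values stated in the text: q_alpha = 3.031, k = 8 algorithms, N = 7 datasets.
cd = nemenyi_cd(3.031, 8, 7)
print(round(cd, 4))
```

Because the CD grows with \(k\) and shrinks with \(N\), comparing eight algorithms on only seven datasets yields a fairly large threshold, so only large gaps in average rank count as significant.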
In each subplot, the average ranking differences between algorithms within \({\text{CD}}\) values are connected by colored solid lines, and the performance of these algorithms shows a sequential decreasing trend from left to right. Except for the SA, MMSFI-C is on the far left in the remaining subplots, indicating that MMSFI-C achieves the best performance for most of the evaluation metrics.
The following observations are obtained from the above experimental results:
(1) Compared with VLSF and CDMM, MMSFI-C produces significantly superior results on the seven datasets. This is because VLSF and CDMM only assign weights to views based on label prediction results, while MMSFI-C measures the association degree of shared features among various views and assigns attention weights based on this measurement.
(2) MMSFI-C, MMFA, SIMM, ICM2L, TM3L, and iMvWL all use a shared subspace to extract the shared features, but MMSFI-C, MMFA, and SIMM outperform the other three algorithms. This is because the shared features obtained by MMSFI-C, MMFA, and SIMM through the minimization of adversarial losses are more realistic and reliable.
(3) MMSFI-C outperforms SIMM and MMFA because it considers the inconsistency of shared features and uses the graph attention mechanism to overcome the drawbacks of the traditional method in extracting shared features.
(4) MMSFI-C performs the best in five evaluation metrics and is inferior to CDMM in SA. This is because CDMM is good at mining rare labels, while MMSFI-C is not.
(5) MMSFI-C, MMFA, CDMM, and SIMM all explore the shared features and private features among views, and they perform better than the other algorithms. It can be seen that fully exploiting the information from different views is the key to solving the MVML problem, and measuring the association degree of shared features is conducive to improving the MVML algorithm.
4.5 Component Analysis
For the MMSFI-C algorithm, a component analysis is performed to evaluate the effectiveness of the part that learns the degree of association between features shared by different views through the graph attention mechanism. In this paper, the variant obtained by removing part II in Fig. 2 is called MMSF, and it is evaluated on the seven datasets. The experimental results of MMSF are compared with those of MMSFI-C, and the comparison is shown in Fig. 5. It can be seen that MMSFI-C performs better in each evaluation metric on all datasets. This further illustrates that learning the degree of association between features shared by different views in multi-view multi-label learning can effectively improve the performance of the classifier.
4.6 Parameter Sensitivity Analysis
MMSFI-C uses two parameters, \(\lambda \) and \(\gamma \), where \(\lambda \) controls the number of shared features and \(\gamma \) controls the number of private features. To test the sensitivity of the parameters, experiments are conducted on the Emotions dataset, and the experimental results are shown in Fig. 6. The values of \(\lambda \) and \(\gamma \) are varied over the range from \({10}^{-4}\) to \({10}^{4}\). MMSFI-C performs well in all evaluation metrics when \(\lambda , \gamma \in \left[{10}^{-4},{10}^{-2}\right]\). As the value of \(\lambda \) becomes smaller, the performance decreases accordingly. This is because the graph attention weights learned by the algorithm decay in this case. When the value of \(\gamma \) is too large, the algorithm also performs poorly because the view communicability decreases when there are too many private features. Experiments on the other datasets yield similar results.
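A sensitivity study of this kind can be organized as a grid sweep over both parameters. The sketch below assumes that the grid consists of the powers of ten between \({10}^{-4}\) and \({10}^{4}\); the text states only the range, not the step. The names `log_grid`, `sweep` and `placeholder_evaluate` are hypothetical, and the placeholder does not train MMSFI-C or return any real result.

```python
import itertools


def log_grid(low_exp: int = -4, high_exp: int = 4) -> list:
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive (assumed step)."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def sweep(evaluate, values):
    """Score every (lambda, gamma) pair with a caller-supplied function."""
    return {
        (lam, gam): evaluate(lam, gam)
        for lam, gam in itertools.product(values, repeat=2)
    }


def placeholder_evaluate(lam: float, gam: float) -> float:
    # Hypothetical stand-in: replace with training and scoring the model
    # for the given (lambda, gamma). It returns a constant, not a result.
    return 0.0


values = log_grid()
results = sweep(placeholder_evaluate, values)
print(len(values), len(results))  # 9 values per parameter, 81 pairs
```

Sweeping on a logarithmic grid keeps the number of runs manageable while still covering eight orders of magnitude for each parameter.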
5 Conclusion
This paper investigates the inconsistency of features shared by different views. Previous algorithms presume that the number of associations and the association degree of shared features are the same across views, and they ignore the communicability of different views. In fact, the number and the degree of associations among shared features are not exactly the same. To this end, this paper proposes an algorithm called MMSFI-C. First, shared and private features between views are extracted. Next, the graph attention mechanism is adopted to learn the association degree of features shared by various views and to calculate the adjacency matrix and attention coefficients. The adjacency matrix is used as a mask matrix to determine the final attention weight matrix. Subsequently, the new shared features are obtained by using the attention weights to measure the association degree of the shared features from various views. Finally, the private features are combined with the new shared features for multi-label prediction.
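The masking step summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that raw attention scores and a 0/1 adjacency matrix are already available as inputs, that each node corresponds to one shared feature vector, and that weights are normalized with a row-wise softmax. The function names `masked_attention` and `aggregate` are hypothetical.

```python
import math


def masked_attention(scores, adjacency):
    """Row-wise softmax over scores, restricted to entries whose mask is 1."""
    weights = []
    for row_scores, row_mask in zip(scores, adjacency):
        kept = [s for s, m in zip(row_scores, row_mask) if m]
        if not kept:
            # No associated neighbours: the row receives all-zero weights.
            weights.append([0.0] * len(row_scores))
            continue
        top = max(kept)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def aggregate(weights, shared):
    """New shared feature i is the weighted sum of the original shared features."""
    dim = len(shared[0])
    return [
        [sum(w[j] * shared[j][d] for j in range(len(shared))) for d in range(dim)]
        for w in weights
    ]


# Illustrative inputs only (not data from the paper).
scores = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
adjacency = [[1, 1, 0], [0, 1, 0], [1, 1, 1]]
shared = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = masked_attention(scores, adjacency)
new_shared = aggregate(weights, shared)
```

In this sketch the mask decides which associations exist (a masked entry always receives weight zero), while the softmax over the remaining entries decides how strong each association is, mirroring the two roles described for the adjacency matrix and the attention weights.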
In this paper, experiments are conducted on seven MVML datasets, and seven state-of-the-art MVML algorithms are selected for comparative analysis with MMSFI-C. The results of label prediction are evaluated under six evaluation metrics, and the experimental results demonstrate the rationality and effectiveness of our algorithm.
6 Discussion
Multi-view multi-label classification finds extensive applications in fields such as multimedia data processing, bioinformatics, and recommendation systems. For example, in recommendation systems, algorithms can extract multiple interest-view features from users and perform multi-label classification on users. Although there is some correlation among the multiple interest views of users, the degree of correlation varies. Therefore, if the inconsistency among different views is ignored and their shared features are directly extracted for multi-label prediction, the accuracy of the prediction results may suffer. The MMSFI-C model is developed to address this problem.
In this paper, we examine the shortcomings of existing methods for extracting shared features (which do not take into account the inconsistency of features shared by different views) and compensate for them by utilizing the graph attention mechanism. However, this paper processes the shared features after they have been extracted and does not address the issue at the feature extraction stage itself. Future work will investigate the problem from that perspective.
Another issue to consider is that the extraction of shared and private features in previous research ignores the location information of the features, i.e., it is not stated which view these features come from, which in turn may affect the multi-label classification results. The initial idea is to construct a location information matrix to solve this problem.
Data Availability
All benchmark datasets used in this paper are publicly available and were downloaded from the Internet. The specific URLs for these datasets and the code URL for the algorithm proposed in this paper are given in Sect. 4.
References
Zhang ML, Zhou ZH (2013) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837
Yin J, Zhang WT (2022) Multi-view multi-label learning with double orders manifold preserving. Appl Intell 53:1–14
Zhang YS, Wu J, Cai ZH, Yu PS (2020) Multi-view multi-label learning with sparse feature selection for image annotation. IEEE Trans Multimed 22(11):2844–2857
Zhu CM, Wang PH, Ma L, Zhou RG, Wei L (2020) Global and local multi-view multi-label learning with incomplete views and labels. Neural Comput Appl 32(18):15007–15028
Zhu CM, Miao DQ, Wang Z, Zhou RG, Wei L, Zhang XF (2020) Global and local multi-view multi-label learning. Neurocomputing 371:67–77
Zhao DW, Gao QW, Lu YX, Sun D (2021) Two-step multi-view and multi-label learning with missing label via subspace learning. Appl Soft Comput 102:107120
Cao XC, Zhang CQ, Fu HZ, Liu S, Zhang H (2015) Diversity-induced multi-view subspace clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586–594
Tan QY, Yu GX, Wang J, Domeniconi C, Zhang XL (2019) Individuality-and commonality-based multi-view multi-label learning. IEEE Trans Cybern 51(3):1716–1727
Lin ZJ, Ding GG, Hu MQ, Wang JM (2014) Multi-label classification via feature-aware implicit label space encoding. In: Proceedings of the international conference on machine learning, pp 325–333
Liu M, Luo Y, Tao DC, Xu C, Wen YG (2015) Low-rank multi-view learning in matrix completion for multi-label image classification. In: Proceedings of the AAAI conference on artificial intelligence, pp 2778–2784
Wu X, Chen QG, Hu Y, Wang DB, Chang XD, Wang XB, Zhang ML (2019) Multi-view multi-label learning with view-specific information extraction. In: Proceedings of the international joint conference on artificial intelligence, pp 3884–3890
Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2017) Graph attention networks. Int Conf Learn Represent 1050:20
Zhu PF, Hu Q, Hu QH, Zhang CQ, Feng ZZ (2018) Multi-view label embedding. Pattern Recogn 84:126–135
Huang J, Qu XW, Li GR, Qin F, Zheng X, Huang QM (2019) Multi-view multi-label learning with view-label-specific features. IEEE Access 7:100979–100992
Zhao DW, Gao QW, Lu YX, Sun D (2022) Learning view-specific labels and label-feature dependence maximization for multi-view multi-label classification. Appl Soft Comput 124:109071
Zhao DW, Gao QW, Lu YX, Sun D, Cheng YS (2021) Consistency and diversity neural network multi-view multi-label learning. Knowl-Based Syst 218:106841
Zhu XF, Li XL, Zhang SC (2015) Block-row sparse multiview multilabel learning for image classification. IEEE Trans Cybern 46(2):450–461
Zhang CQ, Yu ZW, Hu QH, Zhu PF, Liu XW, Wang XB (2018) Latent semantic aware multi-view multi-label classification. In: Proceedings of the AAAI conference on artificial intelligence, pp 4414–4421
Zhang FW, Jia XY, Li WW (2020) Tensor-based multi-view label enhancement for multi-label learning. In: Proceedings of the international joint conference on artificial intelligence, pp 2369–2375
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser L, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:6000–6010
Lei K, Guo P, Wang Y, Wu X, Zhao WC (2022) Solve routing problems with a residual edge-graph attention neural network. Neurocomputing 508:79–98
Sun M (2023) PP-GNN: pretraining position-aware graph neural networks with the np-hard metric dimension problem. Neurocomputing 561:126848
Hu B, Guo KH, Wang XK, Zhang J, Zhou D (2021) RRL-GAT: graph attention network-driven multi-label image robust representation learning. IEEE Internet Things J 9:9167–9178
Pal A, Selvakumar M, Sankarasubbu M (2020) Multi-label text classification using attention-based graph neural network. In: Proceedings of the international joint conference on artificial intelligence, pp 494–505
Xue Z, Li GR, Huang QM (2018) Joint multi-view representation and image annotation via optimal predictive subspace learning. Inf Sci 451:180–194
Liu PF, Qiu XP, Huang XJ (2017) Adversarial multi-task learning for text classification. In: Proceedings of the meeting of the association for computational linguistics, pp 1–10
Kingma DP, Ba JA (2014) Adam: a method for stochastic optimization. In: Proceedings of the international conference on learning representations, arXiv:1412.6980, pp 273–297
Cheng YS, Li QY, Wang YB, Zheng WJ (2022) Multi-view multi-label learning with view feature attention allocation. Neurocomputing 501:857–874
Tan QY, Yu GX, Domeniconi C, Wang J, Zhang ZL (2018) Incomplete multi-view weak-label learning. In: Proceedings of the international joint conference on artificial intelligence, pp 2703–2709
Wang YB, Zheng WJ, Cheng YS, Zhao DW (2020) Joint label completion and label-specific features for multi-label learning algorithm. Soft Comput 24(9):6553–6569
Tsoumakas G, Katakis I, Vlahavas I (2009) Mining multi-label data. Data mining and knowledge discovery handbook. Springer, Berlin, pp 667–685
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Acknowledgements
This work was supported by the National Natural Science Foundation of Anhui under Grant 2108085MF216 and Key Laboratory of Data Science and Intelligence Application, Fujian Province University (NO. D202005) and the Graduate Student Academic Innovation Project of the School of Computer and Information of Anqing Normal University 2021yjsXSCX100.
Author information
Contributions
QL: Conceptualization, Software, Validation, Writing-Original draft preparation; YC: Supervision, Methodology, Writing-Original draft preparation.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Q., Cheng, Y. Multi-view Multi-label Learning with Shared Features Inconsistency. Neural Process Lett 56, 182 (2024). https://doi.org/10.1007/s11063-024-11528-w
DOI: https://doi.org/10.1007/s11063-024-11528-w