1 Introduction

Multi-label learning [1] is a fundamental framework for solving real-life problems in which an instance is associated with multiple labels. As data grow more complex and feature-extraction methods more diverse, traditional multi-label learning becomes less effective, because an instance associated with a set of class labels may also be described by multiple types of features [2,3,4]. For example, a landscape picture can be labeled "sky", "cloud", or "sunrise", while its features can be extracted both from intuitive color characteristics and from the color space of the three primary colors. Such a problem is called multi-view multi-label (MVML) learning [5].

A common approach to MVML classification is to extract shared features and private features. To extract the features shared by multiple views, Zhao et al. [6] proposed a two-step approach: the Hilbert–Schmidt Independence Criterion (HSIC) [7] was first used to extract view-consistency information during the mapping process, and MVML learning was then explored under missing labels. Tan et al. [8] developed an explicit model that explores the shared subspace using non-negative matrix factorization [9] to obtain the shared features; the method also explores the private features of different views and assigns weights to the views to enhance classification performance. To explore the consistency and complementarity among different views, an MVML algorithm with matrix completion [10] was proposed; it projects the information from different views into a low-dimensional space to learn a common representation and then learns the reconstruction weights of each view to obtain complementary information. Many existing methods map the feature vectors of each view, which have different dimensions, into a shared subspace to obtain the shared features. However, the mapping matrix differs from view to view, so it is difficult to guarantee that truly shared features are mined. To address this issue, Wu et al. [11] extracted shared features by minimizing an adversarial loss that confuses the mapping from the different views to the shared subspace; orthogonal constraints were then imposed on the extracted shared features to remove the shared components from the original features and obtain the private features.

The above methods for extracting shared features implicitly assume that the shared features have the same number and degree of association across views, where the degree of association measures the correlation of the features shared by different views. When the number and degree of association are in fact inconsistent, only suboptimal shared features can be extracted. Suboptimal shared features make the information interaction between views inaccurate, which indicates a deterioration of view communicability; we call this phenomenon the inconsistency of shared features. For instance, the Corel5k dataset contains six views: DenseHue, DenseSift, Gist, HSV, Lab, and RGB. We extracted its shared features using the method of Wu et al. [11]. The degree and number of associations between the shared features of view 1 (DenseHue) and view 2 (DenseSift) differ from those between view 1 (DenseHue) and view 3 (Gist). As shown in Fig. 1a and b, the shared features of 16 samples were randomly taken from these three views, and the degree of association was measured by cosine similarity; two shared features from different views are considered associated when their similarity exceeds 0.5. Figure 1c shows that the shared features of the 1st sample in view 1 are associated with those of the 4th and 11th samples in view 2, while Fig. 1d shows that the shared features of the 1st sample in view 1 are associated with those of the 4th, 7th, 9th, 13th, and 14th samples in view 3. The distance between them indicates the degree of association.

Fig. 1: Problem description: there is a difference in the degree of association (cosine similarity) of the shared features of different views

To solve this problem, this paper proposes a multi-view multi-label algorithm with shared-feature inconsistency (MMSFI-C), whose framework is shown in Fig. 2. In the first part, the shared and private features of the multiple views are extracted [11]. This extraction still relies on the assumption mentioned above, which produces the results in Fig. 1; we address the resulting problem with a graph attention mechanism. In the second part, the obtained shared features are input into a graph attention network (GAT). GAT [12] is a representative graph convolutional network that achieves weighted aggregation of neighboring nodes by learning their weights. MMSFI-C learns the association degree of the features shared between different views through the graph attention mechanism: the adjacency matrix and the attention coefficients are first calculated; the number of associations is then determined by using the adjacency matrix as a mask matrix; and the attention weight matrix is finally used to measure the degree of association of the shared features and to obtain new shared features. In the third part, the new shared features are concatenated with the previously extracted private features along the feature dimension. In the last part, the final multi-label prediction is performed through a fully connected layer.

Fig. 2: MMSFI-C framework

The main contributions of this paper are summarized as follows:

(1) An MVML learning method based on GAT is designed, which solves the inconsistency problem in extracting shared features.

(2) The GAT is used to learn the association degree of the features shared among different views: the adjacency matrix and attention coefficients are calculated, the number of associations is determined by using the adjacency matrix as a mask matrix, and the attention weights measure the degree of association between the shared features, yielding new shared features.

(3) Comparative experiments are performed on seven MVML datasets, and the results show that MMSFI-C achieves significantly better performance than the compared algorithms.

The remainder of this paper is organized as follows. Section 2 reviews recent work on multi-view multi-label learning and the graph attention mechanism; Sect. 3 describes the model structure and its realization; Sect. 4 reports detailed experimental results and analysis; Sect. 5 concludes the paper; and Sect. 6 discusses applications of multi-view multi-label classification and future research directions.

2 Related Work

2.1 Multi-view Multi-label Learning

The existing MVML learning algorithms fall mainly into two categories. The first class performs multi-label prediction for each view and then fuses the prediction results of all views. Zhu et al. [13] developed a label embedding method based on view relevance: they used HSIC to explore the consistency information of the views and finally combined the prediction results of each view. To explore the contribution of distinct views, Huang et al. [14] constructed classifiers for each view based on label relevance and view consistency and assigned each view a weight according to its contribution. To further fuse the information between multi-view features and labels, Zhao et al. [15] proposed a two-stage algorithm that extracts view-specific labels and then maximizes the dependencies. Because two-step models tend to be suboptimal, Zhao et al. [16] proposed an end-to-end single-hidden-layer feedforward neural network framework, which improves the multi-label classifier by fully exploiting the consistency and diversity information of the views.

The other category of algorithms converts the multi-view multi-label problem into a single-view multi-label problem by learning a common representation of the views and then performing multi-label prediction with traditional methods. Zhu et al. [5] learned label correlations at both the global and local levels, used a low-rank matrix to complete incomplete views, and finally learned a consistent representation of the views to encode their complementary information. To avoid the noise and redundancy introduced by transforming to a single view, a multi-view multi-label algorithm for image classification [17] was proposed that maps information from different views into a shared space while ensuring data sparsity. Zhang et al. [18] proposed a matrix-factorization-based algorithm that encodes the consistency and complementarity of different views to obtain a consensus multi-view representation. However, that method cannot extract shared features when the number of views is large. Therefore, Zhang et al. [19] introduced tensor factorization to obtain a shared space of higher-order relations, and label enhancement under multiple views was proposed for the first time to learn the features shared among views.

Most established multi-view multi-label methods use the obtained shared features for label prediction and assume that the shared features obtained from different views are equally important. In fact, the results in Fig. 1 show that they are not: the number and degree of association of features may vary across views, and ignoring this causes poor communication between views, which in turn harms label prediction. Therefore, we propose an MVML framework based on GAT to solve this problem.

2.2 Graph Attention Network

The graph attention network is a type of graph convolutional neural network [20]. It was proposed to achieve better neighborhood aggregation with the attention mechanism [21]: the neighborhood weights are learned and used to weight the summation of the neighboring node features. The graph attention mechanism has also been used to tackle NP-hard problems in graph theory, i.e., problems in computational complexity theory that are at least as hard as every problem in NP (nondeterministic polynomial time). Lei et al. [22] proposed a deep learning framework that combines residual networks with graph attention mechanisms and generalizes to a wide range of problems. Sun et al. [23] combined self-supervised learning with graph convolutional neural networks to learn minimum solution sets on graphs. Inspired by these works, we employ a graph attention mechanism to address the NP-hard problems that may arise in different datasets.

Currently, graph attention networks are widely used in multi-label learning. Hu et al. [24] proposed a graph-attention-based multi-label image learning method that reduces spurious connections between objects in images and avoids the influence of noise when modeling the dependencies of input objects. Pal et al. [25] used a graph attention mechanism for multi-label text classification, assigning different weights to labels according to feature–label correlations in order to select important features.

Motivated by these studies, this paper applies the graph structure to multi-view multi-label data. A graph is a special, non-Euclidean data structure with two characteristics: (1) a variable local input dimension, i.e., each node may have a different number of neighbor nodes; and (2) unordered arrangement, i.e., there are only connections between nodes, without a sequential order. Multi-view multi-label data share these two characteristics: the number of neighboring features differs across view features, and the features are correlated to varying degrees. Therefore, this paper applies the graph attention mechanism to solve the poor view communication caused by existing shared-feature extraction methods.

3 Proposed Approach

3.1 Problem Statement and Notations

\(F={\left\{{{\varvec{X}}}^{v}\right\}}_{v=1}^{h}\) denotes the original feature space with \(h\) views. \({{\varvec{X}}}^{v}={\left[{{\varvec{x}}}_{1},{{\varvec{x}}}_{2},\dots ,{{\varvec{x}}}_{n}\right]}^{{\text{T}}}\in {\mathbb{R}}^{n\times d}\) represents the feature space under the \(v\)-th view with \(n\) samples. \({\varvec{Y}}\in {\left\{0,1\right\}}^{n\times q}\) is the label space with \(q\) class labels. \({{\varvec{y}}}_{ij}=1\) indicates that the \(i\)-th sample is associated with the \(j\)-th label, and \({{\varvec{y}}}_{ij}=0\) indicates that it is not. \({\widetilde{{\varvec{y}}}}_{ij}\) denotes the predicted label output. Our task is to build a model that makes multi-label predictions for unknown examples containing multiple views.
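To make the notation concrete, the following minimal sketch instantiates these objects in PyTorch; the sizes (\(h=3\) views, \(n=4\) samples, \(q=5\) labels) are illustrative assumptions of ours, not values from the datasets used later.

```python
import torch

n, q = 4, 5
view_dims = [10, 8, 12]                     # the dimension d of each of the h = 3 views
X = [torch.randn(n, d) for d in view_dims]  # F = {X^v}_{v=1}^{h}
Y = torch.randint(0, 2, (n, q)).float()     # Y in {0,1}^{n x q}
# Y[i, j] == 1 means the i-th sample is associated with the j-th label
```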

3.2 Extraction of Shared and Private Features

When acquiring features shared by multiple views, current MVML algorithms usually map the feature vectors of different views into a shared subspace [3, 10, 18, 26]. However, the mapping matrix differs from view to view, and since the mapping processes of different views do not depend on each other, it is uncertain whether truly shared information is obtained. Moreover, the private contribution of each individual view should also be considered when making multi-label predictions. Following [11, 27], we confuse the mapping from the view features to the shared space by minimizing an adversarial loss, which prevents a discriminator from determining to which view an input shared feature belongs. The features obtained in this way contain no private components, which achieves the purpose of extracting shared features.

First, the \(k\)-dimensional shared features \({{\varvec{c}}}^{v}\) of the feature space \({{\varvec{X}}}^{v}\) under the \(v\)-th view are extracted as \(f\left({p}^{v}\left({{\varvec{X}}}^{v}\right)\right)\), where \({p}^{v}\left(\cdot \right)\) projects the feature vectors of different dimensions to \(k\) dimensions and \(f\left(\cdot \right)\) is a shared subspace extraction layer with a \(ReLU\) activation function.
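A minimal sketch of this extraction step, assuming PyTorch and a single linear layer inside \(f\left(\cdot \right)\) (the exact layer sizes are our assumptions):

```python
import torch
import torch.nn as nn

class SharedExtractor(nn.Module):
    # sketch of c^v = f(p^v(X^v)): view-specific projections p^v to k
    # dimensions, followed by a shared ReLU extraction layer f
    def __init__(self, view_dims, k):
        super().__init__()
        self.p = nn.ModuleList([nn.Linear(d, k) for d in view_dims])  # p^v
        self.f = nn.Sequential(nn.Linear(k, k), nn.ReLU())            # f

    def forward(self, X):
        # X: list of h tensors of shape (n, d_v); returns h tensors of shape (n, k)
        return [self.f(p(x)) for p, x in zip(self.p, X)]
```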

Next, the training set \(N=\left\{\left({{\varvec{c}}}_{i}^{v},{{\varvec{g}}}_{i}\right)|1\le v\le h,1\le i\le n\right\}\) is constructed, where \({{\varvec{g}}}_{i}\) is the \(h\)-dimensional view-label vector of \({{\varvec{c}}}_{i}^{v}\). The discriminator \(D\left(\cdot \right)\) outputs 1 if \({{\varvec{c}}}_{i}^{v}\) belongs to the \(v\)-th view and 0 otherwise; its output is denoted \({\widetilde{{\varvec{g}}}}_{i}=D\left({{\varvec{c}}}_{i}^{v}\right)\). The loss \({l}_{adv}\) for this part can be written as:

$${l}_{adv}=\ell \left(-\sum_{i=1}^{n}\sum_{v=1}^{h}{{\varvec{g}}}_{i}^{v}{\text{log}}{\widetilde{{\varvec{g}}}}_{i}^{v}\right)$$
(1)

where \(\ell \left(x\right)={e}^{-x}\). However, extracting shared features with the above method alone may encounter a problem, since noise that carries no semantic information can also confuse the discriminator. This paper therefore adds a shared-subspace multi-label loss \({l}_{sml}\) to ensure that the input information contains semantics. The training set \(M=\left\{\left({{\varvec{c}}}_{i}^{v},{{\varvec{y}}}_{i}\right)|1\le v\le h,1\le i\le n\right\}\) is constructed, and the output of the prediction layer \(P\) is denoted \({\widetilde{{\varvec{y}}}}_{i}^{\prime}=P\left({{\varvec{c}}}_{i}^{v}\right)\). The loss term \({l}_{sml}\) can be expressed as:

$${l}_{sml}= -{\sum }_{i=1}^{n}{\sum }_{j=1}^{q}\sum_{v=1}^{h}{y}_{ij}{\text{log}}{\widetilde{y}}_{ij}^{v{\prime}}+\left(1-{y}_{ij}\right){\text{log}}\left(1-{\widetilde{y}}_{ij}^{v{\prime}}\right)$$
(2)

The loss term \({l}_{shared}\) for shared features is expressed in Eq. (3):

$${l}_{shared}={l}_{adv}+{l}_{sml}$$
(3)

MMSFI-C removes the shared features from the original features, and the remaining features are the private features. This is accomplished by imposing orthogonal constraints on the extracted shared features. The \(k\)-dimensional private features \({{\varvec{q}}}^{v}\) of the feature space \({{\varvec{X}}}^{v}\) under the \(v\)-th view are extracted by a private-space extraction layer \(w\left(\cdot \right)\) consisting of a fully connected layer and a \(ReLU\) activation function, \({{\varvec{q}}}^{v}=w\left({{\varvec{X}}}^{v}\right)\). \({\varvec{C}}=\sum_{v=1}^{h}{{\varvec{c}}}^{v}\) is the \(k\)-dimensional feature vector that aggregates the features shared by all views. The loss term \({l}_{private}\) for the private features is expressed as:

$${l}_{private }={\Vert {{{\varvec{q}}}^{v}}^{{\text{T}}}{\varvec{C}}\Vert }_{2}^{2}$$
(4)
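The three loss terms can be sketched as follows; the tensor shapes, the summed reductions, and the summation of Eq. (4) over all views are our assumptions:

```python
import torch
import torch.nn.functional as F

def shared_private_losses(c_list, q_list, g_logits, g_true, y_logits, y_true):
    # c_list, q_list: h shared / private feature tensors of shape (n, k)
    # g_logits, g_true: discriminator outputs and one-hot view labels
    # y_logits, y_true: per-view label predictions and ground-truth labels
    ce = -(g_true * F.log_softmax(g_logits, dim=1)).sum()
    l_adv = torch.exp(-ce)  # Eq. (1): confusing the discriminator (large
                            # cross-entropy) drives this term toward zero
    l_sml = F.binary_cross_entropy_with_logits(y_logits, y_true,
                                               reduction='sum')  # Eq. (2)
    l_shared = l_adv + l_sml                                     # Eq. (3)
    C = torch.stack(c_list).sum(dim=0)                           # (n, k)
    l_private = sum((q.t() @ C).pow(2).sum() for q in q_list)    # Eq. (4)
    return l_shared, l_private
```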

3.3 Shared Features Based on Graph Attention Mechanism

For the shared features obtained in Sect. 3.2, the degree of association between the shared features of different views cannot be obtained; that is, every associated view is treated equally. However, different associated views play different roles in label prediction. To solve this problem, the graph attention mechanism is adopted in this paper.

The graph attention mechanism characterizes the representation vector of the current node by aggregating its feature vector with those of its related nodes. Instead of plain averaging, the neighborhood nodes are aggregated through a self-attention mechanism, which adaptively matches weights to different neighborhood nodes.

Inspired by this, our study characterizes the feature vectors of different views in this way to reduce the high dimensionality of multi-view multi-label data. As mentioned before, the number of associations differs between features of different views. In graph theory, the degree of a node is its number of adjacent nodes, which is analogous to the number of neighboring view features of a view feature within the multi-view shared features. Thus, this paper adopts the graph attention mechanism to aggregate the associated views and adaptively match the weights of different associated views, thereby enhancing the communication between views.

The shared features \({\varvec{C}}\) are taken as the input of the graph attention layer, and a weight matrix \({\varvec{W}}\in {\mathbb{R}}^{k\times k}\) is applied to each feature. Then, the attention coefficients between the shared features of different views are calculated using self-attention, where \(a\left(\cdot \right)\) denotes the self-attention mechanism and \({e}_{ij}\) represents the importance of view \(j\) to view \(i\):

$${e}_{ij}=a\left({\varvec{W}}{{\varvec{c}}}^{i},{\varvec{W}}{{\varvec{c}}}^{j}\right)$$
(5)

Based on \({\varvec{W}}\) and \(a\left(\cdot \right)\), the model can learn the attention coefficients between view \(i\) and view \(j\). Then, the obtained attention coefficients are non-linearized as follows:

$${e}_{ij}=LeakyReLU\left(a\left({\varvec{W}}{{\varvec{c}}}^{i},{\varvec{W}}{{\varvec{c}}}^{j}\right)\right)$$
(6)

Finally, the attention coefficients are normalized by \(softmax\) with the following expression:

$${a}_{ij}=softmax\left({e}_{ij}\right)=\frac{exp\left({e}_{ij}\right)}{\sum_{{j}^{\prime}=1}^{h}exp\left({e}_{i{j}^{\prime}}\right)}$$
(7)

To improve the generalization ability of the attention mechanism, this paper uses multi-head attention: \(H\) mutually independent self-attention layers are used, and their results are combined. The shared features \({{\varvec{s}}}^{i}\) of the \(i\)-th view of the output are denoted as:

$${{\varvec{s}}}^{i}=\sigma \left(\frac{1}{H}\sum_{m=1}^{H}\sum_{j=1}^{h}{a}_{ij}^{m}{{\varvec{W}}}^{m}{{\varvec{c}}}^{j}\right)$$
(8)
$${\varvec{A}}={\varvec{G}}{\varvec{C}}{{\varvec{C}}}^{{\text{T}}}$$
(9)
$${\varvec{G}}={\text{diag}}\left({d}_{1},{d}_{2},\cdots {,d}_{h}\right)$$
(10)
$${d}_{v}=\sum_{i=1}^{k}{c}_{i}^{v},\quad v\in \left\{1,2,\cdots ,h\right\}$$
(11)

where Eq. (9) computes the adjacency matrix \({\varvec{A}}\) of the original shared features, which is used as a mask matrix, and Eqs. (10) and (11) compute the diagonal matrix \({\varvec{G}}\) of the shared features. The mask matrix determines whether the attention coefficient \({a}_{ij}\) of the current view is retained, and \(\sigma \left(\cdot \right)\) is the activation function. \(H\) is set to the number of views \(h\) so that more view information can be aggregated. A dropout layer is used to improve the generalization ability of the model, especially on small datasets; adding dropout essentially samples the associated views at random. The new shared features of the output are denoted \({\varvec{S}}=\sum_{i=1}^{h}{{\varvec{s}}}^{i}\); they fuse both the number of associations and the degree of association of the view features.
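The masked multi-head attention step can be sketched as below. We read Eq. (8) as aggregating the neighbor features \({{\varvec{W}}}^{m}{{\varvec{c}}}^{j}\), the standard GAT form; the parameter shapes, the additive form of \(a\left(\cdot \right)\), and the use of a sigmoid for \(\sigma \left(\cdot \right)\) are our assumptions, not the authors' released code. Nodes are the \(h\) views, and the node features are the \(k\)-dimensional shared features of one sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewGAT(nn.Module):
    def __init__(self, k, heads):
        super().__init__()
        self.W = nn.Parameter(torch.randn(heads, k, k) * k ** -0.5)   # W^m
        self.a = nn.Parameter(torch.randn(heads, 2 * k) * k ** -0.5)  # a(.)

    def forward(self, c):                     # c: (h, k), one row per view
        h, k = c.shape
        G = torch.diag(c.sum(dim=1))          # d_v = sum_i c_i^v, Eqs. (10)-(11)
        A = (G @ c @ c.t()) > 0               # mask matrix A = G C C^T, Eq. (9)
        Wc = torch.einsum('mij,vj->mvi', self.W, c)               # (heads, h, k)
        left = (Wc * self.a[:, None, :k]).sum(-1)                 # (heads, h)
        right = (Wc * self.a[:, None, k:]).sum(-1)                # (heads, h)
        e = F.leaky_relu(left.unsqueeze(2) + right.unsqueeze(1))  # Eqs. (5)-(6)
        # retain attention only between associated views (assumes every
        # view is associated with at least one other view)
        e = e.masked_fill(~A, float('-inf'))
        att = F.softmax(e, dim=-1)            # a_ij, Eq. (7)
        s = torch.einsum('muv,mvi->mui', att, Wc).mean(dim=0)     # Eq. (8)
        return torch.sigmoid(s)               # sigma(.)
```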

3.4 Label Prediction

\(P\left(\cdot \right)\) is the fully connected layer used as the final label prediction layer in this paper. The dimension of the input matrix is \(n\times \left(h+1\right)k\), and that of the output matrix is \(n\times q\). As shown in Eq. (12), \({{\varvec{P}}}_{final}\) denotes the feature obtained by concatenating the shared features with the private features along the feature dimension. \({{\varvec{T}}}_{out}\) denotes the output of the prediction layer \(P\left(\cdot \right)\) applied to \({{\varvec{P}}}_{final}\). \({{\varvec{T}}}_{pre}\) in Eq. (14) is the label decision for a sample obtained through the sign function; the 1 and 0 values output by the sign function in Eq. (15) have the same meaning as in Sect. 3.1.

$${{\varvec{P}}}_{final}=\boldsymbol{ }[{{\varvec{q}}}^{1},{{\varvec{q}}}^{2},\cdots {,{\varvec{q}}}^{h},{\varvec{S}}]{\in {\mathbb{R}}}^{n\times \left(h+1\right)k}$$
(12)
$${{\varvec{T}}}_{out}=P\left({{\varvec{P}}}_{final}\right)$$
(13)
$${{\varvec{T}}}_{pre}=sign\left({{\varvec{T}}}_{out}\right)$$
(14)
$$sign\left(x\right)=\left\{\begin{array}{cc}1& \text{if}\; x-0.5>0\\ 0& \text{otherwise}\end{array}\right.$$
(15)
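A minimal sketch of this prediction step, assuming a sigmoid on the output of \(P\left(\cdot \right)\) so that the 0.5 threshold of Eq. (15) applies to values in \([0,1]\):

```python
import torch
import torch.nn as nn

def predict(q_list, S, P):
    # q_list: h private feature tensors (n, k); S: new shared features (n, k)
    P_final = torch.cat(q_list + [S], dim=1)  # (n, (h+1)k), Eq. (12)
    T_out = torch.sigmoid(P(P_final))         # (n, q), Eq. (13)
    T_pre = (T_out - 0.5 > 0).float()         # Eqs. (14)-(15)
    return T_pre

# usage: P = nn.Linear((h + 1) * k, q); T_pre = predict(q_list, S, P)
```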

3.5 MMSFI-C

In this paper, a graph-attention-based neural network framework called MMSFI-C is designed. The framework consists of two main parts: extracting the shared and private features of the multiple views, and using the graph attention mechanism to obtain new shared features that account for the inconsistency of the shared features. Finally, the obtained shared and private features are combined along the feature dimension for multi-label prediction.

The total loss function of MMSFI-C can be expressed as:

$$L={l}_{ml}+\lambda {l}_{shared}+\gamma {l}_{private}$$
(16)

where \(\lambda \) and \(\gamma \) are the weight parameters that control the loss terms, \({l}_{shared}\) denotes the loss for extracting the shared features, \({l}_{private}\) denotes the loss for extracting the private features, and \({l}_{ml}\) is the multi-label loss of the model, calculated as follows:

$${l}_{ml}=-{\sum }_{i=1}^{n}{\sum }_{j=1}^{q}{y}_{ij}{\text{log}}{\widetilde{y}}_{ij}+\left(1-{y}_{ij}\right){\text{log}}\left(1-{\widetilde{y}}_{ij}\right)$$
(17)

The Adam optimization method [28] is used to train MMSFI-C; a minimal sketch of one training step is given below, followed by an overview of the forward propagation process of the proposed algorithm.
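The sketch assumes a hypothetical module interface that returns the three loss terms of Eq. (16); `lam` and `gamma` stand for the weights \(\lambda \) and \(\gamma \).

```python
import torch

def training_step(model, optimizer, batch, lam, gamma):
    # `model` is a hypothetical module whose forward pass returns the three
    # loss terms of Eq. (16)
    l_ml, l_shared, l_private = model(batch)
    loss = l_ml + lam * l_shared + gamma * l_private  # Eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with the optimizer from the paper:
# optimizer = torch.optim.Adam(model.parameters())
```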

Algorithm 1: Multi-view multi-label learning with shared features inconsistency (MMSFI-C)

3.6 Complexity Analysis

The complexity of MMSFI-C has two main parts: feature extraction and measuring the association degree of the shared features through the graph attention mechanism. Extracting the shared features costs \(\mathcal{O}\left(hn{d}^{2}k+n{k}^{3}\right)\), and extracting the private features costs \(\mathcal{O}\left(hn{d}^{2}k+h{n}^{2}k\right)\). Measuring the association degree of the features shared by the views through the graph attention mechanism costs \(\mathcal{O}\left(h{n}^{4}{k}^{2}+h{n}^{4}k\right)\). Hence, the overall complexity is \(\mathcal{O}\left(nk\left(2h{d}^{2}+{k}^{2}+hn+h{n}^{3}k+h{n}^{3}\right)\right)\), which is dominated by the sample size \(n\). The time cost of MMSFI-C on the seven datasets described in Sect. 4.1 is shown in Table 1: the algorithm takes less than 5 s on Emotions, the dataset with the fewest samples, and about 30 min on Mirflickr, the dataset with the most samples.

Table 1 The actual runtime of the algorithm

4 Experiments

4.1 Datasets

The experiments are performed on seven MVML datasets to verify the performance of the algorithms. These datasets were sourced from the MULAN and LEAR repositories, and their details are given in Table 2. Figure 3 shows example images from the Pascal07 dataset.

Table 2 Multi-view multi-label datasets information
Fig. 3: Sample images from the Pascal07 dataset

4.2 Comparing Algorithms

To verify the performance of MMSFI-C, seven other algorithms are adopted for comparison; they are described below.

1. SIMM [11]: a neural network framework that learns the shared and private features among multiple views for multi-label prediction. Here, \(\alpha \) is set to 1 and \(\beta \in \left[{10}^{-4},{10}^{-1}\right]\).

2. MMFA [29]: this algorithm considers the importance of each view's features for label prediction in the MVML problem. \(\lambda , \gamma \in \left[{10}^{-4},{10}^{4}\right]\).

3. ICM2L [8]: it explores the individual and shared features of MVML data as well as rare labels. In this algorithm, \(\alpha \in \left[0.1, 1\right]\) and \(\beta \in \left[0.1, 2\right]\).

4. iMvWL [30]: it learns shared representations under weak labels, local label relevance, and incomplete views. In this algorithm, \(\alpha , \beta \in \left[{10}^{-5},{10}^{0}\right]\).

5. VLSF [14]: it learns label-specific view features and combines these features according to view contributions for the final prediction. In this algorithm, \({\lambda }_{1}\in \left[{10}^{-7},{10}^{-1}\right]\), \({\lambda }_{2}\in \left[{10}^{-1},{10}^{5}\right]\), \({\lambda }_{3}\in \left[{10}^{4},{10}^{6}\right]\), and \({\lambda }_{4}\in \left[{10}^{3},{10}^{8}\right]\).

6. CDMM [16]: a non-iterative neural network framework for MVML learning that explores the consistency and diversity of view data. In this algorithm, \(\alpha \in \left[{10}^{-10},{10}^{-5}\right]\), \(\lambda \in \left[{10}^{1},{10}^{7}\right]\), \(C\in \left[{10}^{-5},{10}^{5}\right]\), \(\sigma \in \left[{10}^{-2},{10}^{2}\right]\), and \(\eta \in \left[0.5, 0.8\right]\).

7. TM3L [6]: a two-step algorithm based on subspace learning that handles the missing-label problem in multi-view multi-label learning. In this algorithm, \(\alpha ,C,\lambda \in \left[{10}^{-5},{10}^{5}\right]\), and \(\beta \) is set to 1.

8. MMSFI-C: our algorithm, with \(\lambda , \gamma \in \left[{10}^{-4},{10}^{4}\right]\).

1 SIMM code: http://palm.seu.edu.cn/zhangml/

2 MMFA code: https://github.com/chengyshaq/MMFA

3 ICM2L code: http://mlda.swu.edu.cn/codes.php?name=ICM2L

4 iMvWL code: http://mlda.swu.edu.cn/codes.php?name=iMvWL

5 VLSF code: http://www.escience.cn/people/huangjun/index.html

6 CDMM code: https://github.com/chengyshaq/CDMM

7 TM3L code: https://github.com/chengyshaq/TM3L

8 MMSFI-C code: https://github.com/chengyshaq/MMSFI-C

4.3 Evaluation Metrics

This paper uses six common evaluation metrics to evaluate the performance of the above algorithms: Hamming Loss (HL), Average Precision (AP), One Error (OE), Ranking Loss (RL), Coverage (CV), and Subset Accuracy (SA). HL measures the misclassification of samples on individual labels; OE measures how often the top-ranked label is not in the set of relevant labels; CV measures the search depth required to cover all relevant labels; and RL measures ranking errors in the predicted ordering of the class labels. For these four metrics, smaller values indicate better performance, while higher values of AP and SA indicate better performance. Details of the calculation of these metrics can be found in [31, 32].
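A sketch of these metrics in Python, using the scikit-learn equivalents where they exist and computing One Error directly; the exact definitions follow [31, 32].

```python
import numpy as np
from sklearn.metrics import (accuracy_score, coverage_error, hamming_loss,
                             label_ranking_average_precision_score,
                             label_ranking_loss)

def evaluate(y_true, y_score, threshold=0.5):
    # y_true: (n, q) binary label matrix; y_score: (n, q) predicted scores
    y_pred = (y_score > threshold).astype(int)
    top = y_score.argmax(axis=1)  # top-ranked label of each sample
    one_error = float(np.mean(y_true[np.arange(len(y_true)), top] == 0))
    return {
        'HL': hamming_loss(y_true, y_pred),
        'AP': label_ranking_average_precision_score(y_true, y_score),
        'OE': one_error,
        'RL': label_ranking_loss(y_true, y_score),
        'CV': coverage_error(y_true, y_score) - 1,  # sklearn counts ranks from 1
        'SA': accuracy_score(y_true, y_pred),       # subset accuracy
    }
```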

4.4 Experimental Results and Analysis

The experiments are performed on seven MVML datasets. For each algorithm, five-fold cross-validation is performed, and the six evaluation metric values are recorded. In Table 3, the experimental results are presented as mean ± standard deviation, and all optimal values are marked in bold.

Table 3 Experimental results of the eight algorithms on seven datasets

It can be seen from Table 3 that MMSFI-C achieves the best performance on most of the datasets. It achieves only the second-best performance on the Yeast dataset because this biological dataset contains more diversity information: CDMM targets datasets containing diversity information, while MMSFI-C targets datasets containing more consistency information. As shown in Table 2, except for Yeast, the other six datasets are image and audio data, which contain more consistency information; therefore, MMSFI-C is superior to the other algorithms on these six datasets. This superiority reflects the fact that MMSFI-C extracts shared features, considers the communicability between views, and assigns attention weights to the views.

To further compare the algorithms, statistical hypothesis tests are performed. First, the Friedman test [33] is conducted to determine whether the performance of the algorithms differs. Table 4 lists the results at a significance level of 0.05 for each evaluation metric. The table shows that the performance of the eight algorithms differs significantly, so the null hypothesis is rejected.

Table 4 Summary of the Friedman statistics \({F}_{F}\) (\(k\) = 8, \(N\) = 7) and the critical value in terms of each evaluation metric (\(k\): Number of comparing algorithms; \(N\): Number of datasets)

Next, the Nemenyi test is conducted to compare the algorithms pairwise. At confidence level \(\alpha =0.05\) with \(k=8\) and \(N=7\), \({q}_{\alpha }=3.031\), which gives a critical difference \(\left({\text{CD}}\right)\) of 3.9685 by Eq. (18). The test results are shown in Fig. 4, where each subplot presents the results under one of the six evaluation metrics. Two algorithms are considered significantly different if the difference of their average ranks exceeds the CD value.

Fig. 4: The Nemenyi test result of each evaluation metric

$${\text{CD}}={q}_{\alpha }\sqrt{\frac{k\left(k+1\right)}{6N}}$$
(18)
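The quoted critical difference can be checked directly from Eq. (18):

```python
import math

q_alpha, k, N = 3.031, 8, 7                      # Nemenyi, alpha = 0.05
CD = q_alpha * math.sqrt(k * (k + 1) / (6 * N))  # Eq. (18)
print(round(CD, 4))                              # 3.9685, the value used above
```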

In each subplot, algorithms whose average ranks differ by less than the \({\text{CD}}\) value are connected by a colored solid line, and the performance of the algorithms decreases from left to right. Except for SA, MMSFI-C appears leftmost in every subplot, indicating that MMSFI-C achieves the best performance on most of the evaluation metrics.

The following observations are obtained from the above experiment results:

(1) Compared with VLSF and CDMM, MMSFI-C produces significantly better results on the seven datasets. This is because VLSF and CDMM only assign weights to views based on label prediction results, while MMSFI-C measures the association degree of the shared features among views and assigns attention weights accordingly.

(2) MMSFI-C, MMFA, SIMM, ICM2L, TM3L, and iMvWL all use a shared subspace to extract the shared features, but MMSFI-C, MMFA, and SIMM outperform the other three. This is because the shared features obtained by MMSFI-C, MMFA, and SIMM through minimizing an adversarial loss are more realistic and reliable.

(3) MMSFI-C outperforms SIMM and MMFA because it considers the inconsistency of shared features and uses the graph attention mechanism to overcome the drawbacks of the traditional extraction of shared features.

(4) MMSFI-C performs best on five evaluation metrics and is inferior to CDMM only on SA, because CDMM is good at mining rare labels while MMSFI-C is not.

(5) MMSFI-C, MMFA, CDMM, and SIMM all explore both the shared and private features among views, and they perform better than the other algorithms. This shows that fully exploiting the information of different views is key to solving the MVML problem, and that measuring the association degree of shared features helps improve MVML algorithms.

4.5 Component Analysis

For MMSFI-C, a component analysis is performed to evaluate the effectiveness of the part that learns the degree of association between the features shared by different views through the graph attention mechanism. The variant of the algorithm with part II of Fig. 2 removed is called MMSF and is evaluated on the seven datasets. Figure 5 compares the results of MMSF with those of MMSFI-C: MMSFI-C performs better on every evaluation metric on all datasets. This further illustrates that learning the degree of association between the features shared by different views effectively improves the classifier's performance in multi-view multi-label learning.

Fig. 5: Comparison of experimental results between MMSFI-C and MMSF under the six evaluation metrics

4.6 Parameter Sensitivity Analysis

MMSFI-C uses two parameters, \(\lambda \) and \(\gamma \), where \(\lambda \) controls the number of shared features and \(\gamma \) controls the number of private features. To test parameter sensitivity, experiments are conducted on the Emotions dataset with \(\lambda \) and \(\gamma \) ranging from \({10}^{-4}\) to \({10}^{4}\); the results are shown in Fig. 6. MMSFI-C performs well on all evaluation metrics when \(\lambda , \gamma \in \left[{10}^{-4},{10}^{-2}\right]\). As the value of \(\lambda \) becomes smaller, the performance decreases accordingly, because the graph attention weights learned by the algorithm decay in this case. When the value of \(\gamma \) is too large, the algorithm also performs poorly, because too many private features reduce the view communicability. Experiments on the other datasets yield similar results.

Fig. 6: Parameter sensitivity analysis on the Emotions dataset

5 Conclusion

This paper investigates the inconsistency of the features shared by different views. Previous algorithms presume that the number of associations and the association degree of features are the same between different views, thereby ignoring the communicability of different views; in fact, the number and degree of associations among shared features are not exactly the same. To this end, this paper proposes an algorithm called MMSFI-C to solve this issue. First, the shared and private features of the views are extracted. Next, the graph attention mechanism is adopted to learn the association degree of the features shared by the views, and the adjacency matrix and attention coefficients are calculated. The adjacency matrix is used as a mask matrix to determine the final attention weight matrix. Subsequently, new shared features are obtained by using the attention weights to measure the association degree of the shared features from the views. Finally, the private features are combined with the new shared features for multi-label prediction.

Experiments are conducted on seven MVML datasets, and seven state-of-the-art MVML algorithms are selected for comparison with MMSFI-C. The label prediction results are evaluated under six evaluation metrics, and the experimental results demonstrate the rationality and effectiveness of our algorithm.

6 Discussion

Multi-view multi-label classification has extensive applications in fields such as multimedia data processing, bioinformatics, and recommendation systems. For example, in a recommendation system, an algorithm can extract features from multiple interest views of users and perform multi-label classification on them. Although the multiple interest views of a user are correlated, the degrees of correlation vary. Therefore, ignoring the inconsistency among views and directly extracting their shared features for multi-label prediction can hurt the accuracy of the prediction results. The model MMSFI-C is developed to solve this problem.

In this paper, we examine a shortcoming of existing methods for extracting shared features, namely that they do not take into account the inconsistency of the features shared by different views, and we compensate for it with the graph attention mechanism. However, this paper processes the shared features after they have been extracted and does not address the issue within the extraction itself; we will investigate that perspective in future work.

Another issue to consider is that the extraction of shared and private features in previous work ignores the location information of the features, i.e., it is not stated from which view the features come, which may affect the multi-label classification results. Our initial idea is to construct a location-information matrix to solve this problem.