Introduction

In many real-world applications, such as bioinformatics and video annotation, obtaining labeled data is sometimes very difficult, expensive and time-consuming. On the other hand, it may be simple and inexpensive to obtain unlabeled data. For instance, vast numbers of videos and images are available on the web. The large amount of unlabeled data can reveal useful information about the phenomena we are studying, e.g., estimating the distribution of the data as well as the data structure [68]. As a result, Semi-Supervised Learning (SSL) is drawing increasing interest in the machine-learning community [10].

Studies on SSL are extensive (e.g. [2, 4, 12, 13, 32, 45, 51, 62, 66]); detailed reviews may be found in [65] and [42]. The common purpose of semi-supervised algorithms is to exploit both labeled and unlabeled data to create classifiers superior to those trained on labeled data alone. According to [10], self-training (also known as self-learning or self-labeling) is among the earliest approaches that use unlabeled data in classification. The idea of self-training first appeared in [41]. In self-training, a classifier is first trained only with the labeled data, and then used to predict labels for some unlabeled data. Then, the classifier is re-trained with both the ground-truth and predicted labels, and used to predict additional labels. The process repeats until all examples are labeled. The authors in [42] use the expectation-maximization (EM) algorithm [14] for SSL. Co-Training [6] is a learning paradigm for problems in which strong structural prior knowledge is available, and is regarded as a variant of EM on the probabilistic model [10, 42]. It assumes that the features can be split into two complementary and independent subsets, each of which is sufficient to train a classifier for the data. Each classifier then uses its most confidently predicted points and their labels to teach the other classifier, and this mutual teaching is iterated until a stopping criterion is met. Transductive learning is another approach, based on the idea of performing predictions only for test samples [10]; Transductive Support Vector Machines (TSVM) are one example [54]. Various extensions to the TSVM have been proposed [9, 11, 16, 60]; the common point is that the algorithms try to learn a hyperplane over the labeled data and the unlabeled data by optimizing a tradeoff between maximizing the margin over the labeled data and regularizing the decision boundary over low-density regions of all data samples.

Graph-based algorithms are an important sub-class of SSL that have recently attracted considerable attention [10, 48, 49]. Various graph-based SSL algorithms have been developed [3, 5, 25, 28, 53, 55, 56, 59, 64, 67] and a number of successful applications can be found in recent publications [1, 29, 30, 61]. Some popular graph-based algorithms include Local and Global Consistency [64], Gaussian Random Fields and Harmonic Functions [67], mincuts [5], greedy max-cut [55], and spectral graph transducers [28]. All the graph-based algorithms begin by constructing a graph with nodes representing data points, and edges representing similarity between the connected nodes. The labeled data points are then used to perform graph clustering or propagate labels from labeled points to unlabeled points, by minimizing the empirical cost over labeled data and regularizing the smoothness over the graph using all the data. Another representative SSL approach is manifold regularization [3], which assumes data points lie on a low-dimensional manifold in the input space [20, 35, 50].

At the same time, most of the above semi-supervised classification algorithms implicitly assume that class labels are mutually exclusive. However, in many application domains, such as image classification, bioinformatics and news categorization, each instance can represent more than one concept simultaneously; this is best represented as a vector of labels. Recognizing human emotions and sentiments is also increasingly treated as a multi-label classification problem, e.g., multiple fine-grained emotions may coexist in a single tweet of a microblog [21]. Multi-label classifiers have likewise been utilized recently for recognizing crop diseases in agriculture [27]. The learning algorithms for these problems are the “multi-label classifiers” reviewed in [47, 58]. For instance, a well-known multi-label classifier is the Multi-Label k Nearest Neighbors (MLkNN) [57], which is an extension of the classical kNN method. References [31, 37], and [39] study a variety of supervised multi-label algorithms and present extensive experiments to compare their performances.

Our focus in the current paper is the intersection of these two problems, to wit, the design of semi-supervised multi-label classifiers. There is relatively little work in the literature on this sub-problem, and a particular dearth of graph-based semi-supervised algorithms for the multi-label case. Existing semi-supervised multi-label algorithms include the Multi-Label Gaussian Fields and Harmonic Functions (ML-GFHF) [56], the Multi-Label Local and Global Consistency (ML-LGC) [56], the Fixed-Size Multi-Label Regularized Kernel Spectral Clustering (ML-FSKSC) [33], and the Semi-Supervised Weak-Label approach (SSWL) [18]. In spite of these results, the opportunities in this area are extensive. Better methods are needed for semi-supervised multi-label classification in many tasks.

In our previous work [29], we found that a multi-label extension of the Manifold Regularization algorithm [3] was quite effective for non-intrusive load monitoring. In the current paper, we seek to improve upon that algorithm, and determine how well our results generalize beyond that domain. We investigate a multi-label extension of the Manifold Regularization (MR) algorithm, augmented with a reliance weighting strategy to further improve classification performance. Reliance weights allow learning algorithms to differentiate between ground-truth and induced labels in constructing a classifier for a given data set. They take the form of an additional matrix term in the kernel expansion of the Laplacian Regularized Least Squares model learned in MR [3]. We evaluate our proposed algorithm in comparison with five other multi-label algorithms (four semi-supervised algorithms plus MLkNN), on a set of four benchmark data sets.

The key contributions of this work are:

  • The manifold regularization algorithm is extended to learn multi-label classifiers.

  • A weighting strategy is proposed to vary the trust placed in labeled and unlabeled instances when forecasting labels for unseen points.

  • The proposed approach is compared against four semi-supervised, and one fully supervised, multi-label algorithms, and performs as well as or better than all of them.

The advantages of the proposed method are threefold: (1) The proposed method performs as well as or better than the existing semi-supervised multi-label algorithms on the four data sets in the fifth section. It furthermore outperforms the state-of-the-art supervised multi-label algorithms (which of course are trained on fully labeled data), even when a substantial portion of the training set is unlabeled. (2) The proposed method has a low model complexity, as Manifold Regularization [3] assumes data points lie on a low-dimensional manifold in the input space. (3) The proposed reliance weighting strategy allows an analyst to specify different trust levels for ground-truth and induced labels. The disadvantage of the method mainly lies in the computational time required for the construction of the graph structure; this is a common problem in this class of algorithms.

The remainder of this paper is organized as follows: the next section presents the preliminaries, including the basics and notation, regularization in a reproducing kernel Hilbert space, and manifold regularization. The third section presents the proposed approach, including graph construction, manifold regularization with multiple labels and our reliance weighting strategy. The fourth section describes the experimental design, including the data sets, experimental setup, performance metrics and statistical significance tests. The fifth section presents our experimental results and discussion, and we offer a summary and discussion of future work in the last section.

Preliminaries

This section presents the notations and basics that are used throughout the paper, and reviews the manifold regularization algorithm.

Basics and notations

In the framework of semi-supervised learning, the data set \(\mathbb {D}\) in the training phase consists of two parts, namely \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\), where \(\mathbb {D}_l\) and \(\mathbb {D}_u\) indicate the labeled and unlabeled training data sets, respectively. Both \(\mathbb {D}_l\) and \(\mathbb {D}_u\) are drawn from the same distribution \(p(\mathbf {x})\), where \(\mathbf {x}\) indicates a feature variable. In the single label case, the feature space and label space of a data set \(\mathbb {D}\) are denoted by \(\mathcal {X}=\mathbb {R}^d\) and \(\mathcal {Y}=\{-1,1\}\), respectively. Then, the labeled and unlabeled training data sets are represented by \(\mathbb {D}_l=\{(\mathbf {x}_i,y_i):\mathbf {x}_i\in \mathcal {X},y_i\in \mathcal {Y}, i=1,2,\ldots ,l\}\) and \(\mathbb {D}_u=\{\mathbf {x}_i: \mathbf {x}_i\in \mathcal {X}, i=l+1,l+2,\ldots ,l+u\}\), where l and u indicate the numbers of labeled and unlabeled instances, respectively. Each instance is a vector \(\mathbf {x}_i=[x_{i1},x_{i2},\ldots ,x_{id}]^T\) for \(i=1,2,\ldots ,n\), where d indicates the feature dimension and \(n=l+u\) is the total number of training instances in \(\mathbb {D}\). The goal of semi-supervised learning with a single label is to infer the labels \({\tilde{Y}}=\{{\tilde{y}}_i\in \mathcal {Y},i=1,2,\ldots ,e\}\) for future instances \(\mathbb {D}_e= \{{{\tilde{\mathbf{x}}}}_i \in \mathcal {X},i=1,2,\ldots ,e\}\) given the training data set \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\) [49, 68].

In the multi-label case, the label space of \(\mathbb {D}\) is denoted by \(\mathcal {Y}=\{-1,1\}^L\), where L indicates the number of labels. Analogously, the labeled training data set becomes \(\mathbb {D}_l= \{(\mathbf {x}_i, \mathbf {y}_i): \mathbf {x}_i\in \mathcal {X},\mathbf {y}_i\in \mathcal {Y}, i=1,2,\ldots ,l\}\) and the label vector is \(\mathbf {y}_i=[y_{i1},y_{i2},\ldots ,y_{iL}]^T\), whereas the other notations remain the same as the single label case. The goal of semi-supervised learning with multiple labels is to infer the labels \( {{\tilde{\mathbf{Y}}}}=\{{{\tilde{\mathbf{y}}}}_i\in \mathcal {Y},i=1,2,\ldots ,e\}\) for \(\mathbb {D}_e=\{{{\tilde{\mathbf{x}}}}_i\in \mathcal {X},i=1,2,\ldots ,e\}\) given \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\).
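As a small illustration of the multi-label notation above, the following sketch encodes sets of active labels as vectors in \(\{-1,1\}^L\). The helper name `encode_labels` and the toy label sets are hypothetical and only serve to make the data layout concrete.

```python
import numpy as np

def encode_labels(label_sets, num_labels):
    """Map each set of active label indices to a row vector in {-1, +1}^L."""
    Y = -np.ones((len(label_sets), num_labels), dtype=int)
    for i, active in enumerate(label_sets):
        idx = np.array(sorted(active), dtype=int)  # empty set -> empty index array
        Y[i, idx] = 1
    return Y

# Three labeled instances with L = 4 labels (made-up data);
# the second instance carries none of the labels.
Y_l = encode_labels([{0, 2}, set(), {1, 2, 3}], num_labels=4)
print(Y_l)
```

Each row of `Y_l` plays the role of one label vector \(\mathbf {y}_i^T\).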

In graph-based semi-supervised learning, a crucial step is to construct a graph \(\mathcal {G}=(V,E)\) representing the connections between training instances \(\mathbf {x}_i\in \mathcal {X}\) [49, 56, 68]. Specifically, \(\mathcal {G}=(V,E)\) has n vertices \(V_i\), and each vertex \(V_i\) represents an instance \(\mathbf {x}_i,i=1,2,\ldots ,n\). \(E_{ij}\) is an edge connecting vertices \(V_i\) and \(V_j\). There are three typical methods to construct such a graph: the k nearest neighbor algorithm, the \(\varepsilon \) distance measure, and full connection. For example, using the k nearest neighbor algorithm, an edge \(E_{ij}\) connects the vertices \(V_i\) and \(V_j\) if vertex \(V_i\) is among the k nearest neighbors of vertex \(V_j\), or vertex \(V_j\) is among the k nearest neighbors of vertex \(V_i\). A weight matrix \(\mathbf {W}\) is defined over the graph \(\mathcal {G}=(V,E)\), where \(W_{ij}\) is the weight associated with edge \(E_{ij}\), representing the similarity between vertices \(V_i\) and \(V_j\) (namely the training instances \(\mathbf {x}_i\) and \(\mathbf {x}_j\)). Then, the unnormalized graph Laplacian is given by \(\mathbf {L} = \mathbf {D}-\mathbf {W}\), where \(\mathbf {D}\) is a diagonal matrix with \(D_{ii}=\sum _{j=1}^{n} W_{ij}\).
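The unnormalized Laplacian can be computed directly from a weight matrix. The sketch below uses a made-up four-vertex path graph (the weights are assumptions for illustration) and checks the standard identity \(\mathbf {f}^T\mathbf {L}\mathbf {f}=\frac{1}{2}\sum _{i,j}W_{ij}(f_i-f_j)^2\), which is why the Laplacian is commonly used to measure how smoothly a labeling varies over the graph.

```python
import numpy as np

def unnormalized_laplacian(W):
    """Return L = D - W for a symmetric, nonnegative weight matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # D_ii is the sum of row i of W
    return D - W

# Toy path graph 0 - 1 - 2 - 3 with hypothetical edge weights.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
L = unnormalized_laplacian(W)

# A labeling that changes only across the weak edge (1, 2).
f = np.array([1.0, 1.0, -1.0, -1.0])
quad = f @ L @ f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
print(quad, pairwise)
```

Both quantities agree, and the penalty is small here because the labeling only disagrees across a low-weight edge.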

The label inference in graph-based SSL is usually based on two graph assumptions [56, 68]: (1) the prediction should be close to the given labels on labeled vertices; (2) the prediction should be smooth on the whole graph (i.e., vertices that are close in the graph tend to have the same labels). The label inference algorithms for graph-based SSL can be categorized into two major classes: transductive learning (e.g., the graph Laplacian regularization [64, 67]), and inductive learning (e.g., the manifold regularization [3]). Transductive learning infers labels only on the unlabeled training data and cannot make predictions on out-of-sample data. By contrast, inductive learning infers labels for the whole domain, i.e., a function \(f:\mathcal {X}\rightarrow \mathcal {Y}\) is learned given \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\) and then the labels for \(\mathbb {D}_e\) are predicted. The work in this paper is based on the manifold regularization [3], which is a typical inductive learning method [63]. The next subsection revisits regularization in a reproducing kernel Hilbert space, which is the core of manifold regularization.

Regularization in reproducing kernel Hilbert space

For a Mercer kernel \(K:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}\), there exists an associated Reproducing Kernel Hilbert Space (RKHS) \(\mathcal {H}_K\) of functions \(\mathcal {X} \rightarrow \mathbb {R}\) with the norm \(||\cdot ||_K\) [40]. The standard supervised learning estimates an unknown function \(f\in \mathcal {H}_K\) from the labeled data set \(\mathbb {D}_l\) as

$$\begin{aligned} f^*=\mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K }\frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,y_i,f)+\gamma _A ||f||_K^2, \end{aligned}$$
(1)

where \(V(\mathbf {x}_i,y_i,f)\) is the loss function, such as the squared error loss \((y_i-f(\mathbf {x}_i))^2\) for regularized least squares (RLS). \(||f||_K^2\) is a regularization term in the RKHS imposing the smoothness condition on possible solutions. \(\gamma _A\) balances the tradeoff between the empirical cost and the regularization term. l is the number of labeled instances.
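For the squared error loss, the minimizer of Eq. (1) has a well-known closed form: writing \(f(\mathbf {x})=\sum _{i=1}^{l}\theta _i K(\mathbf {x}_i,\mathbf {x})\), the coefficients satisfy \((\mathbf {K}+\gamma _A l\,\mathbf {I})\varvec{\theta }=\mathbf {y}\). The following minimal sketch illustrates this under stated assumptions: the Gaussian kernel, the bandwidth, the value of \(\gamma _A\), and the toy data are all chosen here for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_rls(X, y, gamma_A, sigma):
    """Solve (K + gamma_A * l * I) theta = y for the expansion coefficients."""
    l = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma_A * l * np.eye(l), y)

def predict_rls(X_train, theta, X_new, sigma):
    """Evaluate f(x) = sum_i theta_i K(x_i, x) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ theta

# Two well-separated 1-D clusters with labels -1 and +1 (made-up data).
X = np.array([[0.0], [1.0], [4.0], [5.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
theta = fit_rls(X, y, gamma_A=0.01, sigma=1.0)
pred = predict_rls(X, theta, np.array([[0.5], [4.5]]), sigma=1.0)
print(np.sign(pred))
```

Larger values of \(\gamma _A\) shrink the coefficients and yield a smoother function at the cost of a larger empirical error.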

The difference between semi-supervised learning and supervised learning lies in the utilization of the marginal distribution of \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\) to improve the learning performance, in addition to the empirical cost obtained over the labeled data set \(\mathbb {D}_l\). According to the discussions in [3], there is an identifiable relation between the marginal distribution \(p(\mathbf {x})\) and the conditional distribution \(p(y|\mathbf {x})\), i.e., if two instances \(\mathbf {x}_i,\mathbf {x}_j\in \mathcal {X}\) are close in the intrinsic geometry of \(p(\mathbf {x})\), then their conditional distributions \(p(y|\mathbf {x}_i)\) and \(p(y|\mathbf {x}_j)\) are similar. Thus, another regularization term can be added to ensure that the solution is smooth with respect to the marginal distribution \(p(\mathbf {x})\). Incorporating the smoothness penalty term with respect to the graph Laplacian \(\mathbf {L}\), we derive the following optimization problem [3]:

$$\begin{aligned} f^* =\mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K }\frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,y_i,f)+\gamma _A ||f||_K^2+\frac{\gamma _I}{n^2}\mathbf {f}^T\mathbf {L}\mathbf {f}, \end{aligned}$$
(2)

where \(\mathbf {f}=[f(\mathbf {x}_1),f(\mathbf {x}_2),\cdots ,f(\mathbf {x}_n)]^T\), and \(\mathbf {f}^T\mathbf {L}\mathbf {f}\) is a penalty term that reflects the intrinsic structure of the probability distribution \(p(\mathbf {x})\). \(n=u+l\) is the total number of instances. The normalizing coefficient \(\frac{1}{n^2}\) is the natural scale factor for the empirical estimate of the Laplace operator. The coefficients \(\gamma _A\) and \(\gamma _I\) control the complexity of the function in the ambient space and in the intrinsic geometry of \(p(\mathbf {x})\), respectively. In real-world data sets, \(p(\mathbf {x})\) is unknown, but an empirical estimate can be obtained from a sufficiently large amount of unlabeled data \(\mathbb {D}_u\) by assuming the data set lies on a manifold in \(\mathbb {R}^d\) and modeling the manifold with the adjacency graph \(\mathcal {G}=(V,E)\) from the data set \(\mathbb {D}\). According to the classical Representer Theorem [40], the solution to Eq. (2) in \(\mathcal {H}_K\) is given by Ref. [3]

$$\begin{aligned} f^*(\mathbf {x})=\sum _{i=1}^{l+u}\theta _i K(\mathbf {x}_i,\mathbf {x}), \end{aligned}$$
(3)

which is an expansion of the Representer Theorem in terms of labeled data and unlabeled data \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\). Accordingly, the problem is essentially an optimization problem over the space of coefficients \(\theta _i\).

The RKHS has been extended to vector-valued functions [8] to formulate the vector-valued manifold regularization [35]. Let \(\mathbf {F} = (f_1(\mathbf {x}_1),\cdots ,f_n(\mathbf {x}_n)) \in \mathcal {Y}^n\) be components of a vector-valued function where each \(f_i \in \mathcal {H}_K\) [35]. Here \(\mathcal {Y}\) can be \(\mathbb {R}\) for the single label case or \(\mathbb {R}^L\) for the multi-label case. The optimization problem of the vector-valued manifold regularization is given by Ref. [35]

$$\begin{aligned} f^*= & {} \mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K } \frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,\mathbf {y}_i,f) + \gamma _A ||f||_K^2 \nonumber \\&+ \gamma _I <\mathbf {F},M\mathbf {F}>_{\mathcal {Y}^n}, \end{aligned}$$
(4)

where the matrix M is a symmetric, positive operator, such that \(<y,My>_{\mathcal {Y}^n}\ge 0\) for all \(y\in \mathcal {Y}^n\). \(\mathcal {Y}^n\) is the n-direct product of \(\mathcal {Y}\), with the inner product

$$\begin{aligned}<(y_1,\cdots ,y_n),(w_1,\ldots ,w_n)>_{\mathcal {Y}^n} = \sum _{i=1}^n<y_i,w_i>_{\mathcal {Y}}. \end{aligned}$$

It has been proved in [35] that the minimization problem in (4) has a unique solution taking the form \(f^*(\mathbf {x})=\sum _{i=1}^{l+u}K(\mathbf {x}_i,\mathbf {x})\varvec{\Theta }_i\) for some vectors \(\varvec{\Theta }_i\in \mathcal {Y}, 1\le i \le n\). The vector-valued manifold regularization is a generalized form of manifold regularization, and can be used for single label, multi-label, and multi-view learning [35, 36].

The Representer Theorem in the vector-valued RKHS is given and proved in [35]. Let \(\mathcal {H}_{K,\mathbf {x}} = \{\sum _{i=1}^{u+l}K(\mathbf {x}_i,\mathbf {x})y_i, \mathbf {y}\in \mathcal {Y}^{u+l}\}\). For \(f\in \mathcal {H}_{K,\mathbf {x}}^\bot \), the sampling operator \(S_{\mathbf {x}}\) satisfies \(<S_\mathbf {x}f, \mathbf {y}>_{\mathcal {Y}^{u+l}} = <f,\sum _{i=1}^{u+l}K(\mathbf {x}_i,\mathbf {x})y_i>_{\mathcal {H}_K}=0\). This holds true for all \(\mathbf {y} \in \mathcal {Y}^{u+l}\) and yields \(S_\mathbf {x}f=(f(\mathbf {x}_1),\ldots , f(\mathbf {x}_{u+l}))=0\). Denote the objective functional minimized in (4) by I(f). An arbitrary \(f\in \mathcal {H}_K\) can be decomposed orthogonally as \(f=f_0+f_1\), with \(f_0\in \mathcal {H}_{K,\mathbf {x}}\) and \(f_1 \in \mathcal {H}_{K,\mathbf {x}}^\bot \). This results in \(I(f)=I(f_0+f_1)\ge I(f_0)\) with equality if and only if \(||f_1||_{\mathcal {H}_K}=0\), since \(f_1\) vanishes at all sample points and \(||f_0+f_1||_{\mathcal {H}_K}^2=||f_0||_{\mathcal {H}_K}^2+||f_1||_{\mathcal {H}_K}^2\). As a result, the minimizer of (4) must lie in \(\mathcal {H}_{K,\mathbf {x}}\).

The proposed method

The work in [3] originally proposed manifold regularization and showed, via the Representer Theorem, that the minimizer of Laplacian RLS admits a finite kernel expansion in the single-output case; further, reference [35] proved the Representer Theorem for the general case of vector-valued manifold regularization. Following these two fundamental theoretical works, this work on multi-label manifold regularization is essentially an important special case of the theorem in [35]. In the existing literature, there is no study of this special case; in particular, no simpler proof has been advanced that the kernel expansion in Eq. (3) remains a solution to the Laplacian RLS minimization. We are following a long tradition in mathematics where simpler proofs for interesting special cases remain valuable, even if the general case has been proven. For instance, Dirichlet’s theorem was first proved in [17] in the 19th century. Nonetheless, studies of special cases of Dirichlet’s theorem, especially those having elementary proofs (e.g., [24, 38, 43]), continue to this day [34]. Analogously, studying the multi-label classification case of MR also seems an interesting and novel contribution. We also introduce the reliance weighting strategy, and prove that our modified algorithm remains a solution to the Laplacian RLS problem. The major challenges include: (1) formulating the optimization problem of manifold regularization with multiple labels, given that the data structure differs from that of single-label data, (2) solving the optimization problem so that a unique global solution is guaranteed to exist, and (3) deriving the solution when a reliance weight matrix is included.

Graph construction

Given the whole data set \(\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u\), a full \(n\times n\) similarity matrix \(\mathbf {U}\) is calculated between each pair of instances \(\mathbf {x}_i, \mathbf {x}_j\in \mathcal {X}\) based on a Gaussian kernel \(K(\mathbf {x}_i, \mathbf {x}_j)\) as

$$\begin{aligned} U_{ij} = K(\mathbf {x}_i, \mathbf {x}_j) = \exp \left( -\frac{|| \mathbf {x}_i -\mathbf {x}_j ||^2}{2\sigma ^2} \right) , \end{aligned}$$
(5)

where \(\sigma \) denotes the bandwidth of the Gaussian kernel. From \(\mathbf {U}\), a kernel-induced distance matrix \(\mathbf {H}\) can be calculated, with each element \(H_{ij}\) given by Refs. [26, 55]

$$\begin{aligned} H_{ij}=\sqrt{U_{ii}+U_{jj}-2U_{ij}}. \end{aligned}$$
(6)
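Equations (5) and (6) can be sketched in a few lines of NumPy. The function names and the three toy points are hypothetical. Because \(U_{ii}=1\) for a Gaussian kernel, Eq. (6) reduces to \(H_{ij}=\sqrt{2-2U_{ij}}\), which the example prints as a check.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Eq. (5): U_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def kernel_distance(U):
    """Eq. (6): H_ij = sqrt(U_ii + U_jj - 2 U_ij)."""
    diag = np.diag(U)
    sq = diag[:, None] + diag[None, :] - 2.0 * U
    return np.sqrt(np.maximum(sq, 0.0))

# Three made-up points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
U = gaussian_similarity(X, sigma=1.0)
H = kernel_distance(U)
print(H[0, 1], np.sqrt(2.0 - 2.0 * U[0, 1]))
```

Note that \(\mathbf {H}\) is bounded above by \(\sqrt{2}\), so any distance threshold used for sparsification has to be chosen on that scale.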

The constructed graph \(\mathcal {G}=(V,E)\) is a fully connected graph with each edge \(E_{ij}\) weighted by \(H_{ij}\). According to [26, 55], graph sparsification can improve the efficiency of label inference. Edges are removed, producing an \(n\times n\) binary matrix \(\mathbf {B}\) with 1’s and 0’s representing the presence and absence of connections, respectively. Three sparsification approaches can be used: the \(\varepsilon \)-neighbor search, the k-nearest neighbor search, and b-matching [26, 55]:

  1. The \(\varepsilon \)-neighbor search recovers a binary matrix \(\mathbf {B}\) as

    $$\begin{aligned} B_{ij} = \left\{ \begin{array}{ccl} 1 &{} \text{ if } &{} 1-H_{ij}\le \varepsilon \\ 0 &{} \text{ if } &{} 1-H_{ij}> \varepsilon \text{ or } i=j \end{array}\right. . \end{aligned}$$
    (7)
  2. The k-nearest neighbor search obtains the binary matrix \(\mathbf {B}\) by solving the following optimization problem:

    $$\begin{aligned} \begin{aligned}&\min _{\mathbf {B}\in \{0,1\}^{n\times n}} \sum _{i=1}^{n}\sum _{j=1}^{n} B_{ij} H_{ij} \\&\text {s.t. } \sum _{j=1}^{n}B_{ij}=k,B_{ii}=0,\forall i,j=1,\ldots ,n. \end{aligned} \end{aligned}$$
    (8)
  3. Using the b-matching algorithm, the optimization problem to recover \(\mathbf {B}\) is

    $$\begin{aligned} \begin{aligned}&\min _{\mathbf {B}\in \{0,1\}^{n\times n}} \sum _{i=1}^{n}\sum _{j=1}^{n} B_{ij} H_{ij} \\&\text {s.t. } \sum _{j=1}^{n}B_{ij}=b,B_{ii}=0,B_{ij}=B_{ji},\forall i,j=1,\ldots ,n. \end{aligned} \end{aligned}$$
    (9)

The binary matrix \(\mathbf {B}\) obtained using the k-nearest neighbor search is not necessarily symmetric; thus the final \(\mathbf {B}\) can be calculated as \(B_{ij}=\max (B_{ij},B_{ji})\). By contrast, the b-matching algorithm produces a graph in which every node has the same number of neighbors, and the resulting \(\mathbf {B}\) is symmetric, i.e., \(\mathbf {B}=\mathbf {B}^T\). Whichever of the above methods is applied, the weight for edge \(E_{ij}\) is set to 0 if \(B_{ij}=0\). For an edge \(E_{ij}\) with \(B_{ij}=1\), the weight \(W_{ij}\) can be calculated with respect to the distance matrix \(\mathbf {H}\) and expressed as

$$\begin{aligned} W_{ij}=H_{ij}B_{ij}. \end{aligned}$$
(10)

The final graph \(\mathcal {G}=(V,E)\) is then constructed and represented by a sparse weight matrix \(\mathbf {W}\). Proceeding to label inference, the graph Laplacian is calculated as \(\mathbf {L} = \mathbf {D}-\mathbf {W}\), where \(\mathbf {D}\) is a diagonal matrix with \(D_{ii}=\sum _{j=1}^{n} W_{ij}\) and \(D_{ij}=0\) for \(i\ne j\).
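The k-nearest neighbor route through Eqs. (8) and (10) can be sketched as follows. Since the constraint in Eq. (8) acts on each row separately, the problem is solved by keeping the k smallest off-diagonal distances in every row. The function names are hypothetical, and the toy distance matrix is made up for illustration rather than derived from Eq. (6).

```python
import numpy as np

def knn_sparsify(H, k):
    """Eq. (8) followed by symmetrization B_ij = max(B_ij, B_ji)."""
    n = H.shape[0]
    H_off = H.astype(float).copy()
    np.fill_diagonal(H_off, np.inf)             # enforce B_ii = 0
    nearest = np.argsort(H_off, axis=1)[:, :k]  # k smallest distances per row
    B = np.zeros((n, n), dtype=int)
    B[np.arange(n)[:, None], nearest] = 1
    return np.maximum(B, B.T)

def weights_and_laplacian(H, B):
    """Eq. (10): W = H * B elementwise, then L = D - W."""
    W = H * B
    D = np.diag(W.sum(axis=1))
    return W, D - W

# Toy distances between four points on a line at 0, 1, 3 and 7.
points = np.array([0.0, 1.0, 3.0, 7.0])
H = np.abs(points[:, None] - points[None, :])
B = knn_sparsify(H, k=1)
W, L = weights_and_laplacian(H, B)
print(B)
```

After symmetrization a vertex may have more than k neighbors, as vertices 1 and 2 do here; this is the behavior that b-matching avoids.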

Manifold regularization with multiple labels

In this subsection, we extend the manifold regularization in [3] to solve multi-label learning problems. Let \(\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\ldots , \mathbf {x}_n]^T\) and \(\mathbf {Y}=[\mathbf {y}_1,\mathbf {y}_2,\ldots ,\mathbf {y}_n]^T\) denote the matrices of all feature vectors and all label vectors, respectively. In \(\mathbf {Y}\), the elements of \(\mathbf {y}_i\) take the value 1 or \(-1\) for \(i\le l\), and \(\mathbf {y}_i\) is an all-zero vector for \(l<i\le n\). In the framework of the Laplacian Regularized Least Squares (LapRLS) [3], the optimization problem of manifold regularization with multiple labels is

$$\begin{aligned} f^*= & {} \mathop {\mathrm{arg\,min}}\limits _{f_j\in \mathcal {H}_K,j=1,\ldots ,L } \frac{1}{l} {\text {tr}} \left( (\varvec{\Psi } \mathbf {F}-\mathbf {Y})^T (\varvec{\Psi } \mathbf {F}-\mathbf {Y}) \right) \nonumber \\&+ \gamma _A ||f||_K^2 + \frac{\gamma _I}{n^2} {\text {tr}} \left( \mathbf {F}^T\mathbf {L}\mathbf {F} \right) , \end{aligned}$$
(11)

where \(\mathbf {F}=[f_j(\mathbf {x}_i)]_{n\times L}, i=1,\ldots ,n, j=1,\ldots , L\) is a matrix representing the predicted outputs, \({\text {tr}}(\cdot )\) denotes the trace of a matrix, and \(\varvec{\Psi }\) is an \(n\times n\) diagonal matrix with the diagonal elements given by

$$\begin{aligned} \Psi _{ii}=\left\{ \begin{array}{ccl} 1 &{} \text{ for } &{} i \le l, \\ 0 &{} \text{ for } &{} l < i \le n. \end{array}\right. \end{aligned}$$
(12)

The second term \(||f||_K^2 = \sum _{j=1}^{L}||f_j||_K^2\) in Eq. (11) measures the complexity of \(\mathbf {F}\) in the ambient space. The third term represents the intrinsic smoothness with respect to the geometric distribution. \(\mathbf {L}\) is the graph Laplacian obtained in the graph construction phase. The optimization problem in (11) is essentially one natural extension of the LapRLS for multi-label cases as indicated in [35].

The minimization problem in Eq. (11) is guaranteed to have a unique global solution. The form of this solution is stated and proved in the following theorem.

Theorem 1

The minimizer of optimization problem in Eq. (11) admits an expansion

$$\begin{aligned} f_j^*(\mathbf {x}) = \sum _{i=1}^{n} \Theta _{ij}K(\mathbf {x}_i,\mathbf {x}), j=1,2,\ldots , L \end{aligned}$$
(13)

in terms of the labeled and unlabeled instances; \(K(\cdot ,\cdot )\) represents the kernel function, which must be positive semi-definite.

Proof

In the multi-label classification problem (11), the squared norm of the function f can be represented by the sum of the squared norms of the functions \(f_j\) in the Reproducing Kernel Hilbert Space \(\mathcal {H}_K\), i.e., \(||f||_K^2 = \sum _{j=1}^{L}||f_j||_K^2\).

Any function in the RKHS \(\mathcal {H}_K\) can be decomposed into two orthogonal components; specifically, each \(f_j\) can be decomposed into a function \(f_j^0\) in the linear subspace spanned by \(\{ K(x_i,\cdot )\}_{i=1}^{n}\) and a function \(f_j^1\) orthogonal to that subspace [3]. Accordingly, \(f_j\) can be represented by

$$\begin{aligned} f_j = f_j^0 + f_j^1 = \sum _{i=1}^{n} \Theta _{ij}K(x_i,\cdot ) + f_j^1, \end{aligned}$$

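where \(\Theta _{ij}\) are real coefficients. By the reproducing property, \(f_j^1(\mathbf {x}_i)=<f_j^1,K(\mathbf {x}_i,\cdot )>_K=0\) for \(i=1,\ldots ,n\), so the matrix \(\mathbf {F}\), and hence the first and third terms of Eq. (11), depend only on the components \(f_j^0\). Since \(||f_j||_K^2=||f_j^0||_K^2+||f_j^1||_K^2\ge ||f_j^0||_K^2\), we have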

$$\begin{aligned} ||f||_K^2&= \sum _{j=1}^{L}||f_j||_K^2 = \sum _{j=1}^{L}||f_j^0||_K^2\\&\quad + \sum _{j=1}^{L}||f_j^1||_K^2\ge \sum _{j=1}^{L}||f_j^0||_K^2 \end{aligned}$$

The equality is achieved if and only if \(||f_j^1||_K^2=0\), \(j=1,2,\ldots , L\). Therefore the minimizer must be \(f_j^*(\mathbf {x}) = \sum _{i=1}^{n} \Theta _{ij}K(\mathbf {x}_i,\mathbf {x})\), \(j=1,2,\ldots , L\). \(\square \)

Let \(\mathbf {K}\) denote the \(n\times n\) kernel matrix evaluated on all the data samples \(\mathbf {X}\), and \(\varvec{\Theta }\) the \(n\times L\) matrix of coefficients. The solution can be represented by

$$\begin{aligned} \mathbf {F}^* = \mathbf {K}\varvec{\Theta }. \end{aligned}$$
(14)

Therefore, the problem in Eq. (11) is reduced to optimizing over the finite-dimensional space of coefficients \(\varvec{\Theta }\). According to [3], the kernel function \(K(\cdot ,\cdot )\) must be positive semi-definite, which gives rise to an RKHS. One choice of kernel function is the heat kernel, which can be approximated using a sharp Gaussian kernel. Thus, \(\mathbf {U}\) in Eq. (5) can be taken as the kernel matrix \(\mathbf {K}\).
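To make the finite-dimensional problem concrete, the sketch below evaluates the objective of Eq. (11) for a given coefficient matrix, using \(\mathbf {F}=\mathbf {K}\varvec{\Theta }\) and the standard identity \(\sum _{j}||f_j||_K^2={\text {tr}}(\varvec{\Theta }^T\mathbf {K}\varvec{\Theta })\) for functions in the span of the kernel sections. The function name and the toy inputs are assumptions; in particular, the identity matrix stands in for \(\mathbf {K}\) only so that the arithmetic can be checked by hand.

```python
import numpy as np

def ml_mr_objective(Theta, K, L, Y, l, gamma_A, gamma_I):
    """Value of the objective in Eq. (11) with F = K @ Theta."""
    n = K.shape[0]
    Psi = np.diag((np.arange(n) < l).astype(float))  # Eq. (12)
    F = K @ Theta                                    # Eq. (14)
    R = Psi @ F - Y
    fit = np.trace(R.T @ R) / l                      # empirical cost on labeled rows
    ambient = gamma_A * np.trace(Theta.T @ K @ Theta)
    intrinsic = gamma_I / n**2 * np.trace(F.T @ L @ F)
    return fit + ambient + intrinsic

# Toy problem: n = 3 instances (l = 2 labeled), two labels, path-graph Laplacian.
K = np.eye(3)                       # placeholder kernel matrix (assumption)
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
Y = np.array([[1.0, -1.0],
              [-1.0, -1.0],
              [0.0, 0.0]])          # the unlabeled row is all zeros
value = ml_mr_objective(Y.copy(), K, L, Y, l=2, gamma_A=0.5, gamma_I=3.0)
print(value)
```

With these inputs the empirical cost vanishes, the ambient term equals \(0.5\cdot 4=2\), and the intrinsic term equals \(\frac{3}{9}\cdot 6=2\).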

Reliance weighted kernel for performance improvement

In the framework of manifold regularization, the classifier is trained using both the labeled training set \(\mathbb {D}_l\) and the unlabeled training set \(\mathbb {D}_u\). Although both \(\mathbb {D}_l\) and \(\mathbb {D}_u\) contribute to the classification, the prediction of the label vector \({{\tilde{\mathbf{y}}}}\) of an unseen future sample \({{\tilde{\mathbf{x}}}}\) is based on the label information provided by the labeled training set \(\mathbb {D}_l\). This motivates us to place more trust in the labeled training set than in the unlabeled one for out-of-sample prediction. Thus, a reliance weighting strategy is proposed that assigns different weights to the training instances, allowing samples from \(\mathbb {D}_l\) to have greater influence than those from \(\mathbb {D}_u\). Given a heat kernel function \(K(\mathbf {x}_i,\mathbf {x})\), the weighted kernel function for \(\mathbf {x}\) is

$$\begin{aligned} {\tilde{K}}(\mathbf {x}_i,\mathbf {x}) = K(\mathbf {x}_i,\mathbf {x}) \cdot \varXi _{i}, \end{aligned}$$
(15)

where \(\varXi _{i}\) represents the reliance weight of the ith instance. Let \({{\tilde{\mathbf{K}}}}\) denote the weighted kernel matrix evaluated on all the data samples \(\mathbf {X}\), and define the reliance weight matrix \(\varvec{\varXi }\) as

$$\begin{aligned} \varvec{\varXi } = \left[ \begin{array}{cccc} \varXi _{1} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \varXi _{2} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \varXi _{n}\\ \end{array} \right] \end{aligned}$$
(16)

Then, the weighted kernel matrix is \({{\tilde{\mathbf{K}}}}=\mathbf {K}\varvec{\varXi }\). For the expansion of the minimizer in (13) to remain valid, the kernel function \({\tilde{K}}(\cdot ,\cdot )\) must be positive semi-definite.

Proposition 1

Given a heat kernel function \(K(\cdot ,\cdot )\), the weighted kernel \({\tilde{K}}(\cdot ,\cdot )=K(\cdot ,\cdot ) \cdot \varXi _{i}\) is positive semi-definite if and only if \(\varXi _{i}\ge 0\).

Proof

Given an arbitrary vector \(\mathbf {v}\in \mathbb {R}^n\), we have

$$\begin{aligned} \mathbf {v}^T{{\tilde{\mathbf{K}}}}\mathbf {v} = \sum _{i=1}^{n}\sum _{j=1}^{n} K(\mathbf {x}_i,\mathbf {x}_j) \cdot \varXi _{i} \cdot v_i v_j. \end{aligned}$$
(17)

where \(v_i\) and \(v_j\) are the ith and jth elements of \(\mathbf {v}\). The kernel estimation based on a heat kernel function is always nonnegative, namely \(K(\mathbf {x}_i,\mathbf {x}_j)\ge 0\). Therefore, \(K(\mathbf {x}_i,\mathbf {x}_j) \cdot \varXi _{i}\ge 0\) if and only if \(\varXi _{i}\ge 0\). Accordingly, \(\mathbf {v}^T{{\tilde{\mathbf{K}}}}\mathbf {v} \ge 0\) if and only if \(\varXi _{i}\ge 0\). In conclusion, the weighted kernel \({\tilde{K}}(\cdot ,\cdot )=K(\cdot ,\cdot ) \cdot \varXi _{i}\) is positive semi-definite if and only if \(\varXi _{i}\ge 0\). \(\square \)

Using the reliance weighted kernel function instead of the heat kernel function, the solution in (14) becomes

$$\begin{aligned} \mathbf {F}^*= {{\tilde{\mathbf{K}}}}\varvec{\Theta } = \mathbf {K}\varvec{\varXi }\varvec{\Theta }. \end{aligned}$$
(18)

The coefficient matrix \(\varvec{\Theta }^*\) can be estimated by differentiating the right hand side of (11) as

$$\begin{aligned}&\frac{2}{l}\varvec{\Psi } \mathbf {K}\varvec{\varXi }(\varvec{\Psi } \mathbf {K}\varvec{\varXi } \varvec{\Theta }^*-\mathbf {Y}) + 2\gamma _A \mathbf {K}\varvec{\varXi } \varvec{\Theta }^*\\&\qquad + \frac{2\gamma _I}{n^2} (\mathbf {K}\varvec{\varXi })^T \mathbf {L} \mathbf {K}\varvec{\varXi } \varvec{\Theta }^* = 0 \end{aligned}$$

The coefficient matrix is eventually obtained as

$$\begin{aligned} \varvec{\Theta }^* = \left( \varvec{\Psi }\mathbf {K}\varvec{\varXi } + l\gamma _A\mathbf {I} + \frac{l\gamma _I}{n^2} \mathbf {L} \mathbf {K} \varvec{\varXi } \right) ^{-1} \mathbf {Y}. \end{aligned}$$
(19)

where \(\mathbf {I}\) is an \(n\times n\) identity matrix.

For unseen future samples \({{\tilde{\mathbf{X}}}}=[{{\tilde{\mathbf{x}}}}_1,{{\tilde{\mathbf{x}}}}_2,\ldots , {{\tilde{\mathbf{x}}}}_e]^T\) in \(\mathbb {D}_e\), the labels are predicted as follows. First, an \(e\times n\) kernel matrix \(\mathbf {K}_e\) is calculated using Eq. (5), i.e., \((\mathbf {K}_e)_{ij} = K({{\tilde{\mathbf{x}}}}_i, \mathbf {x}_j)\) for \(i=1,2,\ldots ,e\) and \(j=1,2,\ldots ,n\). Next, the output \({{\tilde{\mathbf{F}}}}\) for \({{\tilde{\mathbf{X}}}}\) can be calculated as

$$\begin{aligned} {{\tilde{\mathbf{F}}}} = \mathbf {K}_e\varvec{\varXi }\varvec{\Theta }^*. \end{aligned}$$
(20)

Eventually, the label matrix \({{\tilde{\mathbf{Y}}}}\) of \({{\tilde{\mathbf{X}}}}\) is obtained by comparing each element of \({{\tilde{\mathbf{F}}}}\) with 0. We will henceforth refer to our multi-label extension of MR as Multi-Label Manifold Regularization (ML-MR), and our reliance weighting augmentation as ML-MR with Reliance Weighting (ML-MRRW).
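As a rough illustration of Eqs. (18)–(20), the following NumPy sketch solves for \(\varvec{\Theta }^*\) and labels unseen samples. Several ingredients are assumptions, because they are defined outside this section: the heat kernel is taken as \(\exp (-\Vert \mathbf {x}-\mathbf {z}\Vert ^2/(2\sigma ^2))\) with a hypothetical bandwidth `sigma`, the graph Laplacian is the unnormalised \(\mathbf {L}=\mathbf {D}-\mathbf {W}\) of a fully connected heat-kernel graph, \(\varvec{\Psi }\) is treated as the diagonal indicator matrix of labeled instances, and \(\mathbf {Y}\) holds \(\pm 1\) in labeled rows and zeros in unlabeled rows. All function names are illustrative.

```python
import numpy as np


def heat_kernel(A, B, sigma=1.0):
    """Assumed heat kernel exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_theta(K, xi, labeled_mask, L, Y, gamma_A, gamma_I):
    """Solve Eq. (19) for Theta*.

    K: (n, n) kernel matrix; xi: (n,) reliance weights (diagonal of Xi);
    labeled_mask: (n,) booleans, True for labeled rows (assumed diagonal of Psi);
    L: (n, n) graph Laplacian; Y: (n, m) targets with zero rows for unlabeled data.
    """
    n = K.shape[0]
    l = int(labeled_mask.sum())                  # number of labeled instances
    K_w = K * xi[None, :]                        # K @ diag(xi), the weighted kernel matrix
    Psi = np.diag(labeled_mask.astype(float))
    A = Psi @ K_w + l * gamma_A * np.eye(n) + (l * gamma_I / n ** 2) * (L @ K_w)
    return np.linalg.solve(A, Y)


def predict(K_e, xi, Theta):
    """Eq. (20) followed by thresholding at 0; returns labels in {-1, +1}."""
    F = (K_e * xi[None, :]) @ Theta
    return np.where(F > 0, 1, -1)


# Toy usage: two labeled and two unlabeled 1-D points, one label.
X = np.array([[0.0], [10.0], [0.2], [9.8]])
labeled_mask = np.array([True, True, False, False])
Y = np.array([[1.0], [-1.0], [0.0], [0.0]])
xi = np.array([1.0, 1.0, 0.1, 0.1])              # [nu_1, nu_2] = [1, 0.1]

K = heat_kernel(X, X)
W = K - np.diag(np.diag(K))                      # assumed graph weights, no self-loops
L = np.diag(W.sum(axis=1)) - W                   # unnormalised Laplacian D - W
Theta = fit_theta(K, xi, labeled_mask, L, Y, gamma_A=0.01, gamma_I=0.1)
labels = predict(heat_kernel(np.array([[0.5], [9.5]]), X), xi, Theta)
```

Using `np.linalg.solve` rather than forming the inverse in Eq. (19) explicitly is a numerical convenience; the two are equivalent whenever the matrix is invertible.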

There are clearly many strategies for determining reliance weights. The simplest strategy is to assign uniform weights, namely \(\varXi _{i}=\nu _1\in [0,1], 1\le i \le l\) and \(\varXi _{i}=\nu _2\in [0,1], l< i \le l+u\) for all labeled and unlabeled training instances, respectively. These two parameters then decide the balance of trust between labeled and unlabeled training data. The extended manifold regularization is supervised if \(\nu _1=1\) and \(\nu _2=0\) are used, and is unsupervised for the choice of \(\nu _1=0\) and \(\nu _2=1\). The relation \(\nu _1=\nu _2\) indicates that the impacts of \(\mathbb {D}_l\) and \(\mathbb {D}_u\) on label inference are equal, whereas \(\nu _1>\nu _2\) indicates that more weight is put on the labeled instances \(\mathbb {D}_l\) than on the unlabeled instances \(\mathbb {D}_u\). In this work, we aim to improve the performance of manifold regularization by trusting labeled instances more, and thus the choices of \(\nu _1\) and \(\nu _2\) must follow two criteria, namely \(\nu _1=1\) and \(\nu _1>\nu _2>0\).
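The uniform weighting strategy can be sketched in a few lines of plain Python. The helper names below are hypothetical; the sketch assumes the labeled instances are ordered first, as in the index ranges \(1\le i \le l\) and \(l< i \le l+u\), and represents \(\varvec{\varXi }\) by its diagonal only.

```python
def uniform_reliance_weights(l, u, nu1=1.0, nu2=0.1):
    """Diagonal of Xi: nu1 for the l labeled instances, nu2 for the u unlabeled ones."""
    for name, value in (("nu1", nu1), ("nu2", nu2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return [nu1] * l + [nu2] * u


def satisfies_weight_criteria(nu1, nu2):
    """True when nu1 = 1 and nu1 > nu2 > 0, the two criteria stated in the text."""
    return nu1 == 1.0 and nu1 > nu2 > 0.0


def weighted_kernel_matrix(K, weights):
    """K @ diag(weights): column j of K is scaled by the weight of instance j."""
    return [[k * w for k, w in zip(row, weights)] for row in K]


weights = uniform_reliance_weights(l=2, u=3)     # [1.0, 1.0, 0.1, 0.1, 0.1]
```

Storing only the diagonal avoids building the full \(n\times n\) matrix in Eq. (16), since multiplying by a diagonal matrix on the right is just a column-wise scaling.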

Experimental design

This section designs experiments to validate the effectiveness of the proposed ML-MR and ML-MRRW methods on some commonly used benchmark data sets. Other semi-supervised multi-label classification methods are tested as comparisons, across a range of performance metrics.

Data sets

Four public data sets from different domains are chosen for the experimental study. Table 1 presents the basic information about these data sets. The first data set “Emotions” [52] consists of sampled waveforms of sound clips generated from songs of different musical genres. Each instance is labeled with a subset of six emotions: amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-aggressive. The second data set “Scene” [7] is a commonly used image data set with each image represented by a 294-dimension feature vector and labeled with six classes: beach, sunset, field, fall-foliage, mountain, and urban. The third data set “Yeast” [19] consists of micro-array expression data and phylogenetic profiles for 2417 genes. Each gene is associated with a set of functional classes, which are grouped into 14 functional categories. The last data set “Mediamill” [46] consists of digital video archives from the TREC Video Retrieval Evaluation (TRECVID) challenge. This data set contains 120 features and 101 annotation concepts. These data sets are already formatted, so no further pre-processing is needed.

Table 1 Basic information of the selected public data sets

Experiment setup

In each experiment, the data set is first partitioned into two parts: the training data and out-of-sample testing data occupy two thirds and one third of the whole data set, respectively. Then, the labels of a portion of the instances in the training data are omitted to construct labeled training data and unlabeled training data. The labeling rate \(\eta \) is drawn from {5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%}. For each labeling rate, experiments are conducted 100 times by randomly resampling the labeled training data, unlabeled training data, and out-of-sample testing data. The first three data sets “Emotions”, “Scene”, and “Yeast” are fully used in the experiments, whereas only a portion (10% randomly selected) of the “Mediamill” data is used in view of the computational complexity of MR.
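One resampling run of this protocol might look like the sketch below. It assumes that the labeling rate \(\eta \) is the fraction of training instances that keep their labels, and that at least one labeled instance is always retained; both are readings of the setup rather than details stated in the text. The function name and the `seed` parameter are hypothetical.

```python
import random


def partition_indices(n, eta, seed=None):
    """Split range(n) into labeled-train, unlabeled-train and test index lists.

    Two thirds of the instances go to training and one third to testing;
    a fraction eta of the training instances keeps its labels (assumed reading).
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = (2 * n) // 3
    n_labeled = max(1, round(eta * n_train))     # keep at least one labeled instance (assumption)
    train, test = indices[:n_train], indices[n_train:]
    return train[:n_labeled], train[n_labeled:], test


labeling_rates = [r / 100 for r in range(5, 55, 5)]    # 5%, 10%, ..., 50%
labeled, unlabeled, test = partition_indices(90, eta=0.10, seed=0)
```

Repeating the call with different seeds for every entry of `labeling_rates` gives the randomly resampled splits described above.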

In the experiments, seven algorithms are compared: (1) the Multi-Label k Nearest Neighbors (MLkNN) [57], (2) the Multi-Label Gaussian Fields and Harmonic Functions (ML-GFHF) [56], (3) the Multi-Label Local and Global Consistency (ML-LGC) [56], (4) the Fixed-Size Multi-Label Regularized Kernel Spectral Clustering (ML-FSKSC) [33], (5) the Semi-Supervised Weak-Label approach (SSWL) [18], (6) the Multi-Label Manifold Regularization (ML-MR), and (7) the ML-MR with the Reliance Weighting strategy (ML-MRRW) in “Reliance weighted kernel for performance improvement”. All seven algorithms are applied in the first three experiments. In the last experiment, only six algorithms are applied; SSWL is excluded because the personal computer used could not run it owing to its high computational burden. Among the algorithms, MLkNN is supervised and all the others are semi-supervised. Accordingly, the MLkNN algorithm only uses the labeled training data in the training phase, whereas all the other algorithms exploit both the labeled training data and unlabeled training data. The parameters in each algorithm are determined by parameter exploration using a small portion of the data. For the ML-MRRW algorithm, the two parameters for the reliance weighting strategy are fixed at \([\nu _1,\nu _2]=[1,0.1]\).

Performance metrics

Many performance metrics or criteria for multi-label classification have been proposed; reviews may be found in [47] and [58]. In this work, three popular metrics are used to evaluate the performances of the algorithms in learning multi-label problems.

The average precision evaluates, for each relevant label, the fraction of labels ranked at or above it that are themselves relevant, averaged over the relevant labels of each instance and then over all instances. The larger its value, the better the learning performance:

$$\begin{aligned}&\text{ avgprec }(f)= \frac{1}{n} \sum _{i=1}^n \frac{1}{|\mathbf {y_i}|} \sum _{y_{ij}\in \mathbf {y_i}} \nonumber \\&\qquad \frac{|\{ y'_{ij}|\text{ rank}_f(\mathbf {x}_i,y'_{ij})\le \text{ rank}_f(\mathbf {x}_i,y_{ij}),y'_{ij}\in \mathbf {y}_i \}|}{\text{ rank}_f(\mathbf {x}_i,y_{ij})} \end{aligned}$$
(21)

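where \(\mathbf {y}_i\) is the set of relevant labels of instance i, \(y_{ij}\) is the jth label in \(\mathbf {y}_i\), and \(y'_{ij}\) ranges over the labels in \(\mathbf {y}_i\) that \(f\) ranks at or above \(y_{ij}\).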
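A direct, unoptimised transcription of Eq. (21) is sketched below. It assumes the classifier output is given as real-valued scores with higher scores ranked first, breaks ties by label index, and skips instances without relevant labels (for which the inner average is undefined); none of these conventions is specified in the text.

```python
def average_precision(scores, relevant):
    """Eq. (21). scores[i][k] is the score of label k for instance i (higher = ranked earlier);
    relevant[i] is the set of relevant label indices of instance i."""
    total, counted = 0.0, 0
    for s, rel in zip(scores, relevant):
        if not rel:
            continue                             # no relevant labels: skipped (assumption)
        order = sorted(range(len(s)), key=lambda k: -s[k])   # stable sort: ties keep label order
        rank = {label: pos + 1 for pos, label in enumerate(order)}
        inner = sum(
            sum(1 for other in rel if rank[other] <= rank[label]) / rank[label]
            for label in rel
        )
        total += inner / len(rel)
        counted += 1
    return total / counted if counted else 0.0
```

For example, with scores `[0.9, 0.2, 0.8]` and relevant labels `{0, 1}`, label 0 has rank 1 and contributes 1/1, while label 1 has rank 3 and contributes 2/3, so the instance scores (1 + 2/3)/2 = 5/6.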

F1 is a popular measure for single-label classification. It is the harmonic mean of precision and recall:

$$\begin{aligned} F1 = \frac{2\times tp}{2\times tp+fp+fn} \end{aligned}$$
(22)

where tp is the number of true positives, fp is the number of false positives, and fn is the number of false negatives. Micro-F1 and Macro-F1 are multi-label classifier metrics derived by computing the F1 measure across the label set: Micro-F1 applies F1 after summing the true positives, false positives, and false negatives across all labels, whereas Macro-F1 averages the per-label F1 values:

$$\begin{aligned}&F1_{micro}=F1\left( \sum _{\lambda =1}^{L}tp_\lambda ,\sum _{\lambda =1}^{L}fp_\lambda , \sum _{\lambda =1}^{L}fn_\lambda \right) \end{aligned}$$
(23)
$$\begin{aligned}&F1_{macro}=\frac{1}{L}\sum _{\lambda =1}^L F1 \left( tp_\lambda ,fp_\lambda ,fn_\lambda \right) \end{aligned}$$
(24)

where \(tp_\lambda \) is the number of true positives, \(fp_\lambda \) is the number of false positives, and \(fn_\lambda \) is the number of false negatives of label \(\lambda \) after being evaluated by binary evaluation of F1. Larger values of \(F1_{micro}\) and \(F1_{macro}\) denote better performance.
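Equations (22)–(24) can be sketched as follows for 0/1 label indicator matrices stored as lists of rows. Returning 0 when the F1 denominator vanishes is a convention assumed here, not one stated in the text, and the function names are illustrative.

```python
def f1(tp, fp, fn):
    """Eq. (22); returns 0.0 when the denominator is zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred):
    """Per-label (tp, fp, fn) from 0/1 indicator rows of equal length."""
    counts = []
    for k in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        counts.append((tp, fp, fn))
    return counts


def micro_f1(y_true, y_pred):
    """Eq. (23): F1 of the counts summed over all labels."""
    counts = label_counts(y_true, y_pred)
    return f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))


def macro_f1(y_true, y_pred):
    """Eq. (24): unweighted mean of the per-label F1 values."""
    counts = label_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts) / len(counts)
```

With true labels `[[1, 0], [1, 1], [0, 1]]` and predictions `[[1, 1], [0, 1], [0, 0]]`, the per-label counts are (1, 0, 1) and (1, 1, 1), giving Micro-F1 = 4/7 and Macro-F1 = (2/3 + 1/2)/2 = 7/12. The two metrics differ because Macro-F1 weights every label equally regardless of how often it occurs.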

Significance test

Statistical tests are commonly used to ensure that differences between machine-learning algorithms are meaningful [15, 23, 44]. In this paper, the Friedman test and a post hoc test are utilized. The Friedman test is a simple and robust nonparametric method for testing the differences between multiple algorithms over multiple data sets. For each data set separately, it ranks the algorithms by their performance scores, with average ranks assigned to ties: the best performing algorithm is assigned rank 1, the second best performing algorithm is assigned rank 2, and so on. Denote \(R_{i}\) as the sum of ranks for the ith algorithm (\(i=1,2,3, \ldots , K\)) over N different data sets. Then, the Friedman statistic \(F_{R}\) [22, 44] is given by

$$\begin{aligned} F_{R}=\frac{12}{NK(K+1)}\sum _{i=1}^{K} R_{i}^2 -3N(K+1). \end{aligned}$$
(25)

The null hypothesis \(H_{0}\) is that there are no significant differences between the algorithms; the alternative hypothesis \(H_{1}\) is that there are significant differences between the algorithms. \(F_{R}\) tests the null hypothesis \(H_{0}\) against the alternative hypothesis \(H_{1}\). For K larger than 5, the distribution of \(F_{R}\) can be approximated by a Chi-square distribution with \(K-1\) degrees of freedom. Thus, for any pre-chosen \(\alpha \) level of significance, the null hypothesis \(H_{0}\) is rejected if \(F_{R}>\chi _{\alpha }^2\). In this paper, there are 7 algorithms applied to the first three data sets, so \(K-1=6\). Thus, the critical Chi-square value is \(\chi _{\alpha }^2=12.592\) given \(\alpha =0.05\). Six algorithms are applied to the last data set, namely Mediamill, so \(K-1=5\). Thus, the critical Chi-square value is \(\chi _{\alpha }^2=11.070\) given \(\alpha =0.05\).
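The ranking step and Eq. (25) can be sketched in plain Python as below. The sketch assumes that larger scores are better (as for the three metrics used here) and hard-codes the two critical values quoted above instead of computing Chi-square quantiles; the names are illustrative.

```python
def average_ranks(scores):
    """Rank one data set's scores (highest score gets rank 1), averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1             # mean of the ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """Eq. (25). score_table[d][a] is the score of algorithm a on data set d."""
    N, K = len(score_table), len(score_table[0])
    rank_sums = [0.0] * K
    for row in score_table:
        for a, r in enumerate(average_ranks(row)):
            rank_sums[a] += r
    F_R = 12.0 * sum(R * R for R in rank_sums) / (N * K * (K + 1)) - 3.0 * N * (K + 1)
    return F_R, rank_sums


# Critical values quoted in the text for alpha = 0.05, keyed by degrees of freedom K - 1.
CHI2_CRITICAL = {6: 12.592, 5: 11.070}
```

As a small check of the formula, three algorithms ranked identically on three data sets have rank sums 3, 6 and 9, so \(F_R = 12\cdot 126/36 - 36 = 6\).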

When the null hypothesis is rejected, the analysis continues with a post hoc test [44]. Denote the difference \(D_{ij}=R_{i}-R_{j}\) between the rank sums of algorithms i and j. The performance of two algorithms is significantly different if the difference \(|D_{ij}|\) between their corresponding rank sums is no less than the critical difference

$$\begin{aligned} CD=z\sqrt{ \frac{NK(K+1)}{6} }, \end{aligned}$$
(26)

where z is the z-score from the standard normal curve corresponding to \(\frac{\alpha }{K(K-1)}\), and \(\alpha \) is the level of significance. The performance of algorithm i is significantly better than that of algorithm j if \(|D_{ij}|\ge CD\) and \(D_{ij}<0\), and significantly worse if \(|D_{ij}|\ge CD\) and \(D_{ij}>0\).
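A minimal sketch of Eq. (26) and the decision rule follows, using only the Python standard library. It interprets "the z-score corresponding to \(\frac{\alpha }{K(K-1)}\)" as the upper-tail quantile of the standard normal distribution, which is an assumption. Under that reading, setting \(N=1\) gives roughly 9.28 for \(K=7\) and 7.77 for \(K=6\); whether the rank sums in the experiments were normalised in a way that corresponds to \(N=1\) is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist


def critical_difference(N, K, alpha=0.05):
    """Eq. (26), with z taken as the upper-tail z-score for probability alpha / (K (K - 1))."""
    z = NormalDist().inv_cdf(1.0 - alpha / (K * (K - 1)))
    return z * sqrt(N * K * (K + 1) / 6.0)


def post_hoc_verdict(D_ij, cd):
    """Compare algorithms i and j from the rank-sum difference D_ij = R_i - R_j."""
    if abs(D_ij) < cd:
        return "no significant difference"
    return "i better than j" if D_ij < 0 else "i worse than j"
```

Because rank 1 denotes the best algorithm, a negative \(D_{ij}\) means that algorithm i has the smaller rank sum and is therefore the better of the two.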

Fig. 1 Performance metrics vs. labeling rates for seven classification algorithms applied to the “Emotions” data

Experimental results and discussion

We compare the proposed ML-MR and ML-MRRW against four well-known semi-supervised multi-label algorithms and one supervised multi-label algorithm on the chosen data sets. When calculating the Friedman statistic and the post hoc test for each data set, the ten sampled data sets under each labeling rate (from 5 to \(50\%\)) are considered as different data sets.

Case I: Emotions

The experimental results for the “Emotions” data are shown in Fig. 1. The sub-figures from left to right present the A-precision (A-precision stands for average precision), Micro-F1, and Macro-F1 for all the algorithms under different labeling rates, respectively. The error bars indicate one standard deviation of the metrics. Table 2 presents the calculated Friedman’s statistics \(F_{R}\) based on ranking scores for the three different performance metrics; all of them are greater than the critical Chi-square value \(\chi _{\alpha }^2=12.592\). Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the seven algorithms.

Table 2 The Friedman’s statistics \(F_{R}\) for different performance metrics in Case I
Table 3 The differences between the rank sums of the ML-MRRW and the other algorithms in Case I (MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL, ML-MR, and ML-MRRW are denoted by algorithms 1, 2, 3, 4, 5, 6, and 7)
Table 4 Comparison with the state-of-the-art literature [31] on the “Emotions” data
Table 5 Comparison with supervised multi-label ensemble algorithms in [37] on the “Emotions” data

Further, a post hoc test is carried out. The differences between the rank sums of the ML-MRRW and the other algorithms are calculated and presented in Table 3. Denote MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL, ML-MR, and ML-MRRW by algorithms 1, 2, 3, 4, 5, 6, and 7, respectively. Then, \(D_{7i},i=1,2,\ldots ,6\) represents the difference between rank sums of the ML-MRRW and the ith algorithm. The critical difference for \(K=7\) and \(\alpha = 0.05\) is \(CD=9.2815\). For each performance metric, any difference value \(|D_{7i}|\ge CD\) indicates a significant difference between ML-MRRW and the algorithm i with respect to this metric. Further, \(|D_{7i}|\ge CD\) and \(D_{7i}<0\) indicate ML-MRRW outperforms the algorithm i. From Table 3, \(D_{71}\), \(D_{73}\), \(D_{74}\), \(D_{75}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to A-precision; thus, ML-MRRW outperforms MLkNN, ML-LGC, ML-FSKSC, SSWL, and ML-MR in terms of A-precision. Moreover, \(D_{71}\), \(D_{72}\), \(D_{74}\), \(D_{75}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to Micro-F1 and Macro-F1; thus, it outperforms MLkNN, ML-GFHF, ML-FSKSC, SSWL, and ML-MR in terms of Micro-F1 and Macro-F1.

In general, the following conclusions can be drawn from the plots and tables:

  1. SSWL does not work well under low labeling rates; however, its performance improves markedly as the labeling rate increases, and it performs almost the same as MLkNN once the labeling rate exceeds \(30\%\). The other five semi-supervised multi-label learning algorithms show much better overall performance than the MLkNN and SSWL methods, except that ML-FSKSC has lower A-precision for large labeling rates.

  2. The ML-MRRW algorithm has the highest A-precision, Micro-F1, and Macro-F1 among all the multi-label learning algorithms for most of the labeling rates. Specifically, it defeats all the other approaches except ML-GFHF in terms of A-precision, and it outperforms all the other methods except ML-LGC regarding Micro-F1 and Macro-F1.

  3. Overall, ML-MRRW outperforms all the other algorithms.

Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31], and supervised multi-label ensemble algorithms in [37] on the “Emotions” data in Tables 4 and 5, respectively. The performance metrics include the mean values of A-precision, Micro-F1, and Macro-F1. The second last column presents the three metrics achieved by ML-MRRW under the labeling rate of \(50\%\) (also shown in Fig. 1). It can be found that ML-MRRW under this labeling rate outperforms most algorithms in terms of A-precision, Micro-F1, and Macro-F1. It also outperforms some ensemble algorithms, including \(MLS_{\text {train}}\), HOMER, AdaB.MH, TREMLC, and CBMLC, and it does almost as well as the other ensemble methods in Table 5 under the 50% labeling rate. The last column presents the metrics as the labeling rate increases to \(70\%\); at this labeling rate, ML-MRRW is found to outperform all of the baselines in both Tables 4 and  5.

Fig. 2 Performance metrics vs. labeling rates for seven classification algorithms applied to the “Scene” data

Case II: Scene

The experimental results for the “Scene” data are shown in Fig. 2. Table 6 presents the calculated Friedman’s statistics \(F_{R}\) according to ranking scores for the three different performance metrics. It can be found that all of them are greater than the critical Chi-square value \(\chi _{\alpha }^2=12.592\). Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the seven algorithms. Further, the differences between the rank sums of the ML-MRRW and the other algorithms are calculated and presented in Table 7. From Table 7, \(D_{71}\), \(D_{72}\), \(D_{73}\), \(D_{74}\), \(D_{75}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to A-precision; thus, ML-MRRW outperforms MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL and ML-MR in terms of A-precision. Moreover, \(D_{71}\), \(D_{72}\), \(D_{73}\), \(D_{75}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to Micro-F1 and Macro-F1; thus, it outperforms MLkNN, ML-GFHF, ML-LGC, SSWL and ML-MR in terms of Micro-F1 and Macro-F1.

Generally, the following conclusions can be drawn from the plots and tables:

  1. SSWL performs worse than the other approaches. It does not work well under low labeling rates, but its performance improves considerably as the labeling rate increases.

  2. The A-precision values of ML-LGC, ML-GFHF, ML-FSKSC, and MLkNN are quite close, whereas ML-MR and ML-MRRW have significantly larger values on this metric under different labeling rates.

  3. ML-MRRW defeats all the other algorithms in terms of A-precision, and it outperforms all the other approaches except ML-FSKSC regarding Micro-F1 and Macro-F1.

  4. Overall, ML-MRRW performs better than ML-FSKSC in terms of A-precision. ML-FSKSC and ML-MRRW achieve the best performances in terms of Micro-F1 and Macro-F1. ML-MRRW performs better than ML-FSKSC in terms of Micro-F1 and Macro-F1 under high labeling rates and worse under low labeling rates.

Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and supervised multi-label ensemble algorithms in [37] on the “Scene” data in Tables 8 and 9, respectively. The second last column presents the mean values of A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate 50% (also shown in Fig. 2). From Table 8, ML-MRRW under this labeling rate outperforms HOMER, ML-C4.5, PCT, and ML-KNN in terms of A-precision, outperforms ML-C4.5, PCT, ML-KNN, and RF-PCT in terms of Macro-F1, and outperforms ML-C4.5, PCT, RFML-C4.5 and RF-PCT in terms of Micro-F1. It also outperforms some ensemble algorithms, including \(MLS_{\text {train}}\), HOMER, AdaB.MH, and CBMLC, and it does almost as well as the other ensemble methods in Table 9. The last column presents the metrics as the labeling rate increases to \(90\%\); at this level, ML-MRRW is found to outperform all the baselines in both Tables 8 and 9.

Table 6 The Friedman’s statistics \(F_{R}\) for different performance metrics in Case II
Table 7 The differences between the rank sums of the ML-MRRW and the other algorithms in Case II (MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL, ML-MR, and ML-MRRW are denoted by algorithms 1, 2, 3, 4, 5, 6, and 7)
Fig. 3 Performance metrics vs. labeling rates for seven classification algorithms applied to the “Yeast” data

Table 8 Comparison with the state-of-the-art literature [31] on the “Scene” data
Table 9 Comparison with supervised multi-label ensemble algorithms in [37] on “Scene” data

Case III: Yeast

The experimental results for the “Yeast” data are shown in Fig. 3. Table 10 presents the calculated Friedman’s statistics \(F_{R}\) for the three different performance metrics. It can be found that all of them are greater than the critical Chi-square value \(\chi _{\alpha }^2=12.592\). Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the 7 algorithms. Further, the differences between the rank sums of the ML-MRRW and the other algorithms are calculated and presented in Table 11. From Table 11, \(D_{71}\), \(D_{72}\), \(D_{73}\), \(D_{74}\), \(D_{75}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to A-precision and Micro-F1; thus, ML-MRRW outperforms MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL and ML-MR in terms of A-precision and Micro-F1. Moreover, \(D_{71}\), \(D_{72}\) and \(D_{76}\) are less than 0 and their absolute values are larger than the critical value \(CD=9.2815\) with respect to Macro-F1; thus, it outperforms MLkNN, ML-GFHF and ML-MR in terms of Macro-F1.

Table 10 The Friedman’s statistics \(F_{R}\) for different performance metrics in Case III
Table 11 The differences between the rank sums of the ML-MRRW and the other algorithms in Case III (MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL, ML-MR, and ML-MRRW are denoted by algorithms 1, 2, 3, 4, 5, 6, and 7)
Table 12 Comparison with the state-of-the-art literature [31] on the “Yeast” data
Table 13 Comparison with supervised multi-label ensemble algorithms in [37] on “Yeast” data

In general, the following conclusions can be drawn from the plots and tables:

  1. SSWL does not work well under low labeling rates, but its performance improves considerably as the labeling rate increases. Furthermore, it outperforms the other methods in terms of Macro-F1 when the labeling rate is higher than \(15\%\).

  2. The ML-MRRW and ML-MR algorithms have the best performances in terms of the A-precision and Micro-F1 for all the labeling rates.

  3. ML-MRRW has the best performance among all the algorithms in terms of Micro-F1 and A-precision, but it performs worse than ML-FSKSC under all labeling rates in terms of Macro-F1. On this metric, it also performs worse than SSWL at high labeling rates and worse than ML-LGC at low labeling rates.

Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and supervised multi-label ensemble algorithms in [37] on the “Yeast” data in Tables 12 and 13, respectively. The second last column presents the mean values of the A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate 50% (also shown in Fig. 3). From Table 12, ML-MRRW under this labeling rate outperforms all the algorithms in terms of A-precision, outperforms ML-C4.5, PCT, ML-KNN, RFML-C4.5, and RF-PCT in terms of Macro-F1, and outperforms all the algorithms except for HOMER in terms of Micro-F1. It also outperforms some ensemble algorithms, including EBR, \(MLS_{\text {train}}\), AdaB.MH, ELP, EPS, TREMLC, RF-PCT, and CBMLC, and it does almost as well as the other ensemble methods in Table 13. The last column presents the metrics as the labeling rate increases to \(75\%\); at this level, ML-MRRW is found to outperform all the baselines in both Tables 12 and 13.

Fig. 4 Performance metrics vs. labeling rates for six classification algorithms applied to the “Mediamill” data

Case IV: Mediamill

The experimental results for the “Mediamill” data are shown in Fig. 4. Table 14 presents the calculated Friedman’s statistics \(F_{R}\) for the three different performance metrics. It can be found that all of them are greater than the critical Chi-square value \(\chi _{\alpha }^2=11.070\). Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the six algorithms.

Further, the differences between the rank sums of the ML-MRRW and the other algorithms are calculated and presented in Table 15. Denote MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, ML-MR, and ML-MRRW by algorithms 1, 2, 3, 4, 5, and 6, respectively. Then, \(D_{6i},i=1,2,\ldots ,5\) represents the difference between rank sums of the ML-MRRW and the ith algorithm. The critical difference for \(K=6\) and \(\alpha = 0.05\) is \(CD=7.7658\). For each performance metric, any difference value \(|D_{6i}|\ge CD\) indicates a significant difference between ML-MRRW and the algorithm i with respect to this metric. Further, \(|D_{6i}|\ge CD\) and \(D_{6i}<0\) indicate ML-MRRW outperforms the algorithm i. From Table 15, \(D_{61}\) and \(D_{63}\) are less than 0 and their absolute values are larger than the critical value \(CD=7.7658\) with respect to A-precision; thus, ML-MRRW outperforms MLkNN and ML-LGC in terms of A-precision. Moreover, \(D_{61}\), \(D_{62}\), \(D_{63}\), \(D_{64}\) and \(D_{65}\) are less than 0 and their absolute values are larger than the critical value \(CD=7.7658\) with respect to Micro-F1 and Macro-F1; thus, it outperforms all the other algorithms in terms of Micro-F1 and Macro-F1.

Generally, the following conclusions can be drawn from the plots and tables:

  1. In the A-precision sub-figure, ML-MRRW and ML-MR outperform MLkNN and ML-LGC. They perform better than ML-GFHF and ML-FSKSC at high labeling rates but worse at low labeling rates.

  2. In the Micro-F1 and Macro-F1 sub-figures, the ML-MR and ML-MRRW methods outperform all the other methods by a wide margin under all labeling rates. In particular, the ML-MRRW method achieves the best performance on these two metrics.

  3. Overall, ML-MRRW shows superior performance over all the other algorithms in terms of Micro-F1 and Macro-F1, and it shows great potential for high-dimensional data sets with a large number of labels.

Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and supervised multi-label ensemble algorithms in [37] on the “Mediamill” data in Tables 16 and 17, respectively. Note that these experiments in the literature consider the whole Mediamill data set, as opposed to a randomly selected subset (redrawn for each experimental run) as in our work. The second last column presents the mean values of the A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate 50% (also shown in Fig. 4). From Table 16, ML-MRRW under this labeling rate outperforms all algorithms in terms of the three metrics, except for RF-PCT in terms of A-precision. It is also superior to all the supervised ensemble algorithms in [37] from Table 17. The last column presents the metrics as the labeling rate increases to \(65\%\); at this level, ML-MRRW is found to outperform all the baselines in both Tables 16 and 17.

Table 14 The Friedman’s statistics \(F_{R}\) for different performance metrics in Case IV
Table 15 The differences between the rank sums of the ML-MRRW and the other algorithms in Case IV (MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, ML-MR, and ML-MRRW are denoted by algorithms 1, 2, 3, 4, 5 and 6)
Table 16 Comparison with the state-of-the-art literature [31] on the “Mediamill” data
Table 17 Comparison with supervised multi-label ensemble algorithms in [37] on “Mediamill” data

Conclusion

This paper studies the semi-supervised multi-label classification problem, and extends the graph-based manifold regularization to the multi-label case. The proposed method has three essential components: the graph construction, the manifold regularization with multiple labels, and the exploitation of a reliance weighting strategy. This last component is intended to improve the learning ability by assigning higher weights to the labeled training data and lower weights to the unlabeled training data. Extensive experiments are conducted on four public data sets from different domains to test the performance of the proposed Multi-Label Manifold Regularization (ML-MR), both with and without the Reliance Weighting (RW) strategy. Other well-known semi-supervised and supervised multi-label algorithms are tested as comparisons. Generally, the experimental results show that the proposed ML-MRRW algorithm has overall better performance than all the other algorithms under different labeling rates. In addition, ML-MRRW shows better performance than ML-MR, indicating that the proposed reliance weighting strategy is effective in improving the learning performance of the ML-MR method. Further, unlike the other algorithms, ML-MRRW works consistently well on all the data sets. ML-MRRW is also compared with 12 supervised multi-label algorithms and 12 ensemble approaches from the literature on the public data sets. As evidenced by the results, ML-MRRW outperforms all of these supervised baselines on these data sets once the labeling rate is sufficiently high. All in all, ML-MRRW is a promising semi-supervised multi-label algorithm for classification.