Semi-supervised multi-label classification using an extended graph-based manifold regularization

Graph-based algorithms are known to be effective approaches to semi-supervised learning. However, there has been relatively little work on extending these algorithms to the multi-label classification case. We derive an extension of the Manifold Regularization algorithm to multi-label classification, which is significantly simpler than the general Vector Manifold Regularization approach. We then augment our algorithm with a weighting strategy that gives instances with ground-truth labels a different influence on the model than instances with induced labels. Experiments on four benchmark multi-label data sets show that the resulting algorithm performs better overall than existing semi-supervised multi-label classification algorithms at various levels of label sparsity. Comparisons with state-of-the-art supervised multi-label approaches (which are trained on fully labeled data) also show that our algorithm outperforms all of them, even when a substantial number of the training examples are unlabeled.


Introduction
In many real-world applications, such as bioinformatics and video annotation, obtaining labeled data is sometimes very difficult, expensive and time-consuming. On the other hand, it may be simple and inexpensive to obtain unlabeled data. For instance, vast numbers of videos and images are available on the web. The large amount of unlabeled data can reveal useful information about the phenomena we are studying, e.g., estimating the distribution of the data as well as the data structure [68]. As a result, Semi-Supervised Learning (SSL) is drawing increasing interest in the machine-learning community [10].
Studies on SSL are extensive (e.g., [2,4,12,13,32,45,51,62,66]); detailed reviews may be found in [65] and [42]. The common purpose of semi-supervised algorithms is to exploit both labeled and unlabeled data to create classifiers superior to those built from labeled data alone. According to [10], self-training (also known as self-learning or self-labeling) is among the earliest approaches to using unlabeled data in classification; the idea of self-training first appeared in [41]. In self-training, a classifier is first trained only on the labeled data, and then used to predict labels for some unlabeled data. The classifier is then re-trained with both the ground-truth and predicted labels, and used to predict additional labels; the process repeats until all examples are labeled. The authors in [42] use the expectation-maximization (EM) algorithm [14] for SSL. Co-Training [6] is a learning paradigm for problems with strong structural prior knowledge, and is regarded as a variant of EM on the probabilistic model [10,42]. It assumes that the features can be split into two complementary and independent subsets, each sufficient to train a classifier for the data. Each classifier then uses its most confidently predicted points and their labels to teach the other classifier; this process is iterated until some criterion is met. Transductive learning is another approach, based on the idea of performing predictions only for test samples [10]; Transductive Support Vector Machines (TSVM) are one example [54]. Various extensions to the TSVM have been proposed [9,11,16,60]; their common point is that the algorithms learn a hyperplane over the labeled and unlabeled data by optimizing a tradeoff between maximizing the margin over the labeled data and regularizing the decision boundary over low-density regions of all data samples.
Graph-based algorithms are an important sub-class of SSL that have recently attracted considerable attention [10,48,49]. Various graph-based SSL algorithms have been developed [3,5,25,28,53,55,56,59,64,67] and a number of successful applications can be found in recent publications [1,29,30,61]. Some popular graph-based algorithms include Local and Global Consistency [64], Gaussian Random Fields and Harmonic Functions [67], mincuts [5], greedy max-cut [55], and spectral graph transducers [28]. All the graph-based algorithms begin by constructing a graph with nodes representing data points, and edges representing similarity between the connected nodes. The labeled data points are then used to perform graph clustering or propagate labels from labeled points to unlabeled points, by minimizing the empirical cost over labeled data and regularizing the smoothness over the graph using all the data. Another representative SSL approach is manifold regularization [3], which assumes data points lie on a low-dimensional manifold in the input space [20,35,50].
At the same time, most of the above semi-supervised classification algorithms implicitly assume that class labels are mutually exclusive. However, in many application domains, such as image classification, bioinformatics, and news categorization, each instance can represent more than one concept simultaneously; this is best represented as a vector of labels. In addition, human emotions and sentiments are nowadays sometimes regarded as a multi-label classification problem, e.g., multiple fine-grained emotions may coexist in a single tweet of a microblog [21]. Multi-label classifiers have also recently been utilized for recognizing crop diseases in agriculture [27]. The learning algorithms for these problems are the "multi-label classifiers" reviewed in [47,58]. For instance, a well-known multi-label classifier is Multi-Label k Nearest Neighbors (MLkNN) [57], an extension of the classical kNN method. References [31,37], and [39] study a variety of supervised multi-label algorithms and present extensive experiments comparing their performance.
Our focus in the current paper is the intersection of these two problems, to wit, the design of semi-supervised multi-label classifiers. There is relatively little work in the literature on this sub-problem, and a particular dearth of graph-based semi-supervised algorithms for the multi-label case. Existing semi-supervised studies include Multi-Label Gaussian Fields and Harmonic Functions (ML-GFHF) [56], Multi-Label Local and Global Consistency (ML-LGC) [56], Fixed-Size Multi-Label Regularized Kernel Spectral Clustering (ML-FSKSC) [33], and the Semi-Supervised Weak-Label approach (SSWL) [18]. In spite of these results, the opportunities in this area are extensive; better methods are needed for semi-supervised multi-label classification in many tasks.
In our previous work [29], we found that a multi-label extension of the Manifold Regularization algorithm [3] was quite effective for non-intrusive load monitoring. In the current paper, we seek to improve upon that algorithm, and determine how well our results generalize beyond that domain. We investigate a multi-label extension of the Manifold Regularization (MR) algorithm, augmented with a reliance weighting strategy to further improve classification performance. Reliance weights allow learning algorithms to differentiate between ground-truth and induced labels in constructing a classifier for a given data set. They take the form of an additional matrix term in the kernel expansion of the Laplacian Regularized Least Squares model learned in MR [3]. We evaluate our proposed algorithm in comparison with five other multi-label algorithms (four semi-supervised algorithms plus MLkNN), on a set of four benchmark data sets.
The key contributions of this work are:
- The manifold regularization algorithm is extended to learn multi-label classifiers.
- A weighting strategy is proposed to vary the trust placed in labeled and unlabeled instances when forecasting labels for unseen points.
- The proposed approach is compared against four semi-supervised, and one fully supervised, multi-label algorithms, and performs as well as or better than all of them.
The advantages of the proposed method are threefold: (1) the proposed method performs as well as or better than the existing semi-supervised multi-label algorithms on the four data sets in the fifth section; it furthermore outperforms state-of-the-art supervised multi-label algorithms (which are trained on fully labeled data), even when a substantial portion of the training set is unlabeled. (2) The proposed method has low model complexity, as Manifold Regularization [3] assumes data points lie on a low-dimensional manifold in the input space. (3) The proposed reliance weighting strategy allows an analyst to specify different trust levels for ground-truth and induced labels. The disadvantage of the method mainly lies in the computational time required for constructing the graph structure; this is a common problem in this class of algorithms. The remainder of this paper is organized as follows: First, we present the preliminaries, including the basics and notation, regularization in reproducing kernel Hilbert space, and manifold regularization. Then, we present the proposed approach, including graph construction, manifold regularization with multiple labels, and our reliance weighting strategy. After that, we describe the experimental design, including the data sets, experimental setup, performance metrics, and statistical significance tests. Last, the experimental results and discussion are presented, and we offer a summary and discussion of future work.

Preliminaries
This section presents the notations and basics that are used throughout the paper, and reviews the manifold regularization algorithm.

Basics and notations
In the framework of semi-supervised learning, the data set D in the training phase consists of two parts, namely D = D_l ∪ D_u, where D_l and D_u indicate the labeled and unlabeled training data sets, respectively. Both D_l and D_u are drawn from the same distribution p(x), where x indicates a feature variable. In the single-label case, the feature space and label space of a data set D are denoted by X = R^d and Y = {−1, 1}, respectively. Then, the labeled and unlabeled training data sets are represented by $D_l = \{(x_i, y_i) : x_i \in \mathcal{X}, y_i \in \mathcal{Y}, i = 1, 2, \ldots, l\}$ and $D_u = \{x_i \in \mathcal{X}, i = l+1, \ldots, l+u\}$, where l and u indicate the numbers of labeled and unlabeled instances, respectively. The total number of training instances in D is n = l + u. The goal of semi-supervised learning with a single label is to infer the labels $\tilde{Y} = \{\tilde{y}_i \in \mathcal{Y}, i = 1, 2, \ldots, e\}$ for future instances $D_e = \{x_i \in \mathcal{X}, i = 1, 2, \ldots, e\}$ given the training data set D = D_l ∪ D_u [49,68]. In the multi-label case, the label space of D is denoted by Y = {−1, 1}^L, where L indicates the number of labels. Analogously, the labeled training data set becomes $D_l = \{(x_i, y_i) : x_i \in \mathcal{X}, y_i \in \mathcal{Y}, i = 1, 2, \ldots, l\}$ and the label vector is $y_i = [y_{i1}, y_{i2}, \ldots, y_{iL}]^T$, whereas the other notations remain the same as in the single-label case. The goal of semi-supervised learning with multiple labels is to infer the labels $\tilde{Y} = \{\tilde{y}_i \in \mathcal{Y}, i = 1, 2, \ldots, e\}$ for $D_e = \{x_i \in \mathcal{X}, i = 1, 2, \ldots, e\}$ given D = D_l ∪ D_u.
In graph-based semi-supervised learning, a crucial step is to construct a graph G = (V, E) representing the connections between training instances x_i ∈ X [49,56,68]. Specifically, G = (V, E) has n vertices V_i, and each vertex V_i represents an instance x_i, i = 1, 2, …, n. E_ij is an edge connecting vertices V_i and V_j. There are three typical methods to construct such a graph: the k-nearest-neighbor algorithm, the ε distance measure, and full connection. For example, using the k-nearest-neighbor algorithm, an edge E_ij connects the vertices V_i and V_j if vertex V_i is among the k nearest neighbors of vertex V_j, or vertex V_j is among the k nearest neighbors of vertex V_i. A weight matrix W is defined over the graph G = (V, E), where W_ij is the weight associated with edge E_ij, representing the similarity between vertices V_i and V_j (namely the training instances x_i and x_j). Then, the unnormalized graph Laplacian is given by $L = D - W$, where D is the diagonal degree matrix with $D_{ii} = \sum_{j} W_{ij}$. The label inference in graph-based SSL is usually based on two graph assumptions [56,68]: (1) the prediction should be close to the given labels on labeled vertices; (2) the prediction should be smooth over the whole graph (i.e., vertices that are close in the graph tend to have the same labels). The label inference algorithms for graph-based SSL can be categorized into two major classes: transductive learning (e.g., graph Laplacian regularization [64,67]) and inductive learning (e.g., manifold regularization [3]). Transductive learning infers labels only on the unlabeled training data and cannot make predictions on out-of-sample data. By contrast, inductive learning infers labels for the whole domain, i.e., a function f : X → Y is learned given D = D_l ∪ D_u and then the labels for D_e are predicted. The work in this paper is based on manifold regularization [3], which is a typical inductive learning method [63].
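As a concrete illustration of this construction, the following is a minimal sketch that builds a symmetrized k-nearest-neighbor similarity graph with a Gaussian kernel and forms the unnormalized Laplacian. The function name, parameter values, and toy data are our illustrative assumptions, not the paper's code.

```python
import numpy as np

def knn_graph_laplacian(X, k=2, sigma=1.0):
    """Build a symmetric k-NN similarity graph and its unnormalized Laplacian.

    An edge (i, j) is kept if i is among the k nearest neighbors of j or
    vice versa; kept edges are weighted with a Gaussian kernel. Sketch only.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # Indices of the k nearest neighbors of i (index 0 is i itself).
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)          # symmetrize: edge if either direction holds
    L = np.diag(W.sum(axis=1)) - W  # unnormalized Laplacian L = D - W
    return W, L
```

By construction each row of L sums to zero and L is positive semi-definite, which is what the smoothness penalty below relies on.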
The next subsection revisits regularization in a reproducing kernel Hilbert space, which is the core of manifold regularization.

Regularization in reproducing kernel Hilbert space
For a Mercer kernel K : X × X → R, there exists an associated Reproducing Kernel Hilbert Space (RKHS) H_K of functions X → R with the norm ||·||_K [40]. Standard supervised learning estimates an unknown function f ∈ H_K from the labeled data set D_l as

$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2, \qquad (1)$$

where V(x_i, y_i, f) is the loss function, such as the squared error loss $(y_i - f(x_i))^2$ for regularized least squares (RLS). $\|f\|_K^2$ is a regularization term in the RKHS imposing a smoothness condition on possible solutions. γ_A balances the tradeoff between the empirical cost and the regularization term, and l is the number of labeled instances.
The difference between semi-supervised learning and supervised learning lies in the utilization of the marginal distribution of D = D_l ∪ D_u, in addition to the empirical cost over the labeled data set D_l, to improve learning performance. According to the discussion in [3], there is an identifiable relation between the marginal distribution p(x) and the conditional distribution p(y|x): if two instances x_i, x_j ∈ X are close in the intrinsic geometry of p(x), then their conditional distributions p(y|x_i) and p(y|x_j) are similar. Thus, another regularization term can be added to ensure that the solution is smooth with respect to the marginal distribution p(x). Incorporating the smoothness penalty term with respect to the graph Laplacian L, we derive the following optimization problem [3]:

$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{n^2} \mathbf{f}^T L \mathbf{f}, \qquad (2)$$

where $\mathbf{f} = [f(x_1), f(x_2), \ldots, f(x_n)]^T$, and $\mathbf{f}^T L \mathbf{f}$ is a penalty term that reflects the intrinsic structure of the probability distribution p(x). n = u + l is the total number of instances. The normalizing coefficient $\frac{1}{n^2}$ is the natural scale factor for the empirical estimate of the Laplace operator. The coefficients γ_A and γ_I control the complexity of the function in the ambient space and in the intrinsic geometry of p(x), respectively. In real-world data sets, p(x) is unknown, but an empirical estimate can be obtained from a sufficiently large amount of unlabeled data D_u by assuming the data set lies on a manifold in R^d and modeling the manifold with the adjacency graph G = (V, E) built from the data set D. According to the classical Representer Theorem [40], the solution to Eq. (2) in H_K is given by [3]

$$f^*(x) = \sum_{i=1}^{n} \theta_i K(x_i, x), \qquad (3)$$

which is an expansion of the Representer Theorem in terms of the labeled and unlabeled data D = D_l ∪ D_u. Accordingly, the problem is essentially an optimization problem over the space of coefficients θ_i. The RKHS has been extended to vector-valued functions [8] to formulate the vector-valued manifold regularization [35].
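For the squared-error loss, the coefficients θ_i of the Laplacian RLS solution have a well-known closed form [3]. The following is a minimal single-label sketch assuming that standard LapRLS form; the function name, regularization values, and toy data are our illustrative assumptions.

```python
import numpy as np

def lap_rls(K, L, y, n_labeled, gamma_A=1e-2, gamma_I=1e-2):
    """Closed-form Laplacian RLS coefficients (single-label sketch).

    K : (n, n) kernel matrix over labeled + unlabeled points
    L : (n, n) graph Laplacian
    y : (n,) labels in {-1, +1} for the first n_labeled entries, 0 elsewhere
    Returns theta with f(x) = sum_i theta_i K(x_i, x).
    """
    n = K.shape[0]
    l = n_labeled
    J = np.diag([1.0] * l + [0.0] * (n - l))   # selects the labeled points
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, J @ y)
```

On two well-separated one-dimensional clusters with one labeled point each, the induced function takes the labeled point's sign on the nearby unlabeled point, which is the behavior the smoothness penalty is designed to produce.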
Let $F = (f(x_1), \ldots, f(x_n)) \in \mathcal{Y}^n$ collect the evaluations of a vector-valued function $f \in \mathcal{H}_K$ [35]. Here Y can be R for the single-label case or R^L for the multi-label case. The optimization problem of the vector-valued manifold regularization is given by [35]

$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l}\sum_{i=1}^{l} \|f(x_i) - y_i\|_{\mathcal{Y}}^2 + \gamma_A \|f\|_K^2 + \gamma_I \langle F, MF \rangle, \qquad (4)$$

where the matrix M is a symmetric, positive operator, such as one constructed from the graph Laplacian. It has been proved in [35] that the minimization problem in (4) has a unique solution taking the form $f^* = \sum_{i=1}^{n} K(x_i, \cdot)\, a_i$ with coefficients $a_i \in \mathcal{Y}$. The vector-valued manifold regularization is a generalized form of manifold regularization, and can be used for single-label, multi-label, and multi-view learning [35,36].
The Representer Theorem in the vector-valued RKHS is given and proved in [35]. Let $\mathcal{H}_{K,\mathbf{x}}$ denote the subspace of $\mathcal{H}_K$ spanned by $\{K(x_i, \cdot)\, y : 1 \le i \le n,\ y \in \mathcal{Y}\}$. As a result, the minimizer of (4) must lie in $\mathcal{H}_{K,\mathbf{x}}$.

The proposed method
The work in [3] originally proposed manifold regularization, and showed that the Representer Theorem yields the minimizer of the Laplacian RLS error in univariate cases; further, reference [35] proved the Representer Theorem for the general case of vector-valued manifold regularization. Following these two fundamental theoretical works, this work on multi-label manifold regularization is essentially an important special case of the theorem in [35]. In the existing literature, there is no study of this special case; in particular, no simpler proof has been advanced that the kernel coefficients in Eq. (3) remain a solution to the Laplacian RLS minimization. We follow a long tradition in mathematics in which simpler proofs for interesting special cases remain valuable even after the general case has been proven. For instance, Dirichlet's theorem was first proved in [17] in the 19th century; nonetheless, studies of special cases of Dirichlet's theorem, especially those having elementary proofs (e.g., [24,38,43]), continue to this day [34]. Analogously, studying the multi-label classification case of MR is an interesting and novel contribution. We also introduce the reliance weighting strategy, and prove that our modified algorithm remains a solution to the Laplacian RLS problem. The major challenges include: (1) formulating the optimization problem of manifold regularization with multiple labels, given that the data structure differs from single-labeled data; (2) solving the optimization problem so as to guarantee that a unique global solution exists; (3) deriving the solution when a reliance weight matrix is included.

Graph construction
Given the whole data set D = D_l ∪ D_u, a full n × n distance matrix U is calculated between each pair of instances x_i, x_j ∈ X based on a Gaussian kernel K(x_i, x_j) as

$$U_{ij} = K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \qquad (5)$$

where σ denotes the bandwidth of the Gaussian kernel. Equivalently, an alternative distance matrix H can be calculated with each element H_ij given by [26,55]

$$H_{ij} = \sqrt{U_{ii} + U_{jj} - 2U_{ij}}. \qquad (6)$$

The constructed graph G = (V, E) is a fully connected graph with each edge E_ij weighted by H_ij. According to [26,55], graph sparsification can improve the efficiency of label inference. Edges are removed, producing an n × n binary matrix B with 1's and 0's representing the presence and absence of connections, respectively. Three sparsification approaches can be used: the ε-neighbor search, the k-nearest-neighbor search, and b-matching [26,55]:

1. The ε-neighbor search recovers a binary matrix B as
$$B_{ij} = \begin{cases} 1, & H_{ij} \le \varepsilon,\ i \ne j, \\ 0, & \text{otherwise}. \end{cases} \qquad (7)$$

2. The k-nearest-neighbor search obtains the binary matrix B by minimizing the following optimization problem:
$$\min_{B \in \{0,1\}^{n \times n}} \sum_{i,j} B_{ij} H_{ij} \quad \text{s.t.} \quad \sum_{j} B_{ij} = k,\ B_{ii} = 0. \qquad (8)$$

3. Using the b-matching algorithm, the optimization problem to recover B is
$$\min_{B \in \{0,1\}^{n \times n}} \sum_{i,j} B_{ij} H_{ij} \quad \text{s.t.} \quad \sum_{j} B_{ij} = b,\ B_{ii} = 0,\ B_{ij} = B_{ji}. \qquad (9)$$

The binary matrix B obtained using the k-nearest-neighbor search is not symmetric; thus the final B can be calculated as B_ij = max(B_ij, B_ji). By contrast, the b-matching algorithm produces a graph in which every node has the same number of neighbors, namely B = B^T. Whichever of the above methods is applied, the weight for edge E_ij is set to 0 if B_ij = 0. For an edge E_ij with B_ij = 1, the weight W_ij can be calculated with respect to the distance matrix H as

$$W_{ij} = \exp\left(-\frac{H_{ij}^2}{2\sigma^2}\right). \qquad (10)$$

The final graph G = (V, E) is then constructed and represented by a sparse weight matrix W. Proceeding to label inference, the graph Laplacian is calculated as $L = D - W$, where D is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.
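The sparsification step can be sketched as follows for the ε-neighbor and symmetrized k-nearest-neighbor variants (b-matching requires a combinatorial solver and is omitted). The helper name and toy distance matrix are our illustrative assumptions.

```python
import numpy as np

def sparsify(H, method="knn", eps=1.0, k=2):
    """Recover a binary adjacency matrix B from a distance matrix H (sketch).

    "eps": keep edges with H_ij <= eps; "knn": keep each row's k smallest
    distances, then symmetrize with B_ij = max(B_ij, B_ji).
    """
    n = H.shape[0]
    B = np.zeros((n, n), dtype=int)
    if method == "eps":
        B = (H <= eps).astype(int)
        np.fill_diagonal(B, 0)          # no self-loops
    else:
        for i in range(n):
            order = np.argsort(H[i])
            order = order[order != i][:k]  # k nearest neighbors of i
            B[i, order] = 1
        B = np.maximum(B, B.T)             # symmetrize
    return B
```

The resulting B then masks the weight matrix W as described above.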

Manifold regularization with multiple labels
In this subsection, we extend the manifold regularization in [3] to solve multi-label learning problems. Let $X = [x_1, x_2, \ldots, x_n]^T$ and $Y = [y_1, y_2, \ldots, y_n]^T$ denote the matrices of all feature instances and label instances, respectively. In Y, y_i takes 1 or −1 for its elements when i ≤ l, and y_i is an all-zero vector for l < i ≤ n. In the framework of the Laplacian Regularized Least Squares (LapRLS) [3], the optimization problem of manifold regularization with multiple labels is

$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l}\operatorname{tr}\big((Y - F)^T J (Y - F)\big) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{n^2} \operatorname{tr}(F^T L F), \qquad (11)$$

where $F = [f_1, f_2, \ldots, f_L]$, with column $f_j = [f_j(x_1), \ldots, f_j(x_n)]^T$ for j = 1, …, L, is a matrix representing the predicted outputs, tr(·) denotes the trace of a matrix, and J is an n × n diagonal matrix with the diagonal elements given by

$$J_{ii} = \begin{cases} 1, & i \le l, \\ 0, & \text{otherwise}. \end{cases} \qquad (12)$$

The second term $\|f\|_K^2$ in (11) measures the complexity of F in the ambient space. The third term represents the intrinsic smoothness with respect to the geometric distribution. L is the graph Laplacian obtained in the graph construction phase. The optimization problem in (11) is essentially a natural extension of the LapRLS to multi-label cases, as indicated in [35].
The minimization problem in Eq. (11) is guaranteed to have a unique global solution. The theorem for the solution of (11) is given and proved as follows.

Theorem 1
The minimizer of the optimization problem in Eq. (11) admits an expansion in terms of the labeled and unlabeled instances,

$$f_j^*(x) = \sum_{i=1}^{n} \theta_{ij} K(x_i, x), \quad j = 1, 2, \ldots, L, \qquad (13)$$

where K(·,·) represents the kernel function, which must be positive semi-definite.
Proof In the multi-label classification problem (11), the norm of the function f can be represented by the sum over each function f_j in the Reproducing Kernel Hilbert Space H_K, i.e., $\|f\|_K^2 = \sum_{j=1}^{L} \|f_j\|_K^2$. Any function in the RKHS H_K can be decomposed into two orthogonal components; specifically, each f_j can be decomposed into a function $f_j^0$ in the linear subspace spanned by $\{K(x_i, \cdot)\}_{i=1}^{n}$ and a component $f_j^1$ orthogonal to it [3]. Accordingly, f_j can be represented by

$$f_j = \sum_{i=1}^{n} \theta_{ij} K(x_i, \cdot) + f_j^1.$$

The empirical terms in (11) depend on f_j only through the evaluations $f_j(x_k)$, which are unaffected by $f_j^1$, while $\|f_j\|_K^2 = \|f_j^0\|_K^2 + \|f_j^1\|_K^2 \ge \|f_j^0\|_K^2$; the equality is achieved if and only if $\|f_j^1\|_K = 0$, so the minimizer takes the form (13). Denote by K the n × n matrix of kernel evaluations with respect to all the data samples X, and by $\Theta = [\theta_{ij}]$ the n × L matrix of coefficients. The solution can be represented by

$$F^* = K\Theta. \qquad (14)$$

Therefore, the problem in Eq. (11) is reduced to optimizing over the finite-dimensional space of coefficients Θ. According to [3], the kernel function K(·,·) must be positive semi-definite, which gives rise to an RKHS. A choice of the kernel function is the heat kernel, which can be approximated using a sharp Gaussian kernel. Thus, U in Eq. (5) can be taken as the kernel matrix K.

Reliance weighted kernel for performance improvement
In the framework of manifold regularization, the classifier is trained using both the labeled training set D_l and the unlabeled training set D_u. Although both D_l and D_u contribute to the classification, the prediction of the label vector ỹ of an unseen future sample x̃ is based on the label information provided by the labeled training set D_l. Naturally, this motivates us to place more trust in the labeled training set than in the unlabeled one for out-of-sample prediction. Thus, a reliance weighting strategy is proposed to assign different weights to the training instances, allowing samples from D_l to have greater influence than those from D_u. Given a heat kernel function K(x_i, x), the weighted kernel function for x is

$$\tilde{K}(x_i, x) = \Xi_i K(x_i, x),$$

where Ξ_i represents the reliance weight of the ith instance. Denote by K̃ the matrix of weighted kernel evaluations with respect to all the data samples X, and define the reliance weight matrix Ξ as the diagonal matrix $\Xi = \operatorname{diag}(\Xi_1, \Xi_2, \ldots, \Xi_n)$. Then, the weighted kernel matrix is $\tilde{K} = K\Xi$. To yield the minimizer in (13), the kernel function K̃(·,·) must be positive semi-definite.
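A tiny numerical illustration of the weighted kernel matrix: the kernel values and weights below are made up for illustration only.

```python
import numpy as np

# Reliance-weighted kernel: each column of K is scaled by the reliance
# weight of the training instance it corresponds to (sketch).
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])          # heat-kernel estimates for two instances
Xi = np.diag([1.0, 0.3])            # full trust in instance 1, less in 2
K_tilde = K @ Xi                    # weighted kernel matrix K~ = K Xi
```

With these numbers, the second column of K is shrunk by the factor 0.3, so the second instance contributes less to any kernel expansion built on K̃.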

Proof Given an arbitrary vector $v = [v_1, v_2, \ldots, v_n]^T$,

$$v^T \tilde{K} v = \sum_{i=1}^{n}\sum_{j=1}^{n} v_i v_j K(x_i, x_j)\, \Xi_j,$$

where v_i and v_j are the ith and jth elements of v. The kernel estimation based on a heat kernel function is always nonnegative. As a conclusion, the weighted kernel $\tilde{K}(\cdot, \cdot) = K(\cdot, \cdot) \cdot \Xi_i$ is positive semi-definite if and only if Ξ_i ≥ 0.
Using the reliance weighted kernel function instead of the heat kernel function, the solution in (14) becomes

$$F^* = \tilde{K}\Theta = K\Xi\Theta.$$

The coefficient matrix Θ* can be estimated by differentiating the right-hand side of (11) with respect to Θ and setting the derivative to zero. The coefficient matrix is eventually obtained as

$$\Theta^* = \left(J K \Xi + \gamma_A l\, I + \frac{\gamma_I l}{n^2}\, L K \Xi\right)^{-1} Y,$$

where I is an n × n identity matrix. For unseen future samples $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_e]^T$ in D_e, the label matrix F̃ is obtained as follows: first, an e × n kernel matrix K_e is calculated using Eq. (5), i.e., $(K_e)_{ij} = K(\tilde{x}_i, x_j)$ for i = 1, 2, …, e and j = 1, 2, …, n. Next, the output F̃ for X̃ can be calculated as

$$\tilde{F} = K_e \Xi\, \Theta^*.$$

Eventually, the label matrix Ỹ of X̃ is obtained by comparing each element of F̃ with 0. We will henceforth refer to our multi-label extension of MR as Multi-Label Manifold Regularization (ML-MR), and to our reliance weighting augmentation as ML-MR with Reliance Weighting (ML-MRRW).
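Putting the pieces together, the fit-and-predict procedure can be sketched compactly. The closed-form expression below follows the LapRLS pattern with the kernel replaced by KΞ; treat it as our reconstruction under stated assumptions, not the verbatim published formula, and all names and toy values as illustrative.

```python
import numpy as np

def ml_mrrw_fit_predict(K, L, Y, n_labeled, Xi, K_e,
                        gamma_A=1e-2, gamma_I=1e-2):
    """Sketch of an ML-MRRW-style closed-form solve and prediction.

    K  : (n, n) heat-kernel matrix over training data
    L  : (n, n) graph Laplacian
    Y  : (n, L) label matrix in {-1, +1}, zero rows for unlabeled instances
    Xi : (n,) reliance weights
    K_e: (e, n) kernel between out-of-sample and training points
    """
    n = K.shape[0]
    l = n_labeled
    J = np.diag([1.0] * l + [0.0] * (n - l))
    K_w = K @ np.diag(Xi)                      # weighted kernel K Xi
    A = J @ K_w + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K_w)
    Theta = np.linalg.solve(A, J @ Y)          # coefficient matrix
    F_e = K_e @ np.diag(Xi) @ Theta            # out-of-sample outputs
    return np.where(F_e > 0, 1, -1)            # threshold at zero
```

On a toy two-cluster problem with opposite label vectors per cluster, a test point near one cluster inherits that cluster's label vector.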
There are clearly many strategies for determining reliance weights. The simplest is to assign uniform weights, namely Ξ_i = ν_1 ∈ [0, 1] for 1 ≤ i ≤ l and Ξ_i = ν_2 ∈ [0, 1] for l < i ≤ l + u, for the labeled and unlabeled training instances, respectively. These two parameters then decide the balance of trust between labeled and unlabeled training data. The extended manifold regularization is supervised if ν_1 = 1 and ν_2 = 0, and unsupervised for ν_1 = 0 and ν_2 = 1. The relation ν_1 = ν_2 indicates that the impacts of D_l and D_u on label inference are equal, whereas ν_1 > ν_2 puts more weight on the labeled instances D_l than on the unlabeled instances D_u. In this work, we aim to improve the performance of manifold regularization by trusting labeled instances more; thus the choices of ν_1 and ν_2 must satisfy two criteria, namely ν_1 = 1 and ν_1 > ν_2 > 0.
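The uniform strategy can be written in a few lines (the function name and defaults are ours):

```python
import numpy as np

def uniform_reliance_weights(l, u, nu1=1.0, nu2=0.5):
    """Uniform reliance weights: nu1 for the l labeled instances and
    nu2 for the u unlabeled ones, with nu1 = 1 > nu2 > 0 as in the text."""
    return np.concatenate([np.full(l, nu1), np.full(u, nu2)])
```

The resulting vector is exactly the diagonal of the reliance weight matrix Ξ.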

Experimental design
This section presents experiments designed to validate the effectiveness of the proposed ML-MR and ML-MRRW methods on commonly used benchmark data sets. Other semi-supervised multi-label classification methods are tested for comparison, across a range of performance metrics.

Data sets
Four public data sets from different domains are chosen for the experimental study. Table 1 presents the basic information about these data sets. The first data set, "Emotions" [52], consists of sampled waveforms of sound clips generated from different genres of musical songs. Each instance is labeled with 6 emotions: amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-aggressive. The second data set, "Scene" [7], is a commonly used image data set with each image represented by a 294-dimension feature vector and labeled with six classes: beach, sunset, field, fall-foliage, mountain, and urban. The third data set, "Yeast" [19], consists of micro-array expression data and phylogenetic profiles for 2107 genes. Each gene is associated with a set of functional classes, which are grouped into 14 functional categories. The last data set, "Mediamill" [46], consists of digital video archives for the TREC Video Retrieval Evaluation (TRECVID) challenge. This data set contains 120 features and 101 annotation concepts. These data sets are already formatted, so no further pre-processing is needed.

Experiment setup
In each experiment, the data set is first partitioned into two parts: the training data and out-of-sample testing data occupy two thirds and one third of the whole data set, respectively. Then, the labels of a portion of the instances in the training data are omitted to construct labeled training data and unlabeled training data. The labeling rate η is drawn from {5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%}. For each labeling rate, experiments are conducted 100 times by randomly resampling the labeled training data, unlabeled training data, and out-of-sample testing data. The first three data sets "Emotions", "Scene", and "Yeast" are fully used in the experiments, whereas only a portion (10% randomly selected) of the "Mediamill" data is used in view of the computational complexity of MR.
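The resampling protocol can be sketched as follows; this is an illustrative reconstruction of the split described above, not the authors' code, and the function name and seed handling are our assumptions.

```python
import numpy as np

def semi_supervised_split(n, train_frac=2 / 3, labeling_rate=0.1, seed=0):
    """Index split used in the experiments: 2/3 training and 1/3 test,
    with only a fraction `labeling_rate` of the training indices labeled."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(n * train_frac))
    train, test = perm[:n_train], perm[n_train:]
    n_labeled = max(1, int(round(n_train * labeling_rate)))
    return train[:n_labeled], train[n_labeled:], test
```

Calling this 100 times with different seeds reproduces the repeated random resampling described above.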

Performance metrics
Many performance metrics or criteria for multi-label classification have been proposed; reviews may be found in [47] and [58]. In this work, three popular metrics are used to evaluate the performances of the algorithms in learning multi-label problems.
The average precision calculates, for each true label, the average fraction of labels ranked above it that are also true labels; the larger its value, the better the learning performance:

$$\text{A-precision} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left|\{y' \in Y_i : \operatorname{rank}_i(y') \le \operatorname{rank}_i(y)\}\right|}{\operatorname{rank}_i(y)},$$

where Y_i is the set of true labels of instance i, y is the chosen particular label, and rank_i(·) ranks the labels of instance i by their predicted scores. F1 is a popular measure for single-label classification. It is the harmonic mean of precision and recall:

$$F1 = \frac{2\,tp}{2\,tp + fp + fn},$$

where tp is the number of true positives, fp is the number of false positives, and fn is the number of false negatives. Macro-F1 and Micro-F1 are multi-label classifier metrics derived by computing the F1 measure across the label set: either after summing true positives, false positives, and false negatives across all labels, or by averaging the F1 measure over the labels:

$$F1_{\text{micro}} = \frac{2\sum_{\lambda=1}^{L} tp_\lambda}{2\sum_{\lambda=1}^{L} tp_\lambda + \sum_{\lambda=1}^{L} fp_\lambda + \sum_{\lambda=1}^{L} fn_\lambda}, \qquad F1_{\text{macro}} = \frac{1}{L}\sum_{\lambda=1}^{L} \frac{2\,tp_\lambda}{2\,tp_\lambda + fp_\lambda + fn_\lambda},$$

where tp_λ, fp_λ, and fn_λ are the numbers of true positives, false positives, and false negatives of label λ under the binary evaluation of F1. Larger values of F1_micro and F1_macro denote better performance.
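The Micro-F1 and Macro-F1 computations described above can be sketched as follows, assuming label matrices in {−1, +1} (the function name is ours):

```python
import numpy as np

def micro_macro_f1(Y_true, Y_pred):
    """Micro- and Macro-F1 for {-1, +1} label matrices (sketch).

    Micro-F1 pools tp/fp/fn over all labels before computing F1;
    Macro-F1 averages the per-label F1 scores."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0).astype(float)
    fp = ((Y_true == -1) & (Y_pred == 1)).sum(axis=0).astype(float)
    fn = ((Y_true == 1) & (Y_pred == -1)).sum(axis=0).astype(float)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        per_label = np.where(2 * tp + fp + fn > 0,
                             2 * tp / (2 * tp + fp + fn), 0.0)
    macro = per_label.mean()
    return micro, macro
```

Note that the two scores can differ substantially when labels are imbalanced, which is exactly why both are reported.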

Significance test
Statistical tests are commonly used to ensure that differences between machine-learning algorithms are meaningful [15,23,44]. In this paper, the Friedman test and a post hoc test are utilized. Friedman's test is a simple and robust nonparametric method for testing the differences between multiple algorithms over multiple data sets. For each data set separately, it ranks the algorithms based on their performance scores, with average ranks assigned to ties: the best-performing algorithm is assigned rank 1, the second best rank 2, and so on. Denote by R_i the sum of ranks for the ith algorithm (i = 1, 2, …, K) over N different data sets. Then, Friedman's statistic F_R [22,44] is given by

$$F_R = \frac{12}{N K (K+1)} \sum_{i=1}^{K} R_i^2 - 3N(K+1).$$

The null hypothesis H_0 is that there are no significant differences between the algorithms; the alternative hypothesis H_1 is that there are. F_R tests H_0 against H_1. For K larger than 5, the distribution of F_R can be approximated by a Chi-square distribution with K − 1 degrees of freedom. Thus, for any pre-chosen significance level α, the null hypothesis H_0 is rejected if F_R > χ²_α. In this paper, seven algorithms are applied to the first three data sets, so K − 1 = 6 and the critical Chi-square value is χ²_α = 12.592 for α = 0.05. Six algorithms are applied to the last data set, Mediamill, so K − 1 = 5 and the critical Chi-square value is χ²_α = 11.070 for α = 0.05.
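Friedman's statistic is straightforward to compute from the rank sums; a minimal sketch (function name is ours):

```python
import numpy as np

def friedman_statistic(rank_sums, n_datasets):
    """Friedman's chi-square statistic from rank sums R_i over N data sets:
    F_R = 12 / (N K (K + 1)) * sum(R_i^2) - 3 N (K + 1)."""
    R = np.asarray(rank_sums, dtype=float)
    K, N = len(R), n_datasets
    return 12.0 / (N * K * (K + 1)) * (R ** 2).sum() - 3.0 * N * (K + 1)
```

When all algorithms tie on every data set, each rank sum equals N(K + 1)/2 and the statistic is 0, as expected.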
When the null hypothesis is rejected, the analysis continues with a post hoc test [44]. Denote the difference $D_{ij} = R_i - R_j$ between the rank sums of algorithms i and j. The performance of two algorithms is significantly different if the difference $|D_{ij}|$ between their corresponding rank sums is no less than the critical difference

$$CD = z \sqrt{\frac{K(K+1)}{6}},$$

where z is the z-score from the standard normal curve corresponding to $\frac{\alpha}{K(K-1)}$, and α is the level of significance. It can be concluded that the performance of algorithm i is significantly better than that of algorithm j if $|D_{ij}| \ge CD$ and $D_{ij} < 0$; otherwise, worse, if $|D_{ij}| \ge CD$ and $D_{ij} > 0$.

Fig. 1 Performance metrics vs. labeling rates for seven classification algorithms applied to the "Emotions" data
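The critical difference can be computed with the standard library's normal quantile. The expression below is our reconstruction, chosen because it reproduces the CD = 9.2815 value quoted for K = 7 and α = 0.05; the function name is an assumption.

```python
import math
from statistics import NormalDist

def critical_difference(K, alpha=0.05):
    """Critical difference for the post hoc test on rank sums (sketch):
    CD = z * sqrt(K (K + 1) / 6), with z the upper-tail standard-normal
    quantile at alpha / (K (K - 1))."""
    z = NormalDist().inv_cdf(1.0 - alpha / (K * (K - 1)))
    return z * math.sqrt(K * (K + 1) / 6.0)
```

For K = 7 this yields approximately 9.28, matching the value used in the comparisons on the first three data sets.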

Experimental results and discussion
We compare the proposed ML-MR and ML-MRRW against four well-known semi-supervised, and one supervised, multi-label algorithms on the chosen data sets. When calculating the Friedman statistic and the post hoc statistic for each data set, the ten sampled data sets under each labeling rate (from 5 to 50%) are considered as different data sets.

Case I: Emotions
The experimental results for the "Emotions" data are shown in Fig. 1. The sub-figures from left to right present the A-precision (A-precision stands for average precision), Micro-F1, and Macro-F1 for all the algorithms under different labeling rates, respectively. The error bars indicate one standard deviation of the metrics. Table 2 presents the calculated Friedman statistics F_R based on ranking scores for the three different performance metrics; all of them are greater than the critical Chi-square value χ²_α = 12.592. Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the seven algorithms. Further, the post hoc test is carried out. The differences between the rank sums of ML-MRRW and the other algorithms are calculated and presented in Table 3. Denote MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, SSWL, ML-MR, and ML-MRRW by algorithms 1, 2, 3, 4, 5, 6, and 7, respectively. Then, D_7i, i = 1, 2, …, 6, represents the difference between the rank sums of ML-MRRW and the ith algorithm. The critical difference for K = 7 and α = 0.05 is CD = 9.2815. For each performance metric, any difference value |D_7i| ≥ CD indicates a significant difference between ML-MRRW and algorithm i with respect to this metric; further, |D_7i| ≥ CD and D_7i < 0 indicate that ML-MRRW outperforms algorithm i. Table 3 identifies these significant differences for each metric.

Table 5 Comparison with supervised multi-label ensemble algorithms in [37]; the values in the brackets denote the labeling rates of the data used by ML-MRRW.

In general, the following conclusions can be drawn from the plots and tables: 1. SSWL does not work well under low labeling rates; however, its performance improves substantially as the labeling rate increases, and it performs almost the same as MLkNN once the labeling rate is higher than 30%.
2. The other five semi-supervised multi-label learning algorithms show much better overall performance than the MLkNN and SSWL methods, except that ML-FSKSC has lower A-precision at large labeling rates.
Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and with the supervised multi-label ensemble algorithms in [37] on the "Emotions" data, in Tables 4 and 5, respectively. The performance metrics are the mean values of A-precision, Micro-F1, and Macro-F1. The second-to-last column presents the three metrics achieved by ML-MRRW under a labeling rate of 50% (also shown in Fig. 1). ML-MRRW under this labeling rate outperforms most algorithms in terms of A-precision, Micro-F1, and Macro-F1. It also outperforms some ensemble algorithms, including MLS_train, HOMER, AdaB.MH, TREMLC, and CBMLC, and does almost as well as the other ensemble methods in Table 5. The last column presents the metrics as the labeling rate increases to 70%; at this labeling rate, ML-MRRW outperforms all of the baselines in both Tables 4 and 5.
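For reference, the micro- and macro-averaged F1 scores reported throughout these case studies can be computed as in the minimal sketch below. The label matrices are made up for illustration, and scikit-learn's f1_score is used; A-precision is defined analogously over the ranked label lists.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical binary label matrices: rows = instances, columns = labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Micro-F1 pools true/false positives over all labels;
# Macro-F1 averages the per-label F1 scores, weighting rare labels equally.
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")  # 0.800 and 0.778
```

The two averages can diverge noticeably when label frequencies are imbalanced, which is why both are reported in the figures.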

Case II: Scene
The experimental results for the "Scene" data are shown in Fig. 2. Table 6 presents the Friedman's statistics F_R calculated from the ranking scores for the three performance metrics; all of them are greater than the critical Chi-square value χ²_α = 12.592. Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences among the performances of the seven algorithms. Further, the differences between the rank sums of ML-MRRW and the other algorithms are calculated and presented in Table 7. Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and with the supervised multi-label ensemble algorithms in [37] on the "Scene" data, in Tables 8 and 9, respectively. The second-to-last column presents the mean values of A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate of 50% (also shown in Fig. 2). ML-MRRW under this labeling rate outperforms several ensemble algorithms, including MLS_train, HOMER, AdaB.MH, and CBMLC, and does almost as well as the other ensemble methods in Table 9. The last column presents the metrics as the labeling rate increases to 90%; at this level, ML-MRRW outperforms all the baselines in both Tables 8 and 9.

Case III: Yeast
The experimental results for the "Yeast" data are shown in Fig. 3, which plots the performance metrics vs. labeling rates for the seven classification algorithms. The calculated Friedman's statistics F_R for the three performance metrics are all greater than the critical Chi-square value χ²_α = 12.592. Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences among the performances of the seven algorithms. Further, the differences between the rank sums of ML-MRRW and the other algorithms are calculated and presented in Table 11. In general, the following conclusions can be drawn from the plots and tables: 1. SSWL does not work well under low labeling rates, but its performance improves considerably as the labeling rate increases; furthermore, it outperforms the other methods in terms of Macro-F1 at labeling rates above 15%. Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and with the supervised multi-label ensemble algorithms in [37] on the "Yeast" data, in Tables 12 and 13, respectively. The second-to-last column presents the mean values of A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate of 50% (also shown in Fig. 3). From Table 12, ML-MRRW under this labeling rate outperforms all the algorithms in terms of A-precision, outperforms ML-C4.5, PCT, ML-KNN, RFML-C4.5, and RF-PCT in terms of Micro-F1, and outperforms all the algorithms except HOMER in terms of Macro-F1. It also outperforms some ensemble algorithms, including EBR, MLS_train, AdaB.MH, ELP, EPS, TREMLC, RF-PCT, and CBMLC, and does almost as well as the other ensemble methods in Table 13. The last column presents the metrics as the labeling rate increases to 75%; at this level, ML-MRRW outperforms all the baselines in both Tables 12 and 13.

Case IV: Mediamill
The experimental results for the "Mediamill" data are shown in Fig. 4. Table 14 presents the calculated Friedman's statistics F R for the three different performance metrics. It can be found that all of them are greater than the critical Chi-square value χ 2 α = 11.070. Thus, the null hypothesis is rejected, and it can be concluded that there are significant differences between the performances of the six algorithms.
Further, the differences between the rank sums of ML-MRRW and the other algorithms are calculated and presented in Table 15. Denote MLkNN, ML-GFHF, ML-LGC, ML-FSKSC, ML-MR, and ML-MRRW by algorithms 1 through 6, respectively. Then D_6i, i = 1, 2, ..., 5, represents the difference between the rank sums of ML-MRRW and the ith algorithm. The critical difference for K = 6 and α = 0.05 is CD = 7.7658. For each performance metric, any difference value |D_6i| ≥ CD indicates a significant difference between ML-MRRW and algorithm i with respect to that metric; furthermore, |D_6i| ≥ CD together with D_6i < 0 indicates that ML-MRRW outperforms algorithm i. Moreover, ML-MRRW is also compared with supervised multi-label algorithms from the state-of-the-art literature [31] and with the supervised multi-label ensemble algorithms in [37] on the "Mediamill" data, in Tables 16 and 17, respectively. Note that the experiments in the literature consider the whole Mediamill data set, as opposed to a randomly selected subset (redrawn for each experimental run) as in our work. The second-to-last column presents the mean values of A-precision, Micro-F1, and Macro-F1 for ML-MRRW under the labeling rate of 50% (also shown in Fig. 4). From Table 16, ML-MRRW under this labeling rate outperforms all algorithms in terms of the three metrics, except for RF-PCT in terms of A-precision. From Table 17, it is also superior to all the supervised ensemble algorithms in [37]. The last column presents the metrics as the labeling rate increases to 65%; at this level, ML-MRRW outperforms all the baselines in both Tables 16 and 17.
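The post hoc decision rule applied in all four case studies (|D| ≥ CD indicates significance, and a negative difference favors ML-MRRW) amounts to a simple comparison. In the sketch below, the rank-sum differences are hypothetical values chosen for illustration; only CD = 7.7658 is taken from the paper (K = 6, α = 0.05).

```python
# Hypothetical rank-sum differences between ML-MRRW and the five baselines.
CD = 7.7658
D = {"MLkNN": -18.0, "ML-GFHF": -9.5, "ML-LGC": -6.0,
     "ML-FSKSC": -11.2, "ML-MR": -4.1}

for name, d in D.items():
    significant = abs(d) >= CD      # |D_6i| >= CD: difference is significant
    wins = significant and d < 0    # D_6i < 0: ML-MRRW ranks better
    print(f"{name:9s} D = {d:+5.1f}  significant: {significant}  ML-MRRW better: {wins}")
```

Under these made-up differences, ML-MRRW would significantly outperform MLkNN, ML-GFHF, and ML-FSKSC, while the gaps to ML-LGC and ML-MR would fall below the critical difference.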

Conclusion
This paper studies the semi-supervised multi-label classification problem and extends graph-based manifold regularization to the multi-label case. The proposed method includes three essential components: the graph construction, the manifold regularization with multiple labels, and a reliance weighting strategy. This last component is intended to improve the learning ability by allowing instances with ground-truth labels to influence the model differently from instances with induced labels. The experiments show that the proposed ML-MRRW algorithm has overall better performance than all the other algorithms under different labeling rates. In addition, ML-MRRW performs better than ML-MR, indicating that the proposed reliance weighting strategy is effective in improving the learning performance of the ML-MR method. Further, unlike the other algorithms, ML-MRRW works consistently well on all the data sets. ML-MRRW is also compared with 12 supervised multi-label algorithms and 12 ensemble approaches from the literature on the public data sets; as evidenced by the results, ML-MRRW outperforms all of these supervised baselines. All in all, ML-MRRW is a promising semi-supervised multi-label classification algorithm.