Semi-supervised multi-label classification using an extended graph-based manifold regularization

Li, Ding; Dick, Scott

doi:10.1007/s40747-021-00611-7

Original Article
Open access
Published: 04 January 2022

Volume 8, pages 1561–1577, (2022)

Complex & Intelligent Systems

Abstract

Graph-based algorithms are known to be effective approaches to semi-supervised learning. However, there has been relatively little work on extending these algorithms to the multi-label classification case. We derive an extension of the Manifold Regularization algorithm to multi-label classification, which is significantly simpler than the general Vector Manifold Regularization approach. We then augment our algorithm with a weighting strategy to allow differential influence on a model between instances having ground-truth vs. induced labels. Experiments on four benchmark multi-label data sets show that the resulting algorithm performs better overall compared to the existing semi-supervised multi-label classification algorithms at various levels of label sparsity. Comparisons with state-of-the-art supervised multi-label approaches (which of course are fully labeled) also show that our algorithm outperforms all of them even with a substantial number of unlabeled examples.

Introduction

In many real-world applications, such as bioinformatics and video annotation, obtaining labeled data is sometimes very difficult, expensive and time-consuming. On the other hand, it may be simple and inexpensive to obtain unlabeled data. For instance, vast numbers of videos and images are available on the web. The large amount of unlabeled data can reveal useful information about the phenomena we are studying, e.g., estimating the distribution of the data as well as the data structure [68]. As a result, Semi-Supervised Learning (SSL) is drawing increasing interest in the machine-learning community [10].

Studies on SSL are extensive (e.g., [2, 4, 12, 13, 32, 45, 51, 62, 66]); detailed reviews may be found in [65] and [42]. The common purpose of semi-supervised algorithms is to exploit both labeled and unlabeled data to create classifiers superior to those trained on labeled data alone. According to [10], self-training (also known as self-learning or self-labeling) is among the earliest approaches that use unlabeled data in classification. The idea of self-training first appeared in [41]. In self-training, a classifier is first trained only with the labeled data, and then used to predict labels for some unlabeled data. Then, the classifier is re-trained with both the ground-truth and predicted labels, and used to predict additional labels. The process repeats until all examples are labeled. The authors in [42] use the expectation-maximization (EM) algorithm [14] for SSL. Co-Training [6] is a learning paradigm for problems in which strong structural prior knowledge is available, and is regarded as a variant of EM on the probabilistic model [10, 42]. It assumes that the features can be split into two complementary and independent subsets, each of which is sufficient to train a classifier for the data. Each classifier then uses its most confidently predicted points and their labels to teach the other classifier, and this exchange is iterated until some stopping criterion is met. Transductive learning is another approach, based on the idea of performing predictions only for test samples [10]; Transductive Support Vector Machines (TSVM) are one example [54]. Various extensions to the TSVM have been proposed [9, 11, 16, 60]; the common point is that the algorithms try to learn a hyperplane over the labeled data and the unlabeled data by optimizing a tradeoff between maximizing the margin over the labeled data and regularizing the decision boundary over low-density regions of all data samples.

Graph-based algorithms are an important sub-class of SSL that have recently attracted considerable attention [10, 48, 49]. Various graph-based SSL algorithms have been developed [3, 5, 25, 28, 53, 55, 56, 59, 64, 67] and a number of successful applications can be found in recent publications [1, 29, 30, 61]. Some popular graph-based algorithms include Local and Global Consistency [64], Gaussian Random Fields and Harmonic Functions [67], mincuts [5], greedy max-cut [55], and spectral graph transducers [28]. All the graph-based algorithms begin by constructing a graph with nodes representing data points, and edges representing similarity between the connected nodes. The labeled data points are then used to perform graph clustering or propagate labels from labeled points to unlabeled points, by minimizing the empirical cost over labeled data and regularizing the smoothness over the graph using all the data. Another representative SSL approach is manifold regularization [3], which assumes data points lie on a low-dimensional manifold in the input space [20, 35, 50].

At the same time, most of the above semi-supervised classification algorithms implicitly assume that class labels are mutually exclusive. However, in many application domains, such as image classification, bioinformatics and news categorization, each instance can represent more than one concept simultaneously; this is best represented as a vector of labels. Recognizing human emotions and sentiments is also increasingly treated as a multi-label classification problem, e.g., multiple fine-grained emotions may coexist in a single tweet of a microblog [21]. In addition, multi-label classifiers have recently been utilized for recognizing crop diseases in agriculture [27]. The learning algorithms for these problems are the “multi-label classifiers” reviewed in [47, 58]. For instance, a well-known multi-label classifier is Multi-Label k Nearest Neighbors (MLkNN) [57], which is an extension of the classical kNN method. References [31, 37], and [39] study a variety of supervised multi-label algorithms and present extensive experiments to compare their performance.

Our focus in the current paper is the intersection of these two problems, to wit, the design of semi-supervised multi-label classifiers. There is relatively little work in the literature on this sub-problem, and a particular dearth of graph-based semi-supervised algorithms for the multi-label case. Existing semi-supervised multi-label algorithms include the Multi-Label Gaussian Fields and Harmonic Functions (ML-GFHF) [56], the Multi-Label Local and Global Consistency (ML-LGC) [56], the Fixed-Size Multi-Label Regularized Kernel Spectral Clustering (ML-FSKSC) [33], and the Semi-Supervised Weak-Label approach (SSWL) [18]. In spite of these results, the opportunities in this area are extensive, and better methods are needed for semi-supervised multi-label classification in many tasks.

In our previous work [29], we found that a multi-label extension of the Manifold Regularization algorithm [3] was quite effective for non-intrusive load monitoring. In the current paper, we seek to improve upon that algorithm, and determine how well our results generalize beyond that domain. We investigate a multi-label extension of the Manifold Regularization (MR) algorithm, augmented with a reliance weighting strategy to further improve classification performance. Reliance weights allow learning algorithms to differentiate between ground-truth and induced labels in constructing a classifier for a given data set. They take the form of an additional matrix term in the kernel expansion of the Laplacian Regularized Least Squares model learned in MR [3]. We evaluate our proposed algorithm in comparison with five other multi-label algorithms (four semi-supervised algorithms plus MLkNN), on a set of four benchmark data sets.

The key contributions of this work are:

The manifold regularization algorithm is extended to learn multi-label classifiers.
A weighting strategy is proposed to vary the trust placed in labeled and unlabeled instances when forecasting labels for unseen points.
The proposed approach is compared against four semi-supervised, and one fully supervised, multi-label algorithms, and performs as well as or better than all of them.

The advantages of the proposed method are threefold: (1) The proposed method performs as well as or better than the existing semi-supervised multi-label algorithms on the four data sets examined in the fifth section. It furthermore outperforms the state-of-the-art supervised multi-label algorithms (which of course are trained on fully labeled data), even when a substantial portion of the training set is unlabeled. (2) The proposed method has a low model complexity, as Manifold Regularization [3] assumes data points lie on a low-dimensional manifold in the input space. (3) The proposed reliance weighting strategy allows an analyst to specify different trust levels for ground-truth and induced labels. The disadvantage of the method mainly lies in the computational time required for the construction of the graph structure; this is a common problem in this class of algorithms.

The remainder of this paper is organized as follows: the next section presents the preliminaries, including the basics and notation, regularization in a reproducing kernel Hilbert space, and manifold regularization. The third section presents the proposed approach, including graph construction, manifold regularization with multiple labels and our reliance weighting strategy. The fourth section describes the experimental design, including the data sets, experimental setup, performance metrics and statistical significance tests. The fifth section presents our experimental results and discussion, and we offer a summary and discussion of future work in the last section.

Preliminaries

This section presents the notations and basics that are used throughout the paper, and reviews the manifold regularization algorithm.

Basics and notations

In the framework of semi-supervised learning, the data set $\mathbb {D}$ in the training phase consists of two parts, namely $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$, where $\mathbb {D}_l$ and $\mathbb {D}_u$ indicate the labeled and unlabeled training data sets, respectively. Both $\mathbb {D}_l$ and $\mathbb {D}_u$ are drawn from the same distribution $p(\mathbf {x})$, where $\mathbf {x}$ indicates a feature variable. In the single-label case, the feature space and label space of a data set $\mathbb {D}$ are denoted by $\mathcal {X}=\mathbb {R}^d$ and $\mathcal {Y}=\{-1,1\}$, respectively. Then, the labeled and unlabeled training data sets are represented by $\mathbb {D}_l=\{(\mathbf {x}_i,y_i):\mathbf {x}_i\in \mathcal {X},y_i\in \mathcal {Y}, i=1,2,\ldots ,l\}$ and $\mathbb {D}_u=\{\mathbf {x}_i: \mathbf {x}_i\in \mathcal {X}, i=l+1,l+2,\ldots ,l+u\}$, where l and u indicate the numbers of labeled and unlabeled instances, respectively. Each instance is a feature vector $\mathbf {x}_i=[x_{i1},x_{i2},\ldots ,x_{id}]^T$ for $i=1,2,\ldots ,n$, where d indicates the feature dimension and $n=l+u$ is the total number of training instances in $\mathbb {D}$. The goal of semi-supervised learning with a single label is to infer the labels ${\tilde{Y}}=\{{\tilde{y}}_i\in \mathcal {Y},i=1,2,\ldots ,e\}$ for future instances $\mathbb {D}_e= \{{{\tilde{\mathbf{x}}}}_i \in \mathcal {X},i=1,2,\ldots ,e\}$ given the training data set $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$ [49, 68].

In the multi-label case, the label space of $\mathbb {D}$ is denoted by $\mathcal {Y}=\{-1,1\}^L$, where L indicates the number of labels. Analogously, the labeled training data set becomes $\mathbb {D}_l= \{(\mathbf {x}_i, \mathbf {y}_i): \mathbf {x}_i\in \mathcal {X},\mathbf {y}_i\in \mathcal {Y}, i=1,2,\ldots ,l\}$ and the label vector is $\mathbf {y}_i=[y_{i1},y_{i2},\ldots ,y_{iL}]^T$, whereas the other notations remain the same as the single label case. The goal of semi-supervised learning with multiple labels is to infer the labels $ {{\tilde{\mathbf{Y}}}}=\{{{\tilde{\mathbf{y}}}}_i\in \mathcal {Y},i=1,2,\ldots ,e\}$ for $\mathbb {D}_e=\{{{\tilde{\mathbf{x}}}}_i\in \mathcal {X},i=1,2,\ldots ,e\}$ given $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$.

In graph-based semi-supervised learning, a crucial step is to construct a graph $\mathcal {G}=(V,E)$ representing the connections between training instances $\mathbf {x}_i\in \mathcal {X}$ [49, 56, 68]. Specifically, $\mathcal {G}=(V,E)$ has n vertices, and each vertex $V_i$ represents an instance $\mathbf {x}_i,i=1,2,\ldots ,n$. $E_{ij}$ is an edge connecting vertices $V_i$ and $V_j$. There are three typical methods to construct such a graph: the k nearest neighbor algorithm, the $\varepsilon $ distance measure, and full connection. For example, using the k nearest neighbor algorithm, an edge $E_{ij}$ connects the vertices $V_i$ and $V_j$ if vertex $V_i$ is among the k nearest neighbors of vertex $V_j$, or vertex $V_j$ is among the k nearest neighbors of vertex $V_i$. A weight matrix $\mathbf {W}$ is defined over the graph $\mathcal {G}=(V,E)$, where $W_{ij}$ is the weight associated with edge $E_{ij}$, representing the similarity between vertices $V_i$ and $V_j$ (namely the training instances $\mathbf {x}_i$ and $\mathbf {x}_j$). Then, the unnormalized graph Laplacian is given by $\mathbf {L} = \mathbf {D}-\mathbf {W}$, where $\mathbf {D}$ is a diagonal matrix with $D_{ii}=\sum _{j=1}^{n} W_{ij}$.
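As a minimal sketch of the definitions in this paragraph, the following NumPy snippet builds the unnormalized Laplacian from a small symmetric weight matrix and checks the standard identity $\mathbf {f}^T\mathbf {L}\mathbf {f}=\frac{1}{2}\sum _{i,j}W_{ij}(f_i-f_j)^2$, which is why the Laplacian can serve as a smoothness penalty. The function name `graph_laplacian` and the toy weights are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # D_ii = sum_j W_ij
    return D - W

# Toy 4-vertex graph; the weights are made up for illustration.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.2],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.2, 0.8, 0.0]])
L = graph_laplacian(W)

# A labeling that only changes across weak edges incurs a small penalty.
f = np.array([1.0, 1.0, -1.0, -1.0])
quad = f @ L @ f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
print(quad, pairwise)  # the two quantities agree
```

Only the two weak edges (weights 0.1 and 0.2) connect vertices with different values of `f`, so they are the only contributions to the penalty.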

The label inference in graph-based SSL is usually based on two graph assumptions [56, 68]: (1) the prediction should be close to the given labels on labeled vertices; (2) the prediction should be smooth on the whole graph (i.e., vertices that are close in the graph tend to have the same labels). The label inference algorithms for graph-based SSL can be categorized into two major classes: transductive learning (e.g., the graph Laplacian regularization [64, 67]), and inductive learning (e.g., the manifold regularization [3]). Transductive learning infers labels only on the unlabeled training data and cannot make predictions on out-of-sample data. By contrast, inductive learning infers labels for the whole domain, i.e., a function $f:\mathcal {X}\rightarrow \mathcal {Y}$ is learned given $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$ and then the labels for $\mathbb {D}_e$ are predicted. The work in this paper is based on the manifold regularization [3], which is a typical inductive learning method [63]. The next subsection revisits regularization in a reproducing kernel Hilbert space, which is the core of manifold regularization.

Regularization in reproducing kernel Hilbert space

For a Mercer kernel $K:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}$, there exists an associated Reproducing Kernel Hilbert Space (RKHS) $\mathcal {H}_K$ of functions $\mathcal {X} \rightarrow \mathbb {R}$ with the norm $||\cdot ||_K$ [40]. Standard supervised learning estimates an unknown function $f\in \mathcal {H}_K$ from the labeled data set $\mathbb {D}_l$ as

$$\begin{aligned} f^*=\mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K }\frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,y_i,f)+\gamma _A ||f||_K^2, \end{aligned}$$

(1)

where $V(\mathbf {x}_i,y_i,f)$ is the loss function, such as the squared error loss $(y_i-f(\mathbf {x}_i))^2$ for regularized least squares (RLS). $||f||_K^2$ is a regularization term in the RKHS imposing the smoothness condition on possible solutions. $\gamma _A$ balances the tradeoff between the empirical cost and the regularization term. l is the number of labeled instances.
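For the squared-error loss, Eq. (1) has the well-known closed-form solution $f^*(\mathbf {x})=\sum _{i=1}^{l}\alpha _i K(\mathbf {x}_i,\mathbf {x})$ with $\varvec{\alpha }=(\mathbf {K}+\gamma _A l\,\mathbf {I})^{-1}\mathbf {y}$. The sketch below illustrates this under stated assumptions: a Gaussian kernel with bandwidth `sigma`, made-up toy data, and hypothetical helper names (`gaussian_kernel`, `fit_rls`, `predict_rls`) introduced only for this example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_rls(X, y, gamma_A=1e-3, sigma=1.0):
    """Solve (K + gamma_A * l * I) alpha = y, the RLS minimizer of Eq. (1)."""
    l = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma_A * l * np.eye(l), y)

def predict_rls(X_train, alpha, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy 1-D data with labels in {-1, +1}; values are made up for illustration.
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = fit_rls(X, y)
pred = np.sign(predict_rls(X, alpha, np.array([[-1.8], [1.8]])))
```

Increasing `gamma_A` shrinks the coefficients and yields a smoother function, which is the tradeoff that $\gamma _A$ controls in Eq. (1).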

The difference between semi-supervised and supervised learning lies in the utilization of the marginal distribution of $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$ to improve the learning performance, in addition to the empirical cost obtained over the labeled data set $\mathbb {D}_l$. According to the discussions in [3], there is an identifiable relation between the marginal distribution $p(\mathbf {x})$ and the conditional distribution $p(y|\mathbf {x})$, i.e., if two instances $\mathbf {x}_i,\mathbf {x}_j\in \mathcal {X}$ are close in the intrinsic geometry of $p(\mathbf {x})$, then their conditional distributions $p(y|\mathbf {x}_i)$ and $p(y|\mathbf {x}_j)$ are similar. Thus, another regularization term can be added to ensure that the solution is smooth with respect to the marginal distribution $p(\mathbf {x})$. Incorporating the smoothness penalty term with respect to the graph Laplacian $\mathbf {L}$, we derive the following optimization problem [3]:

$$\begin{aligned} f^* =\mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K }\frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,y_i,f)+\gamma _A ||f||_K^2+\frac{\gamma _I}{n^2}\mathbf {f}^T\mathbf {L}\mathbf {f}, \end{aligned}$$

(2)

where $\mathbf {f}=[f(\mathbf {x}_1),f(\mathbf {x}_2),\cdots ,f(\mathbf {x}_n)]^T$, and $\mathbf {f}^T\mathbf {L}\mathbf {f}$ is a penalty term that reflects the intrinsic structure of the probability distribution $p(\mathbf {x})$. $n=u+l$ is the total number of instances. The normalizing coefficient $\frac{1}{n^2}$ is the natural scale factor for the empirical estimate of the Laplace operator. The coefficients $\gamma _A$ and $\gamma _I$ control the complexity of the function in the ambient space and in the intrinsic geometry of $p(\mathbf {x})$, respectively. In real-world data sets, $p(\mathbf {x})$ is unknown, but an empirical estimate can be obtained from a sufficiently large amount of unlabeled data $\mathbb {D}_u$ by assuming the data set lies on a manifold in $\mathbb {R}^d$ and modeling the manifold with the adjacency graph $\mathcal {G}=(V,E)$ from the data set $\mathbb {D}$. According to the classical Representer Theorem [40], the solution to Eq. (2) in $\mathcal {H}_K$ is given by [3]

$$\begin{aligned} f^*(\mathbf {x})=\sum _{i=1}^{l+u}\theta _i K(\mathbf {x}_i,\mathbf {x}), \end{aligned}$$

(3)

which is an expansion of the Representer Theorem in terms of labeled data and unlabeled data $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$. Accordingly, the problem is essentially an optimization problem over the space of coefficients $\theta _i$.
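For the squared-error loss, substituting Eq. (3) into Eq. (2) reduces the problem to a linear system. The closed form reported in [3] can be restated in the present notation as follows, where $\mathbf {K}$ is the $n\times n$ kernel matrix with entries $K(\mathbf {x}_i,\mathbf {x}_j)$, $\mathbf {J}$ is the $n\times n$ diagonal matrix whose first l diagonal entries are 1 and whose remaining u entries are 0, and $\mathbf {y}=[y_1,\ldots ,y_l,0,\ldots ,0]^T$ is the label vector padded with zeros for the unlabeled instances (the symbols $\mathbf {J}$ and the padded $\mathbf {y}$ are introduced here only for this illustration):

$$\begin{aligned} \varvec{\theta }^* = \left( \mathbf {J}\mathbf {K} + \gamma _A l\, \mathbf {I} + \frac{\gamma _I l}{n^2}\mathbf {L}\mathbf {K} \right) ^{-1}\mathbf {y}. \end{aligned}$$

Setting $\gamma _I=0$ makes the coefficients of the unlabeled instances vanish, and the solution falls back to the supervised RLS estimate of Eq. (1).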

The RKHS has been extended to vector-valued functions [8] to formulate the vector-valued manifold regularization [35]. Let $\mathbf {F} = (f_1(\mathbf {x}_1),\cdots ,f_n(\mathbf {x}_n)) \in \mathcal {Y}^n$ be components of a vector-valued function where each $f_i \in H_K$ [35]. Here $\mathcal {Y}$ can be $\mathbb {R}$ for the single label case or $\mathbb {R}^L$ for multi-label case. The optimization problem of the vector-valued manifold regularization is given by Ref. [35]

$$\begin{aligned} f^* = \mathop {\mathrm{arg\,min}}\limits _{f\in \mathcal {H}_K } \frac{1}{l}\sum _{i=1}^{l}V(\mathbf {x}_i,\mathbf {y}_i,f) + \gamma _A ||f||_K^2 + \gamma _I \langle \mathbf {F},M\mathbf {F}\rangle _{\mathcal {Y}^n}, \end{aligned}$$

(4)

where the matrix M is a symmetric, positive operator, that is, $\langle y,My\rangle _{\mathcal {Y}^n}\ge 0$ for all $y\in \mathcal {Y}^n$. $\mathcal {Y}^n$ is the n-direct product of $\mathcal {Y}$, with the inner product

$$\begin{aligned} \langle (y_1,\ldots ,y_n),(w_1,\ldots ,w_n)\rangle _{\mathcal {Y}^n} = \sum _{i=1}^n\langle y_i,w_i\rangle _{\mathcal {Y}}. \end{aligned}$$

It has been proved in [35] that the minimization problem in (4) has a unique solution taking the form $f^*(\mathbf {x})=\sum _{i=1}^{l+u}K(\mathbf {x}_i,\mathbf {x})\varvec{\Theta }_i$ for some vectors $\varvec{\Theta }_i\in \mathcal {Y}, 1\le i \le n$. The vector-valued manifold regularization is a generalized form of manifold regularization, and can be used for single label, multi-label, and multi-view learning [35, 36].

The Representer Theorem in the vector-valued RKHS is given and proved in [35]. Let $\mathcal {H}_{K,\mathbf {x}} = \{\sum _{i=1}^{u+l}K(\mathbf {x}_i,\mathbf {x})y_i, \mathbf {y}\in \mathcal {Y}^{u+l}\}$. For $f\in \mathcal {H}_{K,\mathbf {x}}^\bot $, the sampling operator $S_{\mathbf {x}}$ satisfies $\langle S_\mathbf {x}f, \mathbf {y}\rangle _{\mathcal {Y}^{u+l}} = \langle f,\sum _{i=1}^{u+l}K(\mathbf {x}_i,\mathbf {x})y_i\rangle _{\mathcal {H}_K}=0$. This holds for all $\mathbf {y} \in \mathcal {Y}^{u+l}$ and yields $S_\mathbf {x}f=(f(\mathbf {x}_1),\ldots , f(\mathbf {x}_{u+l}))=0$. Denote the objective on the right-hand side of (4) by I(f). An arbitrary $f\in \mathcal {H}_K$ can be decomposed orthogonally as $f=f_0+f_1$, with $f_0\in \mathcal {H}_{K,\mathbf {x}}$ and $f_1 \in \mathcal {H}_{K,\mathbf {x}}^\bot $. Because $S_\mathbf {x}f_1=0$, the loss and the intrinsic regularization term depend only on $f_0$, while $||f_0+f_1||_{\mathcal {H}_K}^2=||f_0||_{\mathcal {H}_K}^2+||f_1||_{\mathcal {H}_K}^2$. This results in $I(f)=I(f_0+f_1)\ge I(f_0)$ with equality if and only if $||f_1||_{\mathcal {H}_K}=0$. As a result, the minimizer of (4) must lie in $\mathcal {H}_{K,\mathbf {x}}$.

The proposed method

The work in [3] originally proposed manifold regularization, and showed that the minimizer of the Laplacian RLS problem admits the Representer Theorem expansion in the univariate case; further, reference [35] proved the Representer Theorem for the general case of vector-valued manifold regularization. Following these two fundamental theoretical works, this work on multi-label manifold regularization is essentially an important special case of the theorem in [35]. In the existing literature, there is no study of this special case; in particular, no simpler proof has been advanced that the kernel expansion in Eq. (3) remains a solution to the Laplacian RLS minimization. We are following a long tradition in mathematics where simpler proofs for interesting special cases remain valuable, even if the general case has been proven. For instance, Dirichlet’s theorem was first proved in [17] in the 19th century. Nonetheless, studies of special cases of Dirichlet’s theorem, especially those having elementary proofs (e.g., [24, 38, 43]), continue to this day [34]. Analogously, studying the multi-label classification case of MR also seems an interesting and novel contribution. We also introduce the reliance weighting strategy, and prove that our modified algorithm remains a solution to the Laplacian RLS problem. The major challenges include: (1) formulating the optimization problem of manifold regularization with multiple labels, given that the data structure differs from that of single-label data; (2) solving the optimization problem so as to guarantee that a unique global solution exists; and (3) deriving the solution when a reliance weight matrix is included.

Graph construction

Given the whole data set $\mathbb {D}=\mathbb {D}_l\cup \mathbb {D}_u$, a full $n\times n$ distance matrix $\mathbf {U}$ is calculated between each pair of instances $\mathbf {x}_i, \mathbf {x}_j\in \mathcal {X}$ based on a Gaussian kernel $K(\mathbf {x}_i, \mathbf {x}_j)$ as

$$\begin{aligned} U_{ij} = K(\mathbf {x}_i, \mathbf {x}_j) = \exp \left( -\frac{|| \mathbf {x}_i -\mathbf {x}_j ||^2}{2\sigma ^2} \right) , \end{aligned}$$

(5)

where $\sigma $ denotes the bandwidth of the Gaussian kernel. Equivalently, an alternative distance matrix $\mathbf {H}$ can be calculated with each element $H_{ij}$ given by Refs. [26, 55]

$$\begin{aligned} H_{ij}=\sqrt{U_{ii}+U_{jj}-2U_{ij}}. \end{aligned}$$

(6)
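Equations (5) and (6) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the helper names `gaussian_kernel_matrix` and `kernel_distance_matrix`, the bandwidth value, and the toy instances are all hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """U_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), following Eq. (5)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def kernel_distance_matrix(U):
    """H_ij = sqrt(U_ii + U_jj - 2 U_ij), following Eq. (6)."""
    diag = np.diag(U)
    H2 = diag[:, None] + diag[None, :] - 2.0 * U
    return np.sqrt(np.maximum(H2, 0.0))  # clip tiny negatives caused by round-off

# Three toy instances in the plane; values are made up for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
U = gaussian_kernel_matrix(X, sigma=1.0)
H = kernel_distance_matrix(U)
```

Because $U_{ii}=1$ for a Gaussian kernel, Eq. (6) simplifies to $H_{ij}=\sqrt{2-2U_{ij}}$, so $H_{ij}$ grows monotonically as the similarity $U_{ij}$ decreases.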

The constructed graph $\mathcal {G}=(V,E)$ is a fully connected graph with each edge $E_{ij}$ weighted by $H_{ij}$. According to [26, 55], graph sparsification can improve the efficiency of label inference. Edges are removed producing an $n\times n$ binary matrix $\mathbf {B}$ with 1’s and 0’s representing the presence and absence of connections, respectively. Three sparsification approaches can be used, including the $\varepsilon $-neighbor search, k-nearest neighbor search, and the b-matching [26, 55]:

1.
The $\varepsilon $-neighbor search recovers a binary matrix $\mathbf {B}$ as
$$\begin{aligned} B_{ij} = \left\{ \begin{array}{ccl} 1 &{} \text{ if } &{} 1-H_{ij}\le \varepsilon \\ 0 &{} \text{ if } &{} 1-H_{ij}> \varepsilon \text{ or } i=j \end{array}\right. . \end{aligned}$$
(7)
2.
The k-nearest neighbor search obtains the binary matrix $\mathbf {B}$ by minimizing the following optimization problem:
$$\begin{aligned} \begin{aligned}&\min _{\mathbf {B}\in \{0,1\}^{n\times n}} \sum _{i=1}^{n}\sum _{j=1}^{n} B_{ij} H_{ij} \\&\text {s.t. } \sum _{j=1}^{n}B_{ij}=k,B_{ii}=0,\forall i,j=1,\ldots ,n. \end{aligned} \end{aligned}$$
(8)
3.
Using the b-matching algorithm, the optimization problem to recover $\mathbf {B}$ is
$$\begin{aligned} \begin{aligned}&\min _{\mathbf {B}\in \{0,1\}^{n\times n}} \sum _{i=1}^{n}\sum _{j=1}^{n} B_{ij} H_{ij} \\&\text {s.t. } \sum _{j=1}^{n}B_{ij}=b,B_{ii}=0,B_{ij}=B_{ji},\forall i,j=1,\ldots ,n. \end{aligned} \end{aligned}$$
(9)

The binary matrix $\mathbf {B}$ obtained using the k-nearest neighbor search is generally not symmetric; thus the final $\mathbf {B}$ is calculated as $B_{ij}=\max (B_{ij},B_{ji})$. By contrast, the b-matching algorithm produces a symmetric matrix, $\mathbf {B}=\mathbf {B}^T$, in which every node has the same number of neighbors. Whichever of the above methods is applied, the weight for edge $E_{ij}$ is set to 0 if $B_{ij}=0$. For an edge $E_{ij}$ with $B_{ij}=1$, the weight $W_{ij}$ can be calculated with respect to the distance matrix $\mathbf {H}$ and expressed as

$$\begin{aligned} W_{ij}=H_{ij}B_{ij}. \end{aligned}$$

(10)

The final graph $\mathcal {G}=(V,E)$ is then constructed and represented by a sparse weight matrix $\mathbf {W}$. Proceeding to label inference, the graph Laplacian is calculated as $\mathbf {L} = \mathbf {D}-\mathbf {W}$, where $\mathbf {D}$ is a diagonal matrix with $D_{ii}=\sum _{j=1}^{n} W_{ij}$ and $D_{ij}=0$ for $i\ne j$.
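The k-nearest neighbor route through this subsection (Eq. (8), symmetrization, Eq. (10), and the Laplacian) can be sketched as below. This is an illustrative reading of the text rather than the authors' implementation: the function names and the toy distance matrix are assumptions, and ties between equal distances are broken by index order.

```python
import numpy as np

def knn_sparsify(H, k):
    """Binary matrix B: each row keeps its k smallest off-diagonal distances
    (Eq. (8)); B is then symmetrized with B_ij = max(B_ij, B_ji)."""
    n = H.shape[0]
    H_off = H.astype(float).copy()
    np.fill_diagonal(H_off, np.inf)             # enforce B_ii = 0
    nearest = np.argsort(H_off, axis=1)[:, :k]  # k nearest neighbors of each row
    B = np.zeros((n, n), dtype=int)
    B[np.arange(n)[:, None], nearest] = 1
    return np.maximum(B, B.T)

def laplacian_from_distances(H, k):
    """W_ij = H_ij * B_ij (Eq. (10)) and L = D - W with D_ii = sum_j W_ij."""
    B = knn_sparsify(H, k)
    W = H * B
    L = np.diag(W.sum(axis=1)) - W
    return B, W, L

# Toy symmetric distance matrix; values are made up for illustration.
H = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])
B, W, L = laplacian_from_distances(H, k=1)
```

With `k=1`, vertices 0 and 1 choose each other, as do vertices 2 and 3, so the sparsified graph has two edges. The exhaustive `argsort` costs $O(n^2\log n)$, which reflects the graph-construction cost noted in the Introduction.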

Manifold regularization with multiple labels

In this subsection, we extend the manifold regularization in [3] to solve multi-label learning problems. Let $\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\ldots , \mathbf {x}_n]^T$ and $\mathbf {Y}=[\mathbf {y}_1,\mathbf {y}_2,\ldots ,\mathbf {y}_n]^T$ denote the matrices of all feature vectors and label vectors, respectively. In $\mathbf {Y}$, the elements of $\mathbf {y}_i$ take the value 1 or $-1$ for $i\le l$, and $\mathbf {y}_i$ is an all-zero vector for $l<i\le n$. In the framework of the Laplacian Regularized Least Squares (LapRLS) [3], the optimization problem of manifold regularization with multiple labels is

$$\begin{aligned} f^* = \mathop {\mathrm{arg\,min}}\limits _{f_j\in \mathcal {H}_K,j=1,\ldots ,L } \frac{1}{l} {\text {tr}} \left( (\varvec{\Psi } \mathbf {F}-\mathbf {Y})^T (\varvec{\Psi } \mathbf {F}-\mathbf {Y}) \right) + \gamma _A ||f||_K^2 + \frac{\gamma _I}{n^2} {\text {tr}} \left( \mathbf {F}^T\mathbf {L}\mathbf {F} \right) , \end{aligned}$$

(11)

where $\mathbf {F}=[f_j(\mathbf {x}_i)]_{n\times L}, i=1,\ldots ,n, j=1,\ldots , L$ is a matrix representing the predicted outputs, ${\text {tr}}(\cdot )$ denotes the trace of a matrix, and $\varvec{\Psi }$ is an $n\times n$ diagonal matrix with the diagonal elements given by

$$\begin{aligned} \Psi _{ii}=\left\{ \begin{array}{ccl} 1 &{} \text{ for } &{} i \le l, \\ 0 &{} \text{ for } &{} l < i \le n. \end{array}\right. \end{aligned}$$

(12)

The second term $||f||_K^2 = \sum _{j=1}^{L}||f_j||_K^2$ in Eq. (11) measures the complexity of $\mathbf {F}$ in the ambient space. The third term represents the intrinsic smoothness with respect to the geometric distribution. $\mathbf {L}$ is the graph Laplacian obtained in the graph construction phase. The optimization problem in (11) is essentially one natural extension of the LapRLS for multi-label cases as indicated in [35].
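To make the link with the single-label problem explicit, the two trace terms of Eq. (11) can be expanded label by label. Writing $\mathbf {f}_j=[f_j(\mathbf {x}_1),\ldots ,f_j(\mathbf {x}_n)]^T$ for the jth column of $\mathbf {F}$ (a symbol introduced here for illustration), and using the facts that $\varvec{\Psi }$ zeroes the rows of the unlabeled instances and that the corresponding rows of $\mathbf {Y}$ are zero, we obtain

$$\begin{aligned} \frac{1}{l}{\text {tr}}\left( (\varvec{\Psi }\mathbf {F}-\mathbf {Y})^T(\varvec{\Psi }\mathbf {F}-\mathbf {Y})\right)&= \sum _{j=1}^{L}\frac{1}{l}\sum _{i=1}^{l}\left( f_j(\mathbf {x}_i)-y_{ij}\right) ^2, \\ \frac{\gamma _I}{n^2}{\text {tr}}\left( \mathbf {F}^T\mathbf {L}\mathbf {F}\right)&= \sum _{j=1}^{L}\frac{\gamma _I}{n^2}\mathbf {f}_j^T\mathbf {L}\mathbf {f}_j. \end{aligned}$$

Together with $||f||_K^2 = \sum _{j=1}^{L}||f_j||_K^2$, this shows that the objective in Eq. (11) is a sum of L objectives of the form of Eq. (2) with the squared-error loss, one per label, all sharing the same graph Laplacian and kernel.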

The minimization problem in Eq. (11) is guaranteed to have a unique global solution. The form of this solution is stated and proved in the following theorem.

Theorem 1

The minimizer of the optimization problem in Eq. (11) admits an expansion

$$\begin{aligned} f_j^*(\mathbf {x}) = \sum _{i=1}^{n} \Theta _{ij}K(\mathbf {x}_i,\mathbf {x}), j=1,2,\ldots , L \end{aligned}$$

(13)

in terms of the labeled and unlabeled instances; $K(\cdot ,\cdot )$ represents the kernel function, which must be positive semi-definite.

Proof

In the multi-label classification problem (11), the norm of the function f can be represented by the sum of each function $f_j$ in the Reproducing Kernel Hilbert Space $\mathcal {H}_K$, i.e., $||f||_K^2 = \sum _{j=1}^{L}||f_j||_K^2$.

Any function in the RKHS $\mathcal {H}_K$ can be decomposed into two orthogonal components; specifically, each $f_j$ can be decomposed into a function $f_j^0$ in the linear subspace spanned by $\{ K(\mathbf {x}_i,\cdot )\}_{i=1}^{n}$ and a function $f_j^1$ orthogonal to that subspace [3]. Accordingly, $f_j$ can be represented by

$$\begin{aligned} f_j = f_j^0 + f_j^1 = \sum _{i=1}^{n} \Theta _{ij}K(\mathbf {x}_i,\cdot ) + f_j^1. \end{aligned}$$

Since $||f_j||_K^2=||f_j^0||_K^2+||f_j^1||_K^2\ge ||f_j^0||_K^2$, we have

$$\begin{aligned} ||f||_K^2&= \sum _{j=1}^{L}||f_j||_K^2 = \sum _{j=1}^{L}||f_j^0||_K^2\\&\quad + \sum _{j=1}^{L}||f_j^1||_K^2\ge \sum _{j=1}^{L}||f_j^0||_K^2 \end{aligned}$$

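Moreover, by the reproducing property, $f_j(\mathbf {x}_i)=\langle f_j,K(\mathbf {x}_i,\cdot )\rangle _K=f_j^0(\mathbf {x}_i)$ for $i=1,\ldots ,n$, so the first and third terms of Eq. (11) depend only on $f_j^0$. The equality above is achieved if and only if $||f_j^1||_K^2=0$, $j=1,2,\ldots , L$. Therefore the minimizer must be $f_j^*(\mathbf {x}) = \sum _{i=1}^{n} \Theta _{ij}K(\mathbf {x}_i,\mathbf {x})$, $j=1,2,\ldots , L$. $\square $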

Let $\mathbf {K}$ denote the $n\times n$ kernel matrix evaluated on all the data samples $\mathbf {X}$, with entries $K_{ij}=K(\mathbf {x}_i,\mathbf {x}_j)$, and let $\varvec{\Theta }$ denote the $n\times L$ matrix of coefficients. The solution evaluated on the training instances can be represented by

$$\begin{aligned} \mathbf {F}^* = \mathbf {K}\varvec{\Theta }. \end{aligned}$$

(14)

Therefore, the problem in Eq. (11) is reduced to optimizing over the finite dimensional space of coefficients $\varvec{\Theta }$. According to [3], the kernel function $K(\cdot ,\cdot )$ must be positive semi-definite which gives rise to an RKHS. A choice of the kernel function is the heat kernel, which can be approximated using a sharp Gaussian kernel. Thus, $\mathbf {U}$ in Eq. (5) can be taken as the kernel matrix $\mathbf {K}$.
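For the unweighted case, the finite-dimensional problem can be solved with one linear system. The sketch below assumes $\mathbf {F}=\mathbf {K}\varvec{\Theta }$ and $||f||_K^2={\text {tr}}(\varvec{\Theta }^T\mathbf {K}\varvec{\Theta })$, sets the gradient of Eq. (11) to zero, and cancels the common left factor $\mathbf {K}$, which gives the sufficient condition $(\varvec{\Psi }\mathbf {K}+\gamma _A l\,\mathbf {I}+\frac{\gamma _I l}{n^2}\mathbf {L}\mathbf {K})\varvec{\Theta }=\mathbf {Y}$. This mirrors the single-label closed form in [3] applied to every label column; it does not include the reliance weights, and the function name, hyperparameter values, fully connected toy graph, and data are illustrative assumptions.

```python
import numpy as np

def fit_multilabel_laprls(K, Lap, Y, l, gamma_A=1e-3, gamma_I=1e-2):
    """Solve (Psi K + gamma_A l I + gamma_I l / n^2 Lap K) Theta = Y.

    K:   n x n kernel matrix
    Lap: n x n graph Laplacian
    Y:   n x (number of labels) matrix whose rows l..n-1 are all zero
    l:   number of labeled instances (assumed to be the first l rows)
    """
    n = K.shape[0]
    psi = np.zeros(n)
    psi[:l] = 1.0  # diagonal of Psi, Eq. (12)
    A = psi[:, None] * K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (Lap @ K)
    return np.linalg.solve(A, Y)

# Toy problem: two clusters on a line, one labeled point per cluster, two labels.
X = np.array([[-2.0], [2.0], [-1.8], [-2.2], [1.8], [2.2]])  # first l = 2 rows labeled
Y = np.zeros((6, 2))
Y[0] = [1.0, -1.0]
Y[1] = [-1.0, 1.0]

K = np.exp(-((X - X.T) ** 2) / 2.0)  # Gaussian kernel with sigma = 1
W = K - np.eye(6)                    # fully connected graph without self-loops
Lap = np.diag(W.sum(axis=1)) - W

Theta = fit_multilabel_laprls(K, Lap, Y, l=2)
F = K @ Theta                        # Eq. (14)
labels = np.where(F >= 0, 1, -1)     # threshold the scores at zero
```

In this toy run the four unlabeled points inherit the label vector of the labeled point in their own cluster, which is the behavior the smoothness assumption is meant to produce.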

Reliance weighted kernel for performance improvement

In the framework of manifold regularization, the classifier is trained using both the labeled training set $\mathbb {D}_l$ and the unlabeled training set $\mathbb {D}_u$. Although both $\mathbb {D}_l$ and $\mathbb {D}_u$ contribute to the classification, the prediction of the label vector ${{\tilde{\mathbf{y}}}}$ of an unseen future sample ${{\tilde{\mathbf{x}}}}$ is based on the label information provided by the labeled training set $\mathbb {D}_l$. This motivates us to place more trust in the labeled training set than in the unlabeled one for out-of-sample prediction. Thus, a reliance weighting strategy is proposed that assigns different weights to the training instances, allowing samples from $\mathbb {D}_l$ to have greater influence than those from $\mathbb {D}_u$. Given a heat kernel function $K(\mathbf {x}_i,\mathbf {x})$, the weighted kernel function for $\mathbf {x}$ is

$$\begin{aligned} {\tilde{K}}(\mathbf {x}_i,\mathbf {x}) = K(\mathbf {x}_i,\mathbf {x}) \cdot \varXi _{i}, \end{aligned}$$

(15)

where $\varXi _{i}$ represents the reliance weight of the ith instance. Let ${{\tilde{\mathbf{K}}}}$ denote the weighted kernel matrix evaluated on all the data samples $\mathbf {X}$, and define the reliance weight matrix $\varvec{\varXi }$ as

$$\begin{aligned} \varvec{\varXi } = \left[ \begin{array}{cccc} \varXi _{1} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \varXi _{2} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \varXi _{n}\\ \end{array} \right] \end{aligned}$$

(16)

Then, the weighted kernel matrix is ${{\tilde{\mathbf{K}}}}=\mathbf {K}\varvec{\varXi }$. For the minimizer in (13) to remain valid, the kernel function ${\tilde{K}}(\cdot ,\cdot )$ must be positive semi-definite.
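The product ${{\tilde{\mathbf{K}}}}=\mathbf {K}\varvec{\varXi }$ can be sketched numerically as follows. Right-multiplying by a diagonal matrix scales column $j$ of $\mathbf {K}$ by $\varXi _{j}$, so the full diagonal matrix never has to be formed. The weight values, the toy data, and the unit-bandwidth kernel are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
X = rng.normal(size=(n, 2))

# Gaussian kernel matrix on toy data (bandwidth chosen arbitrarily).
diff = X[:, None, :] - X[None, :, :]
K = np.exp(-np.sum(diff**2, axis=-1) / 2.0)

# Hypothetical reliance weights: two "labeled" instances, three "unlabeled".
xi = np.array([1.0, 1.0, 0.3, 0.3, 0.3])

K_tilde = K @ np.diag(xi)        # explicit form with the diagonal matrix of Eq. (16)
K_tilde_fast = K * xi[None, :]   # same result via broadcasting over columns
```

The broadcasting form costs $O(n^2)$ instead of the $O(n^3)$ dense product, which matters once $n$ includes many unlabeled instances.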

Proposition 1

Given a heat kernel function $K(\cdot ,\cdot )$, the weighted kernel ${\tilde{K}}(\cdot ,\cdot )=K(\cdot ,\cdot ) \cdot \varXi _{i}$ is positive semi-definite if and only if $\varXi _{i}\ge 0$.

Proof

Given an arbitrary vector $\mathbf {v}\in \mathbb {R}^n$, we have

$$\begin{aligned} \mathbf {v}^T{{\tilde{\mathbf{K}}}}\mathbf {v} = \sum _{i=1}^{n}\sum _{j=1}^{n} K(\mathbf {x}_i,\mathbf {x}_j) \cdot \varXi _{i} \cdot v_i v_j. \end{aligned}$$

(17)

where $v_i$ and $v_j$ are the $i$th and $j$th elements of $\mathbf {v}$. The kernel estimation based on a heat kernel function is always nonnegative, namely $K(\mathbf {x}_i,\mathbf {x}_j)\ge 0$. Therefore, $K(\mathbf {x}_i,\mathbf {x}_j) \cdot \varXi _{i}\ge 0$ if and only if $\varXi _{i}\ge 0$. Accordingly, $\mathbf {v}^T{{\tilde{\mathbf{K}}}}\mathbf {v} \ge 0$ if and only if $\varXi _{i}\ge 0$. In conclusion, the weighted kernel ${\tilde{K}}(\cdot ,\cdot )=K(\cdot ,\cdot ) \cdot \varXi _{i}$ is positive semi-definite if and only if $\varXi _{i}\ge 0$. $\square $

Using the reliance weighted kernel function instead of the heat kernel function, the solution in (14) becomes

$$\begin{aligned} \mathbf {F}^*= {{\tilde{\mathbf{K}}}}\varvec{\Theta } = \mathbf {K}\varvec{\varXi }\varvec{\Theta }. \end{aligned}$$

(18)

The coefficient matrix $\varvec{\Theta }^*$ can be estimated by differentiating the right-hand side of (11) with respect to $\varvec{\Theta }$ and setting the result to zero:

$$\begin{aligned}&\frac{2}{l}\varvec{\Psi } \mathbf {K}\varvec{\varXi }(\varvec{\Psi } \mathbf {K}\varvec{\varXi } \varvec{\Theta }^*-\mathbf {Y}) + 2\gamma _A \mathbf {K}\varvec{\varXi } \varvec{\Theta }^*\\&\qquad + \frac{2\gamma _I}{n^2} (\mathbf {K}\varvec{\varXi })^T \mathbf {L} \mathbf {K}\varvec{\varXi } \varvec{\Theta }^* = 0 \end{aligned}$$

The coefficient matrix is then obtained as

$$\begin{aligned} \varvec{\Theta }^* = \left( \varvec{\Psi }\mathbf {K}\varvec{\varXi } + l\gamma _A\mathbf {I} + \frac{l\gamma _I}{n^2} \mathbf {L} \mathbf {K} \varvec{\varXi } \right) ^{-1} \mathbf {Y}. \end{aligned}$$

(19)

where $\mathbf {I}$ is the $n\times n$ identity matrix.
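A minimal numerical sketch of Eq. (19) is given below. Several ingredients are defined outside this section and are therefore assumptions here: $\varvec{\Psi }$ is taken to be the $n\times n$ diagonal selector with ones for the $l$ labeled instances (as in standard Laplacian regularized least squares), $\mathbf {L}$ is taken to be the unnormalized graph Laplacian $\mathbf {D}-\mathbf {U}$ with $\mathbf {U}=\mathbf {K}$, and $\mathbf {Y}$ is an $n\times L$ matrix with $\pm 1$ entries in labeled rows and zeros in unlabeled rows. The function names and hyperparameter values are hypothetical.

```python
import numpy as np

def system_matrix(K, xi, n_labeled, gamma_A, gamma_I):
    """Matrix inside the inverse of Eq. (19), under the stated assumptions."""
    n = K.shape[0]
    l = n_labeled
    K_w = K * xi[None, :]                             # K Xi (column scaling)
    Psi = np.diag((np.arange(n) < l).astype(float))   # assumed labeled-row selector
    Lap = np.diag(K.sum(axis=1)) - K                  # assumed Laplacian D - U, U = K
    return Psi @ K_w + l * gamma_A * np.eye(n) + (l * gamma_I / n**2) * (Lap @ K_w)

def fit_theta(K, xi, Y, n_labeled, gamma_A, gamma_I):
    """Solve the linear system of Eq. (19) without forming an explicit inverse."""
    A = system_matrix(K, xi, n_labeled, gamma_A, gamma_I)
    return np.linalg.solve(A, Y)

rng = np.random.default_rng(0)
n, l, n_labels = 10, 4, 3
X = rng.normal(size=(n, 2))
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq / 2.0)                                 # toy Gaussian kernel matrix

xi = np.concatenate([np.ones(l), np.full(n - l, 0.5)])   # hypothetical weights
Y = np.zeros((n, n_labels))
Y[:l] = rng.choice([-1.0, 1.0], size=(l, n_labels))      # labels only for first l rows

Theta = fit_theta(K, xi, Y, l, gamma_A=0.1, gamma_I=0.1)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience; it solves for all $L$ label columns of $\varvec{\Theta }^*$ in one call.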

For unseen samples ${{\tilde{\mathbf{X}}}}=[{{\tilde{\mathbf{x}}}}_1,{{\tilde{\mathbf{x}}}}_2,\ldots , {{\tilde{\mathbf{x}}}}_e]^T$ in $\mathbb {D}_e$, the output matrix ${{\tilde{\mathbf{F}}}}$ is obtained as follows. First, an $e\times n$ kernel matrix $\mathbf {K}_e$ is calculated using Eq. (5), i.e., $(\mathbf {K}_e)_{ij} = K({{\tilde{\mathbf{x}}}}_i, \mathbf {x}_j)$ for $i=1,2,\ldots ,e$ and $j=1,2,\ldots ,n$. Next, the output ${{\tilde{\mathbf{F}}}}$ for ${{\tilde{\mathbf{X}}}}$ is calculated as

$$\begin{aligned} {{\tilde{\mathbf{F}}}} = \mathbf {K}_e\varvec{\varXi }\varvec{\Theta }^*. \end{aligned}$$

(20)

Finally, the label matrix ${{\tilde{\mathbf{Y}}}}$ of ${{\tilde{\mathbf{X}}}}$ is obtained by thresholding each element of ${{\tilde{\mathbf{F}}}}$ at 0. We will henceforth refer to our multi-label extension of MR as Multi-Label Manifold Regularization (ML-MR), and our reliance weighting augmentation as ML-MR with Reliance Weighting (ML-MRRW).
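The out-of-sample step of Eq. (20) followed by thresholding might look like the sketch below. The coefficient matrix here is a random placeholder rather than a fitted $\varvec{\Theta }^*$, the helper names are hypothetical, and the convention that an output of exactly 0 counts as a negative label is an assumption, since the text only says that elements are compared with 0.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """Cross-kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def predict_labels(X_new, X_train, xi, Theta, sigma=1.0):
    """Return real-valued outputs (Eq. (20)) and a boolean label mask."""
    K_e = gaussian_kernel_matrix(X_new, X_train, sigma)   # e x n
    F_new = (K_e * xi[None, :]) @ Theta                   # K_e Xi Theta
    return F_new, F_new > 0                               # True marks a relevant label

rng = np.random.default_rng(2)
n, e, d, n_labels = 6, 4, 3, 2
X_train = rng.normal(size=(n, d))
X_new = rng.normal(size=(e, d))
xi = np.array([1.0, 1.0, 0.5, 0.5, 0.5, 0.5])   # hypothetical reliance weights
Theta = rng.normal(size=(n, n_labels))          # placeholder for a fitted Theta*

F_new, Y_new = predict_labels(X_new, X_train, xi, Theta)
```

Only the $n$ training instances and their weights enter the prediction, so new samples can be scored in batches of any size $e$ without refitting.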

There are clearly many strategies for determining reliance weights. The simplest is to assign uniform weights, namely $\varXi _{i}=\nu _1\in [0,1]$ for $1\le i \le l$ and $\varXi _{i}=\nu _2\in [0,1]$ for $l< i \le l+u$, to all labeled and unlabeled training instances, respectively. These two parameters then determine the balance of trust between labeled and unlabeled training data. The extended manifold regularization is supervised if $\nu _1=1$ and $\nu _2=0$, and unsupervised if $\nu _1=0$ and $\nu _2=1$. The relation $\nu _1=\nu _2$ indicates that $\mathbb {D}_l$ and $\mathbb {D}_u$ have equal impact on label inference, whereas $\nu _1>\nu _2$ indicates that more weight is placed on the labeled instances $\mathbb {D}_l$ than on the unlabeled instances $\mathbb {D}_u$. In this work, we aim to improve the performance of manifold regularization by trusting labeled instances more, and thus the choices of $\nu _1$ and $\nu _2$ must satisfy two criteria, namely $\nu _1=1$ and $\nu _1>\nu _2>0$.
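The uniform weighting strategy can be sketched as a small helper that returns the diagonal of $\varvec{\varXi }$, with labeled instances ordered first as in the indexing above. The function name and the example value $\nu _2=0.2$ are hypothetical; only the constraints $\nu _1,\nu _2\in [0,1]$ come from the text.

```python
import numpy as np

def uniform_reliance_weights(l, u, nu1=1.0, nu2=0.5):
    """Diagonal of Xi: nu1 for the l labeled instances, nu2 for the u unlabeled ones."""
    if not (0.0 <= nu1 <= 1.0 and 0.0 <= nu2 <= 1.0):
        raise ValueError("nu1 and nu2 must lie in [0, 1]")
    return np.concatenate([np.full(l, nu1), np.full(u, nu2)])

# Setting used in this work: nu1 = 1 and nu1 > nu2 > 0 (nu2 = 0.2 is illustrative).
xi = uniform_reliance_weights(l=3, u=4, nu1=1.0, nu2=0.2)
Xi = np.diag(xi)   # the (l + u) x (l + u) reliance weight matrix of Eq. (16)

# Limiting cases described in the text.
xi_supervised = uniform_reliance_weights(3, 4, nu1=1.0, nu2=0.0)
xi_unsupervised = uniform_reliance_weights(3, 4, nu1=0.0, nu2=1.0)
```

With $\nu _1$ fixed at 1, $\nu _2$ is the only additional hyperparameter the weighting introduces.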

Experimental design

This section describes the experiments designed to validate the effectiveness of the proposed ML-MR and ML-MRRW methods on commonly used benchmark data sets. Other semi-supervised multi-label classification methods are evaluated for comparison across a range of performance metrics.

Data sets

Four public data sets from different domains are chosen for the experimental study. Table 1 presents the basic information about these data sets. The first data set “Emotions” [52] consists of sampled waveforms of sound clips drawn from different musical genres. Each instance is labeled with one or more of six emotions: amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-aggressive. The second data set “Scene” [7] is a commonly used image data set in which each image is represented by a 294-dimensional feature vector and labeled with one or more of six classes: beach, sunset, field, fall-foliage, mountain, and urban. The third data set “Yeast” [19] consists of micro-array expression data and phylogenetic profiles for 2417 genes. Each gene is associated with a set of functional classes, which are grouped into 14 functional categories. The last data set “mediamill” [46] consists of digital video archives from the TREC Video Retrieval Evaluation (TRECVID) challenge. This data set contains 120 features and 101 annotation concepts. These data sets are already formatted, so no further pre-processing is needed.

Table 1 Basic information of the selected public data sets
