Abstract
Graph-based learning algorithms, including label propagation and spectral clustering, are known as effective state-of-the-art algorithms for a variety of tasks in machine learning applications. Given input data, i.e., feature vectors, graph-based methods typically proceed with the following three steps: (1) generating graph edges, (2) estimating edge weights, and (3) running a graph-based algorithm. The first two steps are difficult, especially when there are only a few (or no) labeled instances, yet they are important because the performance of graph-based methods heavily depends on the quality of the input graph. For the second step of this three-step procedure, we propose a new method which optimizes edge weights by minimizing a local linear reconstruction error under the constraint that edges are parameterized by a similarity function of node pairs. As a result, our generated graph can capture the manifold structure of the input data, where each edge represents the similarity of each node pair. To further justify this approach, we also provide analytical considerations for our formulation, such as an interpretation as cross-validation of a propagation model in the feature space, and an error analysis based on a low-dimensional manifold model. Experimental results demonstrated the effectiveness of our adaptive edge weighting strategy on both synthetic and real datasets.
Keywords
Graph-based learning · Manifold assumption · Edge weighting · Semi-supervised learning · Clustering
1 Introduction
Graph-based learning algorithms have received considerable attention in the machine learning community. For example, label propagation (e.g., Blum and Chawla 2001; Szummer and Jaakkola 2001; Joachims 2003; Zhu et al. 2003; Zhou et al. 2004; Herbster et al. 2005; Sindhwani et al. 2005; Belkin et al. 2006; Bengio et al. 2006) is widely accepted as a state-of-the-art approach for semi-supervised learning, in which node labels are estimated through the input graph structure. Spectral clustering (e.g., Shi and Malik 2000; Ng et al. 2001; Meila and Shi 2001; von Luxburg 2007) is another well-known graph-based algorithm, in which cluster partitions are determined according to the minimum cut of the given graph. An important common property of these graph-based approaches is that the manifold structure of the input data can be captured by the graph. Their practical performance advantage has been demonstrated in various application areas (e.g., Patwari and Hero 2004; Lee and Kriegman 2005; Zhang and Zha 2005; Fergus et al. 2009; Aljabar et al. 2012).
Graph-based learning methods typically proceed in the following three steps:
 Step 1:
Generating graph edges from given data, where nodes of the generated graph correspond to the instances of input data.
 Step 2:
Giving weights to the graph edges.
 Step 3:
Estimating node labels based on the generated graph, which is often represented as an adjacency matrix.
In this paper, we focus on the second step of the three-step procedure: estimating edge weights for the subsequent label estimation. Optimizing edge weights is difficult in semi- or unsupervised learning, because there are only a small number of (or no) labeled instances. The problem is also important because edge weights heavily affect the final prediction accuracy of graph-based methods, while in practice rather simple heuristic strategies have been employed.
There are two standard approaches for estimating edge weights: similarity function based and locally linear embedding (LLE) (Roweis and Saul 2000) based approaches. Each of these two approaches has its own disadvantage. The similarity based approaches use similarity functions, such as the Gaussian kernel, but most similarity functions have scale parameters (such as the width parameter of the Gaussian kernel) that are in general difficult to tune. In LLE, on the other hand, the true underlying manifold can be approximated by a graph that minimizes a local reconstruction error. LLE is more sophisticated than the similarity-based approach, and LLE based graphs have been applied to semi-supervised learning and clustering (Wang and Zhang 2008; Daitch et al. 2009; Cheng et al. 2009; Liu et al. 2010). However, LLE is noise-sensitive (Chen and Liu 2011). In addition, to avoid a kind of degeneracy problem (Saul and Roweis 2003), LLE needs additional tuning parameters.^{1} Yet another practical approach is to optimize weights by regarding them as hyperparameters of learning methods (e.g., Zhang and Lee 2007). General model selection criteria can also be used, although the reliability of those criteria is unclear for graphs with a small number of labeled instances. We will discuss these related approaches in Sect. 5.
Compared with these existing approaches, our method has the following advantages:

 Our formulation alleviates the problem of overfitting due to the parameterization of weights. We observed that AEW is more robust against noise in the input data and against changes in the number of graph edges.

 Since edge weights are defined by a parameterized similarity function, the resulting weights still represent the similarity of each node pair. This is very reasonable for many graph-based algorithms.
The rest of this paper is organized as follows: In Sect. 2, we briefly review the standard graph-based methods on which we focus in this paper. Section 3 introduces our proposed method for adaptively optimizing graph edge weights. Section 4 describes analytical considerations for our approach, which provide interesting interpretations and an error analysis of AEW. In Sect. 5, we discuss relationships to other existing topics. Section 6 presents experimental results on a variety of synthetic and real-world datasets, demonstrating the performance advantage of the proposed approach. Finally, Sect. 7 concludes the paper.
This paper is an extended version of our preliminary conference paper presented at NIPS 2013 (Karasuyama and Mamitsuka 2013). In this paper, we describe our framework in a more general way by using three well-known graph-based learning methods [harmonic Gaussian field (HGF) model, local and global consistency (LLGC) method, and spectral clustering], while the preliminary version only deals with HGF. Furthermore, we have conducted a more thorough experimental evaluation covering three points: in the semi-supervised setting, (1) comparison with other state-of-the-art semi-supervised methods and (2) comparison with hyperparameter optimization methods; and, as an additional problem setting, (3) comparison on clustering.
2 Graph-based semi-supervised learning and clustering
In this paper we consider label propagation and spectral clustering as the methods for the third step in the three-step procedure. Both are state-of-the-art graph-based learning algorithms in which labels of graph nodes (or clusters) are estimated from a given adjacency matrix.
Suppose that we have n feature vectors \(\mathcal{X}= \{ \varvec{x}_1, \ldots , \varvec{x}_n \}\), where \(\varvec{x}\in \mathbb {R}^p\). An undirected graph \(\mathcal{G}\) is generated from \(\mathcal{X}\), where each node (or vertex) corresponds to a data point \(\varvec{x}_i\). The graph \(\mathcal{G}\) can be represented by the adjacency matrix \(\varvec{W}\in \mathbb {R}^{n \times n}\), whose (i, j)-element \(W_{ij}\) is the weight of the edge between \(\varvec{x}_i\) and \(\varvec{x}_j\). The key idea of graph-based classification is that instances connected by edges with large weights \(W_{ij}\) tend to have the same labels (meaning that labels are kept the same in strongly connected regions of the graph).
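As a concrete illustration of this setup, the following sketch builds the adjacency matrix \(\varvec{W}\) from feature vectors using a k-NN graph with Gaussian edge weights, i.e., the standard instantiation of the first two steps (the function name and parameter defaults are our own illustrative choices, not from the paper):

```python
import numpy as np

def knn_gaussian_adjacency(X, k=10, sigma=1.0):
    """Build a symmetric k-NN graph with Gaussian edge weights.

    W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) if j is among the
    k nearest neighbors of i (or i among those of j), and 0 otherwise.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of node i, excluding i itself
        nbrs = np.argsort(sq[i])[1:k + 1]
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2.0 * sigma ** 2))
    # symmetrize: keep an edge if it appears in either direction
    return np.maximum(W, W.T)
```

The resulting matrix is symmetric with a zero diagonal, as assumed throughout this section.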
2.1 Label propagation
Label propagation is a widely accepted graph-based semi-supervised learning algorithm. Among the many methods which have been proposed so far, we focus on the formulations derived by Zhu et al. (2003) and Zhou et al. (2004), which are the current standard formulations of graph-based semi-supervised learning.
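As a reference point for the formulation of Zhu et al. (2003), the harmonic solution can be sketched as follows: labeled nodes are clamped to their labels, and the scores of unlabeled nodes solve a linear system in the unnormalized graph Laplacian (function and variable names are ours, not the paper's):

```python
import numpy as np

def hgf_propagate(W, Y_l, labeled):
    """Harmonic Gaussian field (HGF) solution of Zhu et al. (2003).

    Labeled nodes are clamped to Y_l (one row per labeled node);
    unlabeled scores solve L_uu F_u = W_ul Y_l, so each unlabeled
    node's score is the weighted average of its neighbors' scores.
    """
    n = W.shape[0]
    labeled = np.asarray(labeled)
    unlabeled = np.setdiff1d(np.arange(n), labeled)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    F = np.zeros((n, Y_l.shape[1]))
    F[labeled] = Y_l
    F[unlabeled] = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                                   W[np.ix_(unlabeled, labeled)] @ Y_l)
    return F
```

On a three-node chain with the two endpoints labeled as different classes, the middle node receives the score (0.5, 0.5), the average of its two neighbors.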
2.2 Spectral clustering
3 Basic framework
The performance of the graph-based algorithms described in the previous section heavily depends on the quality of the input graph. Our proposed approach, adaptive edge weighting (AEW), optimizes the edge weights for the graph-based learning algorithms. In the three-step procedure, AEW addresses the second step and is independent of the first and third steps. In this paper we assume that the input graph is a k-NN graph (i.e., the first step is based on k-NN), while we note that AEW can be applied to any type of graph.
The graph optimized by AEW is expected to have the following two properties:

 Capturing the manifold structure of the input space.

 Representing similarity between two nodes.
3.1 Formulation
3.2 Optimization
The gradient can be computed efficiently due to the sparsity of the adjacency matrix. Since the number of edges of a k-NN graph is O(nk), the derivative of the adjacency matrix \(\varvec{W}\) can be calculated in O(nkp) time, and the entire derivative of the objective function can then be calculated in \(O(nkp^2)\) time. Note that k often takes a small value such as \(k = 10\).
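The objective being differentiated can be sketched as below. For brevity we use a single common bandwidth \(\sigma\) and a finite-difference gradient in place of the paper's analytic one; all names are illustrative and the edge set (k-NN mask) is held fixed, as in AEW:

```python
import numpy as np

def reconstruction_error(X, sigma, sq, nbr_mask):
    """Local reconstruction error: sum_i ||x_i - sum_j W_ij x_j / D_ii||^2,
    with W parameterized by a Gaussian of width sigma on fixed k-NN edges.
    sq: pairwise squared distances; nbr_mask: boolean edge indicator."""
    W = np.where(nbr_mask, np.exp(-sq / (2.0 * sigma ** 2)), 0.0)
    D = W.sum(axis=1, keepdims=True)
    return float(((X - (W @ X) / D) ** 2).sum())

def numeric_grad(X, sigma, sq, nbr_mask, eps=1e-6):
    """Central finite difference d(error)/d(sigma); the paper instead
    derives this gradient analytically in O(nkp^2) time."""
    return (reconstruction_error(X, sigma + eps, sq, nbr_mask)
            - reconstruction_error(X, sigma - eps, sq, nbr_mask)) / (2 * eps)
```

Gradient descent on \(\sigma\) then decreases the reconstruction error while keeping the graph edges fixed.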
3.3 Normalization
The standard similarity function (4) cannot adapt to differences in local scaling around each data point. These differences may cause a highly imbalanced adjacency matrix, in which \(W_{ij}\) takes larger values in high-density regions of the input space and much smaller values in low-density regions. As a result, labels in the high-density regions will be propagated dominantly.
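One standard remedy, symmetric normalization of the adjacency matrix by node degrees, can be sketched as follows (a minimal helper with our own naming):

```python
import numpy as np

def symmetric_normalize(W):
    """Return D^{-1/2} W D^{-1/2}: each weight W_ij is divided by
    sqrt(D_ii * D_jj), damping the dominance of high-degree nodes
    that sit in high-density regions."""
    inv_sqrt_d = 1.0 / np.sqrt(W.sum(axis=1))
    return W * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]
```

The corresponding symmetric normalized Laplacian is the identity minus this matrix; the local scaling kernel used later in the experiments is a complementary remedy applied inside the similarity function itself.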
4 Analytical considerations
In Sect. 3, we defined our approach as the minimization of the local reconstruction error of input features. We here describe several interesting properties and interpretations of this definition.
4.1 Interpretation as feature propagation
4.2 Local linear approximation
The feature propagation model provides an interpretation of our approach as the optimization of the adjacency matrix under the assumption that \(\varvec{x}\) and \(y\) can be reconstructed by the same adjacency matrix. We here justify our approach more formally from the viewpoint of local reconstruction with a lower-dimensional manifold model.
4.2.1 Graph-based learning as local linear reconstruction
First, we show that all graph-based methods we consider can be characterized by local reconstruction of the outputs on the graph. The following proposition shows that the three methods reviewed in Sect. 2 have the local reconstruction property:
Proposition 1
 HGF:$$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii}}\quad \text { for }\;\; i = \ell +1, \ldots , n. \end{aligned}$$
 LLGC:$$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii} + \lambda } + \frac{\lambda Y_{ik}}{D_{ii} + \lambda }\quad \text { for }\; i = 1, \ldots , n. \end{aligned}$$
 Spectral clustering:$$\begin{aligned} F_{ik} = \frac{\sum _{j} W_{ij} F_{jk}}{D_{ii} - \rho _k}\quad \text { for }\; i = 1, \ldots , n, \end{aligned}$$where \(\rho _k\) is the kth smallest eigenvalue of \(\varvec{L}\).
Proof
For HGF, the same equation was shown in Zhu et al. (2003). We here derive the reconstruction equations for LLGC and spectral clustering.
Regarding the optimization problems of the three methods as minimizations of the same penalty term \(\text {trace}(\varvec{F}^\top \varvec{L}\varvec{F})\) under different regularization strategies, which prevent the trivial solution \(\varvec{F}= {\varvec{0}}\), it is natural that a similar reconstruction form is shared by the three methods. Among the three methods, HGF has the most standard form of local averaging: the output of the ith node is the weighted average over its neighbors connected by graph edges. LLGC can be interpreted as a regularized variant of this local averaging. The averaging score \(\varvec{W}\varvec{F}\) is regularized by the initial labeling \(\varvec{Y}\), and the balance of regularization is controlled by the parameter \(\lambda \). Spectral clustering also has a form similar to local reconstruction; the only difference is that the denominator is modified by an eigenvalue of the graph Laplacian. The eigenvalue \(\rho _k\) of the graph Laplacian is smaller when the score matrix \(\varvec{F}\) varies less across neighboring nodes. Spectral clustering thus takes the same local reconstruction form, in particular when the optimal scores have close values for neighboring nodes.
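The LLGC case of Proposition 1 can be checked numerically. Taking LLGC as the minimizer of \(\text{trace}(\varvec{F}^\top \varvec{L}\varvec{F}) + \lambda \Vert \varvec{F}- \varvec{Y}\Vert _F^2\), as described above, its closed form is \(\varvec{F}= \lambda (\varvec{L}+ \lambda \varvec{I})^{-1} \varvec{Y}\), which satisfies the stated reconstruction identity exactly (a small self-contained check, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.5
A = rng.random((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)   # random symmetric weights, no self-loops
D = W.sum(axis=1)
L = np.diag(D) - W         # unnormalized graph Laplacian
Y = rng.random((n, 3))     # arbitrary initial labeling

# LLGC closed form: F = lam * (L + lam*I)^{-1} Y
F = lam * np.linalg.solve(L + lam * np.eye(n), Y)

# Proposition 1 (LLGC): F_ik = (sum_j W_ij F_jk + lam * Y_ik) / (D_ii + lam)
F_rec = (W @ F + lam * Y) / (D + lam)[:, None]
assert np.allclose(F, F_rec)
```

The identity follows directly by rearranging \((\varvec{L}+ \lambda \varvec{I})\varvec{F}= \lambda \varvec{Y}\) row by row.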
4.2.2 Error analysis
Proposition 1 shows that the graph-based learning algorithms can be regarded as local reconstruction methods. We next show the relationship between the local reconstruction error in the feature space, described by our objective function (6), and that in the output space. For simplicity we consider the vector form of the score function \(\varvec{f}~\in ~\mathbb {R}^n\), which can be considered as a special case of the score matrix \(\varvec{F}\); the discussion here carries over to \(\varvec{F}\).
Theorem 1
Proof
From (10), we can see that the reconstruction error of \(y_i\) consists of three terms. The first term includes the reconstruction error for \(\varvec{x}_i\), which is represented by \(\varvec{e}_i\), and the second term is the distance between \(\varvec{\tau }_i\) and \(\{ \varvec{\tau }_j \}_{j \sim i}\). Minimizing our objective function corresponds to reducing the first term, which means that reconstruction weights estimated in the input space provide an approximation of reconstruction weights in the output space. The first two terms in (10) have a kind of trade-off relationship, because we can reduce \(\varvec{e}_i\) by using many data points \(\varvec{x}_j\), but then \( \delta \varvec{\tau }_i\) would increase. The third term is the intrinsic noise, which we cannot directly control.
A simple approach to exploiting this theorem would be a regularization formulation, minimizing a combination of the reconstruction error for \(\varvec{x}\) and a penalty term on distances between data points connected by edges. Regularized LLE (Wang et al. 2008; Cheng et al. 2009; Elhamifar and Vidal 2011; Kong et al. 2012) can be interpreted as one realization of such an approach. However, in the context of semi-supervised and unsupervised learning, selecting an appropriate value of the regularization parameter for such a term is difficult. We therefore optimize edge weights through the parameters of a similarity function, specifically the bandwidth parameter \(\sigma \) of the Gaussian similarity function. In this approach, a very large bandwidth (giving large weights to distant data points) may cause a large reconstruction error, while an extremely small bandwidth may not give enough weight to reconstruct.
For the symmetric normalized graph Laplacian, we cannot apply Theorem 1 to our algorithm. For example, in HGF, the local averaging relation for the normalized Laplacian is \(f_i = \sum _{j \sim i} W_{ij} f_j / \sqrt{D_{ii} D_{jj}}\). The following theorem is the normalized counterpart of Theorem 1:
Theorem 2
Proof

Since the sum of the reconstruction coefficients \(W_{ij}/\sqrt{D_{ii}D_{jj}}\) is no longer 1, the interpretation as a local linear patch cannot be applied to (13).

The reconstruction error (objective function) led by Theorem 2 (i.e., \(\varvec{x}_i - \sum _{j \sim i} {W_{ij} \varvec{x}_j}/{\sqrt{D_{ii} D_{jj}}}\)) results in a more complicated optimization.

The error Eq. (14) in Theorem 2 has the additional term \((1 - \sum _{j \sim i} \gamma _j ) (h(\varvec{\tau }_i) + \varvec{J}g(\varvec{\tau }_i))\) compared to Theorem 1.
5 Related topics
We here describe relations of our approach with other related topics.
5.1 Relation to LLE
The objective function (6) is similar to the local reconstruction error of LLE (Roweis and Saul 2000), in which \(\varvec{W}\) is directly optimized as a real-valued matrix. This approach has been used in many methods for graph-based semi-supervised learning and clustering (Wang and Zhang 2008; Daitch et al. 2009; Cheng et al. 2009; Liu et al. 2010), but LLE is very noise-sensitive (Chen and Liu 2011) and the resulting weights \(W_{ij}\) do not necessarily represent the similarity between the corresponding nodes (i, j). For example, for two nearly identical points \(\varvec{x}_{j_1}\) and \(\varvec{x}_{j_2}\), both connected to \(\varvec{x}_i\), it is not guaranteed that \(W_{ij_1}\) and \(W_{ij_2}\) have similar values. To solve this problem, a regularization term can be introduced (Saul and Roweis 2003), but it is not easy to optimize the regularization parameter for this term. We instead optimize the parameters of a similarity (kernel) function. This parameterized form of edge weights alleviates the overfitting problem, and, obviously, the optimized weights still represent node similarity.
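The degeneracy can be seen in the local LLE solve itself. The sketch below computes reconstruction weights for one node under the sum-to-one constraint, with a Saul and Roweis (2003)-style regularizer on the local Gram matrix (illustrative code, not from the paper):

```python
import numpy as np

def lle_weights(x, nbrs, reg=1e-3):
    """Weights minimizing ||x - sum_j w_j nbrs[j]||^2  s.t.  sum_j w_j = 1.

    Solved via C w proportional to 1, with C the local Gram matrix;
    reg > 0 (a Tikhonov term scaled by trace(C)) is needed when the
    neighbors are nearly collinear or outnumber the input dimensions.
    """
    Z = nbrs - x                  # neighbors shifted to the query point
    C = Z @ Z.T
    C = C + reg * np.trace(C) * np.eye(len(nbrs))
    w = np.linalg.solve(C, np.ones(len(nbrs)))
    return w / w.sum()
```

With two nearly identical neighbors and reg close to 0, the two weights can differ wildly even though the neighbors are essentially the same point; with the regularizer they come out nearly equal, which is the behavior a similarity-based parameterization gives for free.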
5.2 Other hyperparameter optimization strategies
AEW optimizes the parameters of graph edge weights without labeled instances. This property is powerful, especially when there are only a few (or no) labeled instances. Several methods have been proposed for optimizing graph edge weights with standard model selection approaches (such as cross-validation and marginal likelihood maximization) by regarding them as usual hyperparameters in supervised learning (Zhu et al. 2005; Kapoor et al. 2006; Zhang and Lee 2007; Muandet et al. 2009), but most of those methods need labeled instances and become unreliable when only a few labels are available. Another approach is to optimize some criterion designed specifically for each graph-based algorithm (e.g., Ng et al. 2001; Zhu et al. 2003; Bach and Jordan 2004). Some of these criteria, however, have degenerate (trivial) solutions; heuristics have been proposed to prevent such solutions, but their validity is not clear. Compared to these approaches, our approach is more general and flexible with respect to problem settings, because AEW is independent of the number of classes (clusters), the number of labels, and the subsequent learning algorithm (the third step). In addition, model selection based approaches basically target the third step of the three-step procedure, so AEW can be combined with such methods: the graph optimized by AEW can be used as their input graph.
5.3 Graph construction
Besides k-NN, there are several methods for generating a graph (edges) from feature vectors (e.g., Talukdar 2009; Jebara et al. 2009; Liu et al. 2010). Our approach can also be applied to those graphs because AEW only optimizes the weights of edges. In our experiments, we used the edges of the k-NN graph as the initial graph of AEW. We then observed that AEW is not sensitive to the choice of k, compared with usual k-NN graphs. This is because minimizing the reconstruction error (6) makes the Gaussian similarity value small whenever \(\varvec{x}_i\) and \(\varvec{x}_j\) are not close to each other. In other words, redundant weights can be reduced drastically, because in the Gaussian kernel, weights decay exponentially with the squared distance.
In the context of spectral clustering, the connection to statistical properties of graph construction has been analyzed (Maier et al. 2009, 2013). For example, Maier et al. (2013) have shown conditions under which the optimal convergence rate is achieved for a few types of graphs. These studies also indicate that the quality of the input graph affects the final performance of learning methods.
5.4 Discussion on manifold assumption
A key assumption for our approach is the manifold assumption, which has been widely accepted in semi-supervised learning (e.g., see Chapelle et al. 2010). Under the manifold assumption, the input data are assumed to lie on a lower-dimensional manifold compared to the original input space. Although accurately verifying the manifold assumption itself would be difficult (because it is equivalent to estimating the intrinsic dimensionality of the input data), the graph-based approach is known as a practical approximation of the underlying manifold that is applicable without knowing this dimensionality. Many empirical evaluations have revealed that the manifold assumption-based approaches (most of them graph-based) achieve high accuracy in various applications, particularly for image and text data (see e.g., Patwari and Hero 2004; Lee and Kriegman 2005; Zhang and Zha 2005; Fergus et al. 2009; Chapelle et al. 2010; Aljabar et al. 2012). In these applications, the manifold assumption is reasonable, as implied by prior knowledge of the given data (e.g., in face image classification, each person lies on a different low-dimensional manifold of pixels).
Another important assumption in most graph-based semi-supervised learning is the cluster assumption (or low-density separation), in which different classes are assumed to be separated by a low-density region. Graph-based approaches assume that nodes in the same class are densely connected while nodes in different classes are not. If different classes are not separated by a low-density region, a nearest-neighbor graph may connect different classes, which can cause misclassification by propagating wrong class information. Several papers have addressed this problem (Wang et al. 2011; Gong et al. 2012). They considered the existence of singular points at which manifolds of different classes intersect. Their approach measures the similarity of two instances through the similarity of their tangent spaces, but it has to accurately model both the local tangent spaces and their similarity measure, which introduces additional parameters and estimation errors. We perform an experimental evaluation of this approach in Sect. 6.2.1.
6 Experiments
We evaluate the performance of our approach on synthetic and real-world datasets. AEW is applicable to all graph-based learning methods reviewed in Sect. 2. We investigated the performance of AEW using the harmonic Gaussian field (HGF) model and the local and global consistency (LLGC) model in semi-supervised learning, and using spectral clustering (SC) in unsupervised learning. For comparison in semi-supervised learning, we used linear neighborhood propagation (LNP) (Wang and Zhang 2008), which generates a graph using an LLE based objective function. LNP has two regularization parameters, one for the LLE process (the first and second steps in the three-step procedure) and the other for the label estimation process (the third step). For the parameter in the LLE process, we used the heuristics suggested by Saul and Roweis (2003), and for the label propagation process, we chose the best parameter value in terms of the test accuracy. LLGC also has a regularization parameter in the propagation process (3), and again we chose the best one. This choice was made to remove the effect of model selection and to compare the quality of the graphs directly. HGF does not have such hyperparameters. All experimental results were averaged over 30 runs with randomly sampled data points.
6.1 Synthetic datasets
Using the simple synthetic datasets in Fig. 2, we here illustrate the advantage of AEW by comparing prediction performance in the semi-supervised learning scenario. The two datasets in Fig. 2 have the same form, but the dataset in Fig. 2b has several noisy data points which may become bridge points (points that can connect different classes, as defined by Wang and Zhang 2008). In both cases, the number of classes is 4 and each class has 100 data points (thus, \(n = 400\)).
Table 1 shows the error rates on the unlabeled nodes for HGF and LNP under 0–1 loss. For HGF, we used the median heuristic to choose the parameter \(\sigma _d\) in the similarity function (4), meaning that a common \(\sigma (= \sigma _1 = \cdots = \sigma _p\)) is set to the median distance over all connected pairs of \(\varvec{x}_i\); for the normalization of the graph Laplacian, symmetric normalization was used. The optimization in AEW started from this median \(\sigma _d\). The results of AEW are shown in the column 'AEW + HGF' of Table 1. The number of labeled nodes was 10 in each class (\(\ell = 40\), i.e., 10% of the entire dataset), and the number of neighbors in the graphs was set to \(k = 10\) or 20.
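The median heuristic used here can be sketched as follows (function and parameter names are our own):

```python
import numpy as np

def median_sigma(X, edge_mask):
    """Common Gaussian width by the median heuristic: the median
    Euclidean distance over all connected pairs in the graph
    (used as the starting point of AEW's optimization)."""
    i, j = np.nonzero(np.triu(edge_mask, k=1))  # each edge counted once
    dists = np.sqrt(((X[i] - X[j]) ** 2).sum(axis=1))
    return float(np.median(dists))
```

For instance, three fully connected 1-D points at 0, 1, and 3 give pairwise distances 1, 2, and 3, so the heuristic returns 2.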
Test error comparison for synthetic datasets
Dataset  k  HGF  AEW \(+\) HGF  LNP 

(a)  10  0.057 (0.039)  0.020 (0.027)  0.039 (0.026) 
(a)  20  0.261 (0.048)  0.020 (0.028)  0.103 (0.042) 
(b)  10  0.119 (0.054)  0.073 (0.035)  0.103 (0.038) 
(b)  20  0.280 (0.051)  0.077 (0.035)  0.148 (0.047) 
6.2 Real-world datasets
List of datasets
Dataset  n  p  No. of classes

COIL  500  256  10 
USPS  1000  256  10 
MNIST  1000  784  10 
ORL  360  644  40 
Vowel  792  10  11 
Yale  250  1200  5 
Optdigit  1000  256  10 
UMIST  518  644  20 
6.2.1 Semi-supervised learning
 1.
Minimization of entropy (Zhu et al. 2003) with normalized Laplacian (NL-MinEnt): NL-MinEnt is represented by solid lines with the star-shaped marker. We set the smoothing factor parameter \(\varepsilon \) in NL-MinEnt, which prevents the Gaussian width parameter from converging to 0, to \(\varepsilon = 0.01\); if the solution still converged to such a trivial solution, we increased \(\varepsilon \) by a factor of 10. The results of NL-MinEnt were unstable: although MinEnt stably improved the accuracy for all numbers of labels in the COIL dataset, NL-MinEnt sometimes largely deteriorated the accuracy, for example, at some numbers of labels (including 4 per class) for the ORL dataset.
 2.
Cross-validation with normalized Laplacian (NL-CV): We employed 5-fold cross-validation and the method proposed by Zhang and Lee (2007), which imposes an additional penalty on the CV error, defined as the deviation of the Gaussian width parameter from a predefined value \(\tilde{\sigma }\). We used the median heuristic for \(\tilde{\sigma }\) and selected the additional regularization parameter for hyperparameter optimization from \(\{ 0.001, 0.01, 0.1 \}\) by 5-fold CV with different data partitioning (note that we cannot apply the CV approach when there is only one labeled instance per class).
Overall, AEW-NL-HGF had the best prediction accuracy; typical examples were USPS and MNIST. Although Theorem 1 exactly holds only for AEW-L-HGF among all methods, we can see that AEW-NL-HGF, which is scaled both for the Euclidean distance in the similarity and for the degrees of the graph nodes, had highly stable performance. This result suggests that balancing node weights (through the normalized Laplacian) is practically advantageous for stabilizing the propagation of label information.
 1.
Spectral multi-manifold clustering (SMMC) (Wang et al. 2011): Wang et al. (2011) proposed generating a similarity matrix using the similarity of tangent spaces of two instances. One of their main claims is that SMMC can handle intersections of manifolds, which are difficult to deal with by standard nearest-neighbor graph methods, as mentioned in Sect. 5.4. Although SMMC was proposed for graph construction in spectral clustering, the resulting graph can be used directly for semi-supervised learning. We used the implementation by the authors,^{2} and a recommended parameter setting in which the number of local probabilistic principal component analysis (PPCA) models is \(M = \lceil n/(10d) \rceil \), the number of neighbors is \(k = 2 \lceil \log (n) \rceil \), and the parameter of the exponential function in the similarity is \(o = 8\), where d is the reduced dimension of PPCA, which was set to \(d = 10\) except for the Vowel dataset (\(d = 5\) was used for Vowel because its original dimension is 10). Due to the high computational complexity of SMMC, which is at least \(O(n^3 + n^2 d p^2)\) (see the paper for details), we used these default settings instead of performing cross-validation.
 2.
Measure propagation (MP) (Subramanya and Bilmes 2011): MP estimates class assignment probabilities using a Kullback–Leibler (KL) divergence based criterion. We used the implementation provided by the authors.^{3} Subramanya and Bilmes (2011) proposed an efficient alternating minimization procedure, which enabled us to perform cross-validation based model selection in the experiment. We tuned three parameters of MP: two regularization parameters \(\mu \) and \(\nu \), and a kernel width parameter. The regularization parameters were selected from \(\mu \in \{ 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 1 \}\) and \(\nu \in \{ 10^{-4}, 10^{-3}, 10^{-2}, 0.1, 1 \}\). For the kernel function, we employed the local scaling kernel, in which the width parameters were set as \(\sigma _d = 10^a \sigma \), where \(\sigma \) is the value determined by the median heuristic and the parameter a was selected by cross-validation over 5 values taken uniformly from \([-1,1]\). When we could not perform cross-validation because of the lack of labeled instances, we used \(\mu = 10^{-2}\), \(\nu = 10^{-2}\), and \(a = 0\).
Test error rate comparison using LLGC (averages on 30 runs and their standard deviation)
Dataset  \(\ell\)  N-LLGC  L-LLGC  NL-LLGC  AEW-N-LLGC  AEW-L-LLGC  AEW-NL-LLGC  LNP

COIL 1  0.310 (0.047)  0.296 (0.046)  0.296 (0.046)  0.176 (0.031)\(^*\)  0.158 (0.035) \(^*\)  0.172 (0.035)\(^*\)  0.239 (0.027) 
5  0.179 (0.034)  0.170 (0.033)  0.170 (0.034)  0.087 (0.029) \(^*\)  0.084 (0.033) \(^*\)  0.086 (0.032) \(^*\)  0.111 (0.027) 
10  0.116 (0.026)  0.112 (0.024)  0.110 (0.026)  0.047 (0.021) \(^*\)  0.050 (0.020) \(^*\)  0.051 (0.018) \(^*\)  0.066 (0.022) 
USPS 1  0.290 (0.068)  0.258 (0.064)  0.271 (0.066)  0.263 (0.060)\(^*\)  0.220 (0.061) \(^*\)  0.241 (0.063)\(^*\)  0.369 (0.078) 
5  0.174 (0.018)  0.152 (0.017)  0.155 (0.017)  0.150 (0.019)\(^*\)  0.121 (0.018) \(^*\)  0.127 (0.018)\(^*\)  0.219 (0.026) 
10  0.140 (0.013)  0.123 (0.014)  0.124 (0.012)  0.117 (0.016)\(^*\)  0.096 (0.012) \(^*\)  0.097 (0.012) \(^*\)  0.171 (0.018) 
MNIST 1  0.420 (0.053)  0.424 (0.056)  0.404 (0.053)  0.387 (0.051)\(^*\)  0.356 (0.054) \(^*\)  0.361 (0.048) \(^*\)  0.471 (0.038) 
5  0.242 (0.019)  0.244 (0.026)  0.229 (0.019)  0.216 (0.020)\(^*\)  0.212 (0.024)\(^*\)  0.194 (0.022) \(^*\)  0.291 (0.025) 
10  0.204 (0.017)  0.199 (0.017)  0.192 (0.018)  0.179 (0.017)\(^*\)  0.167 (0.016) \(^*\)  0.161 (0.016) \(^*\)  0.241 (0.022) 
ORL 1  0.282 (0.026)  0.262 (0.026)  0.265 (0.027)  0.230 (0.023)\(^*\)  0.212 (0.028) \(^*\)  0.224 (0.028)\(^*\)  0.272 (0.026) 
3  0.171 (0.025)  0.142 (0.024)  0.137 (0.024)  0.092 (0.022)\(^*\)  0.084 (0.021) \(^*\)  0.086 (0.022) \(^*\)  0.112 (0.023) 
5  0.140 (0.020)  0.106 (0.018)  0.101 (0.019)  0.052 (0.017) \(^*\)  0.049 (0.016) \(^*\)  0.049 (0.017) \(^*\)  0.062 (0.016) 
Vowel 2  0.584 (0.032)  0.577 (0.029)  0.581 (0.031)  0.583 (0.030)  0.573 (0.030)  0.575 (0.028) \(^*\)  0.601 (0.029) 
10  0.337 (0.030)  0.326 (0.031)  0.325 (0.029)  0.314 (0.027)\(^*\)  0.303 (0.031) \(^*\)  0.306 (0.029) \(^*\)  0.344 (0.032) 
20  0.232 (0.023)  0.210 (0.022)  0.206 (0.023)  0.161 (0.023) \(^*\)  0.160 (0.023) \(^*\)  0.159 (0.023) \(^*\)  0.204 (0.024) 
Yale 1  0.679 (0.052)  0.678 (0.056)  0.675 (0.060)  0.554 (0.060) \(^*\)  0.539 (0.082) \(^*\)  0.544 (0.072) \(^*\)  0.537 (0.075) 
5  0.490 (0.043)  0.481 (0.046)  0.484 (0.045)  0.309 (0.057)\(^*\)  0.292 (0.061) \(^*\)  0.297 (0.059)\(^*\)  0.313 (0.048) 
10  0.394 (0.041)  0.385 (0.044)  0.386 (0.041)  0.226 (0.038)\(^*\)  0.217 (0.040) \(^*\)  0.224 (0.040)\(^*\)  0.230 (0.037) 
Optdigits 1  0.108 (0.040)  0.102 (0.038)  0.107 (0.039)  0.093 (0.038) \(^*\)  0.087 (0.034) \(^*\)  0.095 (0.034)\(^*\)  0.167 (0.065) 
5  0.051 (0.012)  0.051 (0.012)  0.050 (0.012)  0.043 (0.011) \(^*\)  0.044 (0.012) \(^*\)  0.044 (0.011) \(^*\)  0.073 (0.014) 
10  0.041 (0.006)  0.039 (0.006)  0.040 (0.006)  0.035 (0.007) \(^*\)  0.035 (0.007) \(^*\)  0.034 (0.006) \(^*\)  0.055 (0.011) 
UMIST 1  0.423 (0.032)  0.406 (0.032)  0.411 (0.032)  0.292 (0.035)\(^*\)  0.255 (0.036) \(^*\)  0.274 (0.042)\(^*\)  0.362 (0.028) 
5  0.172 (0.020)  0.165 (0.019)  0.165 (0.019)  0.099 (0.017)\(^*\)  0.089 (0.017) \(^*\)  0.095 (0.017)\(^*\)  0.122 (0.023) 
10  0.090 (0.018)  0.080 (0.017)  0.081 (0.017)  0.040 (0.014)\(^*\)  0.035 (0.013) \(^*\)  0.039 (0.013)\(^*\)  0.046 (0.017) 
Next, we used the LLGC model for comparison. Table 3 shows the test error rates for the eight datasets in Table 2. Here again, we can see that AEW improved the test error rates of LLGC; AEW-L-LLGC showed the best performance in all but one case (MNIST with \(\ell = 5\)). In this table, the symbol '\(^*\)' means that AEW improved the accuracy over the corresponding method without AEW (shown in one of the left three columns) according to a t-test with significance level 5%. AEW improved the prediction performance of LLGC except for Vowel with the smallest number of labels, and all three methods with AEW outperformed LNP in all 24 (\(= 3 \times 8\)) cases with only one exception.
In Table 3, AEWLLLGC or AEWNLLLGC (i.e., AEW with local scaling kernel) achieved the lowest test error rates in 22 out of all 24 cases. This result suggests that incorporating differences of local scaling is important for realworld datasets. We can also see that AEWNLLLGC, for which Theorem 1 does not hold exactly, shows the best in 14 cases (highlighted by boldface) and the second best in 9 cases out of all 24 cases.
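The local scaling kernel referred to above (Zelnik-Manor and Perona 2004) sets a per-node bandwidth from the distance to the k-th nearest neighbor, so that each edge weight adapts to the local density around both endpoints. A minimal stdlib-only sketch (function and parameter names are ours, not the paper's):

```python
import math

def local_scaling_weights(X, k=2):
    """Edge weights w_ij = exp(-d_ij^2 / (s_i * s_j)), where s_i is the
    Euclidean distance from x_i to its k-th nearest neighbor
    (the self-tuning kernel of Zelnik-Manor & Perona 2004)."""
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # s_i: distance to the k-th nearest neighbor; index k skips the
    # zero self-distance that sorting puts first.
    scale = [sorted(dist[i])[k] for i in range(n)]
    return [[math.exp(-dist[i][j] ** 2 / (scale[i] * scale[j]))
             for j in range(n)] for i in range(n)]
```

In practice this dense matrix would be sparsified to a kNN graph before running label propagation or spectral clustering; the sketch only illustrates the weighting itself.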
Clustering performance comparison using ARI (averages over 30 runs, standard deviations in parentheses)

| Dataset | NSC | LSC | NLSC | AEW-NSC | AEW-LSC | AEW-NLSC | SMCE | SMMC | k-means | kernel k-means |
|---|---|---|---|---|---|---|---|---|---|---|
| COIL | 0.545 (0.040) | 0.508 (0.040) | 0.546 (0.043) | 0.717\(^*\) (0.047) | 0.700\(^*\) (0.058) | 0.738\(^*\) (0.055) | 0.488 (0.044) | 0.429 (0.027) | 0.453 (0.037) | 0.445 (0.041) |
| USPS | 0.585 (0.030) | 0.523 (0.122) | 0.600 (0.032) | 0.619\(^*\) (0.044) | 0.553 (0.119) | 0.640\(^*\) (0.044) | 0.537 (0.033) | 0.438 (0.071) | 0.505 (0.031) | 0.518 (0.032) |
| MNIST | 0.387 (0.034) | 0.082 (0.024) | 0.405 (0.033) | 0.417\(^*\) (0.047) | 0.469\(^*\) (0.047) | 0.444\(^*\) (0.051) | 0.383 (0.031) | 0.423 (0.122) | 0.349 (0.031) | 0.333 (0.033) |
| ORL | 0.641 (0.021) | 0.539 (0.115) | 0.666 (0.024) | 0.624 (0.025) | 0.533 (0.094) | 0.654 (0.031) | 0.577 (0.028) | 0.564 (0.018) | 0.614 (0.030) | 0.503 (0.037) |
| Vowel | 0.185 (0.017) | 0.168 (0.023) | 0.187 (0.018) | 0.131 (0.051) | 0.146 (0.033) | 0.193 (0.016) | 0.160 (0.012) | 0.157 (0.042) | 0.212 (0.009) | 0.214 (0.015) |
| Yale | 0.092 (0.016) | 0.087 (0.014) | 0.099 (0.016) | 0.302\(^*\) (0.048) | 0.171\(^*\) (0.052) | 0.280\(^*\) (0.053) | 0.291 (0.059) | 0.073 (0.013) | 0.002 (0.007) | 0.005 (0.009) |
| Optdigits | 0.873 (0.039) | 0.810 (0.116) | 0.887 (0.040) | 0.889 (0.042) | 0.821 (0.083) | 0.881 (0.034) | 0.799 (0.030) | 0.423 (0.122) | 0.674 (0.022) | 0.700 (0.009) |
| UMIST | 0.426 (0.024) | 0.385 (0.036) | 0.425 (0.023) | 0.560\(^*\) (0.052) | 0.450\(^*\) (0.103) | 0.524\(^*\) (0.047) | 0.245 (0.025) | 0.350 (0.014) | 0.336 (0.020) | 0.338 (0.019) |
6.2.2 Clustering
We also demonstrate the effectiveness of AEW in an unsupervised learning scenario, namely clustering. Table 4 shows the adjusted Rand index (ARI) (Hubert and Arabie 1985) of each compared method. ARI extends the Rand index (Rand 1971), which evaluates the accuracy of clustering results through pairwise comparison, so that the expected value of ARI under random labeling is 0 (and the maximum is 1). We employed spectral clustering (SC) as the graph-based clustering method to which AEW applies, and for comparison we used sparse manifold clustering and embedding (SMCE) (Elhamifar and Vidal 2011), k-means clustering, and kernel k-means clustering. SMCE generates a graph using an LLE-based objective function. The difference from LLE is that SMCE prevents connections between different manifolds by penalizing edge weights with a weighted \(L_1\) norm according to the distances between node pairs. However, as is often the case with LLE, there is no reliable way to select the regularization parameter. For further comparison we again used SMMC, because a graph created by this method is also applicable to spectral clustering; the same parameter settings as in the semi-supervised case were used.
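ARI can be computed directly from the contingency table of the two partitions via pair counts; a stdlib-only sketch (our own helper, not part of the paper's method):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI (Hubert & Arabie 1985): pair-counting agreement between two
    partitions, corrected for chance so that random labelings score
    approximately 0 and identical partitions score 1."""
    n = len(labels_true)
    # n_ij: number of items placed in cluster i by one labeling and j by the other
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Cluster labels are arbitrary identifiers here: relabeling a partition does not change its ARI, which is why the index is suitable for comparing clusterings against ground truth.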
In Table 4, we can see that either AEW-NSC or AEW-NLSC achieved the best ARI on all datasets except Vowel. These two methods significantly improved the performance of SC with kNN graphs on four datasets: COIL, USPS, Yale, and UMIST. LSC and AEW-LSC, i.e., SC and SC with AEW both using the unnormalized graph Laplacian, were not comparable to the variants using the normalized graph Laplacian. AEW-LSC significantly improved the performance of LSC on three datasets, but was not comparable to AEW-NSC and AEW-NLSC. These results suggest that normalization of the graph Laplacian is more important for SC than for label propagation. As a result, AEW-NLSC showed the most stable performance among the three methods with AEW. On Vowel, k-means and kernel k-means achieved the best performance, suggesting that the lower-dimensional manifold model assumed in Theorem 1 is not suitable for this dataset. SMMC achieved performance comparable to the competing methods only on MNIST; the difficulty with this method lies in its parameter tuning. Overall, however, we emphasize that spectral clustering with AEW achieved the best performance on all datasets except Vowel.
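The normalization contrasted above amounts to replacing the unnormalized Laplacian \(L = D - W\) with the symmetric normalized form \(L_{sym} = I - D^{-1/2} W D^{-1/2}\). A stdlib-only sketch of the two constructions (helper name is ours; it assumes a symmetric weight matrix with zero diagonal and no isolated nodes):

```python
import math

def laplacians(W):
    """Return (unnormalized, symmetric-normalized) graph Laplacians of a
    symmetric weight matrix W: L = D - W and L_sym = I - D^{-1/2} W D^{-1/2}."""
    n = len(W)
    deg = [sum(row) for row in W]                 # node degrees (row sums of W)
    L = [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]  # assumes every degree > 0
    L_sym = [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
              for j in range(n)] for i in range(n)]
    return L, L_sym
```

Normalized spectral clustering (e.g., Ng et al. 2001) then embeds the nodes using the eigenvectors of \(L_{sym}\) with the smallest eigenvalues and runs k-means on the (row-normalized) rows of that embedding.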
7 Conclusions
We have proposed the adaptive edge weighting (AEW) method for graph-based learning algorithms such as label propagation and spectral clustering. AEW is based on minimizing the local reconstruction error under the constraint that each edge weight takes the form of a similarity function of the corresponding node pair. Due to this constraint, AEW has several advantages over LLE-based approaches, which have a similar objective function: for example, the parameterized form of the edge weights alleviates the noise sensitivity of LLE, and the similarity form of the edge weights is natural for many graph-based methods. We also provided several properties of AEW by which our objective function can be motivated analytically. Experimental results demonstrated that AEW can substantially improve the performance of graph-based algorithms, and AEW outperformed LLE-based approaches in almost all cases.
Acknowledgements
M.K. has been partially supported by JSPS KAKENHI 26730120, and H.M. has been partially supported by MEXT KAKENHI 16H02868 and FiDiPro, Tekes.
References
 Aljabar, P., Wolz, R., & Rueckert, D. (2012). Manifold learning for medical image registration, segmentation, and classification. In Machine learning in computer-aided diagnosis: Medical imaging intelligence and analysis. IGI Global.
 Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
 Bach, F. R., & Jordan, M. I. (2004). Learning spectral clustering. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16). Cambridge, MA: MIT Press.
 Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
 Bengio, Y., Delalleau, O., & Le Roux, N. (2006). Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, & A. Zien (Eds.), Semi-supervised learning (pp. 193–216). Cambridge, MA: MIT Press.
 Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In C. E. Brodley & A. P. Danyluk (Eds.), Proceedings of the 18th international conference on machine learning (pp. 19–26). Los Altos, CA: Morgan Kaufmann.
 Chapelle, O., Schölkopf, B., & Zien, A. (2010). Semi-supervised learning (1st ed.). Cambridge, MA: The MIT Press.
 Chen, J., & Liu, Y. (2011). Locally linear embedding: A survey. Artificial Intelligence Review, 36, 29–48.
 Cheng, H., Liu, Z., & Yang, J. (2009). Sparsity induced similarity measure for label propagation. In IEEE 12th international conference on computer vision (pp. 317–324). Piscataway, NJ: IEEE.
 Chung, F. R. K. (1997). Spectral graph theory. Providence, RI: American Mathematical Society.
 Daitch, S. I., Kelner, J. A., & Spielman, D. A. (2009). Fitting a graph to vector data. In Proceedings of the 26th international conference on machine learning (pp. 201–208). New York, NY: ACM.
 Elhamifar, E., & Vidal, R. (2011). Sparse manifold clustering and embedding. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 24, pp. 55–63).
 Fergus, R., Weiss, Y., & Torralba, A. (2009). Semi-supervised learning in gigantic image collections. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 22, pp. 522–530). Red Hook, NY: Curran Associates Inc.
 Georghiades, A., Belhumeur, P., & Kriegman, D. (2001). From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 643–660.
 Gong, D., Zhao, X., & Medioni, G. G. (2012). Robust multiple manifold structure learning. In Proceedings of the 29th international conference on machine learning. Omnipress.
 Graham, D. B., & Allinson, N. M. (1998). Characterizing virtual eigensignatures for general purpose face recognition. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, & T. S. Huang (Eds.), Face recognition: From theory to applications; NATO ASI Series F, computer and systems sciences (Vol. 163, pp. 446–456).
 Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. J. (2007). A kernel method for the two-sample problem. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems (Vol. 19, pp. 513–520). Cambridge, MA: MIT Press.
 Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer.
 Herbster, M., Pontil, M., & Wainer, L. (2005). Online learning over graphs. In Proceedings of the 22nd annual international conference on machine learning (pp. 305–312). New York, NY: ACM.
 Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
 Jebara, T., Wang, J., & Chang, S.-F. (2009). Graph construction and b-matching for semi-supervised learning. In A. P. Danyluk, L. Bottou, & M. L. Littman (Eds.), Proceedings of the 26th annual international conference on machine learning (p. 56). New York, NY: ACM.
 Jegou, H., Harzallah, H., & Schmid, C. (2007). A contextual dissimilarity measure for accurate and efficient image search. In 2007 IEEE computer society conference on computer vision and pattern recognition. Washington, DC: IEEE Computer Society.
 Joachims, T. (2003). Transductive learning via spectral graph partitioning. In T. Fawcett & N. Mishra (Eds.), Machine learning, proceedings of the 20th international conference (pp. 290–297). Menlo Park, CA: AAAI Press.
 Kapoor, A., Qi, Y. A., Ahn, H., & Picard, R. (2006). Hyperparameter and kernel learning for graph based semi-supervised classification. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems (Vol. 18, pp. 627–634). Cambridge, MA: MIT Press.
 Karasuyama, M., & Mamitsuka, H. (2013). Manifold-based similarity adaptation for label propagation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 26, pp. 1547–1555). Red Hook, NY: Curran Associates Inc.
 Kong, D., Ding, C. H., Huang, H., & Nie, F. (2012). An iterative locally linear embedding algorithm. In J. Langford & J. Pineau (Eds.), Proceedings of the 29th international conference on machine learning (pp. 1647–1654). New York, NY: Omnipress.
 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
 Lee, K.-C., & Kriegman, D. (2005). Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In IEEE conference on computer vision and pattern recognition (pp. 852–859).
 Liu, W., He, J., & Chang, S.-F. (2010). Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th international conference on machine learning (pp. 679–686). New York, NY: Omnipress.
 Maier, M., von Luxburg, U., & Hein, M. (2009). Influence of graph construction on graph-based clustering measures. In Advances in neural information processing systems (Vol. 21, pp. 1025–1032). Red Hook, NY: Curran Associates Inc.
 Maier, M., von Luxburg, U., & Hein, M. (2013). How the result of graph clustering methods depends on the construction of the graph. ESAIM: Probability and Statistics, 17, 370–418.
 Meila, M., & Shi, J. (2001). A random walks view of spectral segmentation. In T. Jaakkola & T. Richardson (Eds.), Proceedings of the eighth international workshop on artificial intelligence and statistics. Los Altos, CA: Morgan Kaufmann.
 Muandet, K., Marukatat, S., & Nattee, C. (2009). Robust graph hyperparameter learning for graph based semi-supervised classification. In 13th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 98–109).
 Nene, S. A., Nayar, S. K., & Murase, H. (1996). Columbia object image library (COIL-20). Technical Report CUCS-005-96.
 Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14, pp. 849–856). Cambridge, MA: MIT Press.
 Patwari, N., & Hero, A. O. (2004). Manifold learning algorithms for localization in wireless sensor networks. In IEEE international conference on acoustics, speech, and signal processing (Vol. 3, pp. iii-857–iii-860).
 Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
 Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
 Samaria, F., & Harter, A. (1994). Parameterisation of a stochastic model for human face identification. In Proceedings of the second IEEE workshop on applications of computer vision (pp. 138–142).
 Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.
 Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
 Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: From transductive to semi-supervised learning. In L. D. Raedt & S. Wrobel (Eds.), Proceedings of the 22nd international conference on machine learning (pp. 824–831). New York, NY: ACM.
 Spielman, D. A., & Teng, S.-H. (2004). Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In L. Babai (Ed.), Proceedings of the 36th annual ACM symposium on theory of computing. New York, NY: ACM.
 Subramanya, A., & Bilmes, J. A. (2011). Semi-supervised learning with measure propagation. Journal of Machine Learning Research, 12, 3311–3370.
 Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14, pp. 945–952). Cambridge, MA: MIT Press.
 Talukdar, P. P. (2009). Topics in graph construction for semi-supervised learning. Technical Report MS-CIS-09-13, University of Pennsylvania, Department of Computer and Information Science.
 von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.
 von Luxburg, U., Belkin, M., & Bousquet, O. (2008). Consistency of spectral clustering. Annals of Statistics, 36(2), 555–586.
 Wang, F., & Zhang, C. (2008). Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, 20, 55–67.
 Wang, G., Yeung, D.-Y., & Lochovsky, F. H. (2008). A new solution path algorithm in support vector regression. IEEE Transactions on Neural Networks, 19(10), 1753–1767.
 Wang, Y., Jiang, Y., Wu, Y., & Zhou, Z. H. (2011). Spectral clustering on multiple manifolds. IEEE Transactions on Neural Networks, 22(7), 1149–1161.
 Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in neural information processing systems (Vol. 17, pp. 1601–1608). Cambridge, MA: MIT Press.
 Zhang, X., & Lee, W. S. (2007). Hyperparameter learning for graph based semi-supervised learning algorithms. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems (Vol. 19, pp. 1585–1592). Cambridge, MA: MIT Press.
 Zhang, Z., & Zha, H. (2005). Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1), 313–338.
 Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16). Cambridge, MA: MIT Press.
 Zhu, X., Ghahramani, Z., & Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In T. Fawcett & N. Mishra (Eds.), Proceedings of the twentieth international conference on machine learning (pp. 912–919). Menlo Park, CA: AAAI Press.
 Zhu, X., Kandola, J., Ghahramani, Z., & Lafferty, J. (2005). Nonparametric transforms of graph kernels for semi-supervised learning. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 17, pp. 1641–1648). Cambridge, MA: MIT Press.