Model-Change Active Learning in Graph-Based Semi-Supervised Learning

Active learning in semi-supervised classification involves introducing additional labels for unlabeled data to improve the accuracy of the underlying classifier. A challenge is to identify which points to label to best improve performance while limiting the number of new labels. "Model-change" active learning quantifies the resulting change incurred in the classifier by introducing the additional label(s). We pair this idea with graph-based semi-supervised learning methods that use the spectrum of the graph Laplacian matrix, which can be truncated to avoid prohibitively large computational and storage costs. We consider a family of convex loss functions for which the acquisition function can be efficiently approximated using the Laplace approximation of the posterior distribution. We show a variety of multiclass examples that illustrate improved performance over the prior state of the art.

1. Introduction. Supervised machine learning algorithms rely on datasets that contain an abundance of labels (i.e. known classifications) for associated inputs. In many real-world applications, however, unlabeled data is ubiquitous, while obtaining labels for such training data is costly. Semi-supervised learning (SSL) methods leverage unlabeled data to achieve an accurate classification with significantly fewer training points. At the same time, the choice of training points often affects classifier performance, especially in the case of SSL due to the small training set size. Active learning seeks to choose a "useful" training set from which the underlying machine learning algorithm learns, and careful selection of such training data is motivated by the inherent cost to label data in practice. The main challenge in active learning is designing the criterion for selecting which unlabeled points are the most beneficial to label, so that the underlying classifier's performance improves significantly more than it would by labeling randomly selected points. While there are a few different formulations of active learning, we focus on the pool-based active learning paradigm [26] as opposed to online or streaming-based active learning [35]. That is, the active learner has access to a fixed "pool" of unlabeled data points from which it can choose a subset to label.
In pool-based active learning, most methods alternate between (1) training a model given the current labeled data L, {y_j}_{j∈L}, and (2) choosing a set of active learning query points Q in the unlabeled set U according to an acquisition function (also called an active learning criterion); see Figure 1. An "oracle" that has access to the classification of all data then labels points k ∈ Q; oftentimes in application, a domain expert plays the role of the oracle in identifying the true classification of points (i.e. "human-in-the-loop" schemes). We refer to this iterative procedure of alternating between model training and query set selection and labeling as the active learning process. The goal then is to design an acquisition function A : U → R that quantifies how useful labeling a point k ∈ U would be in the active learning process. With a specified acquisition function A, we select the query set Q ⊂ U in the active learning process either sequentially (i.e. one at a time, |Q| = 1) or in a batch (i.e. |Q| = B ∈ N_+).
Figure 1. Active learning process: (1-red) training the underlying SSL classifier with the current labeled set L with labels {y_j}_{j∈L} and (2-grey) selecting query points Q from the unlabeled set U that are subsequently labeled according to the oracle and added to the labeled data L. The process repeats with the updated labeled data, retraining the SSL classifier to prepare for another active learning query and update.
Most active learning acquisition functions belong to one of a few categories: uncertainty [35,17,14], margin [37,2,20], clustering [13,30], and look-ahead [42,9]. Uncertainty-based acquisition functions favor unlabeled points whose classification is "uncertain". Criteria such as entropy [35], least-confident [35], Query by Disagreement (QBD) [11,2] and Bayesian Active Learning by Disagreement (BALD) [17] fall into this category. Closely related to uncertainty-based acquisition functions are margin-based acquisition functions that favor unlabeled points near the decision boundary of the current SSL classifier [37,16,20]. Support Vector Machines (SVM) [38] are amenable to this type of criterion, as the concepts of a margin and a decision boundary are inherent in the model. Clustering-based methods rely on the geometric clustering structure of the input data to guide the active learning query choices. The works of Dasgupta et al. [13], Dasarathy et al. (S2) [12], and Murphy et al. [30] are examples of acquisition functions that specifically exploit the clustering structure of the input data, usually based on a graph topology that captures these geometric relationships.
Look-ahead acquisition functions are a final category that we mention, and are the motivation for the present work. Look-ahead methods leverage the current SSL classifier's state to "look ahead" at what changes would occur in the SSL classifier as a result of labeling an unlabeled point, such as the seminal work of Zhu et al. [42]. Our proposed "model-change" acquisition function as well as the EMCM [9] and Maxi-Min "data-based norm" [22] methods are in this category. More specifically, the proposed model-change acquisition function A(k) of this paper approximates the difference between the current classifier û and the look-ahead classifier û^{+k,ŷ_k} resulting from adding k to the labeled set L with a hypothetical label, ŷ_k.
While many methods of late focus on applying active learning in deep learning architectures [14,24,34,36,1], we focus on graph-based SSL models in this paper. Graph-based SSL models leverage the geometric information from a similarity graph imposed on the set of feature vectors in conjunction with the previously observed labeling information contained in the labeled set, L ⊂ Z. These models allow for straightforward Bayesian probabilistic interpretations that are less clear in the majority of deep learning architectures [14].
Figure 2. Illustration of our model-change active learning calculations inside the flowchart in Figure 1. For unlabeled indices k ∈ U, we compute the 2-norm difference between the current model (classifier) û and each hypothetical "look-ahead" model û^{+k,ŷ_k}. Hypothetical labels ŷ_k are "pseudo-labels" from the current classifier, û. Panel titles: Current Model; Hypothetical "Look-Ahead" Models.

The contributions of this work are to:
• Provide a unifying framework for active learning in graph-based SSL models with convex loss functions more appropriate for classification tasks,
• Present a "model-change" active learning acquisition function built around fast look-ahead approximations previously only performed on interpolation RKHS models [22],
• Apply a spectral truncation to the graph-based SSL models which allows for efficient storage and model-change calculations,
• Demonstrate the speed and efficacy of the approach on both synthetic and real-world datasets, including an application to hyperspectral imagery (HSI).
2. Background. We first introduce the family of graph-based semi-supervised learning (SSL) models that use the model-change acquisition function. Then we discuss the probabilistic counterpart to these graph-based SSL models and how these Bayesian perspectives arise in previous work in active learning.
2.1. Graph-Based SSL Models. Consider the input data X = {x_1, x_2, ..., x_N} ⊂ R^d of N feature vectors with corresponding index set Z = {1, 2, ..., N}. Assume there exists a "ground-truth" classification (labeling) function y† : Z → {1, 2, ..., n_c} that maps each point x_i to exactly one class, identified by the label y†_i ∈ {1, 2, ..., n_c}. Observations y_j of these ground-truth labelings are given for indices j ∈ L ⊂ Z, i.e. the labeled set. The goal of SSL is to infer the ground-truth classifications y†_k for k ∈ U = Z − L, i.e. the unlabeled set. In the binary case, we denote the concatenation of the observed labelings {y_j}_{j∈L} as a vector y ∈ R^{|L|}, whereas in the multiclass case we denote the concatenation of the observed one-hot vectors {y_j}_{j∈L} as a matrix Y ∈ R^{|L|×n_c}.
We construct a similarity graph G(Z, W) with edge weights W_ij = κ(x_i, x_j) ≥ 0 calculated by a similarity kernel κ. For example, the Gaussian kernel,
\[ \kappa(x_i, x_j) = \exp\left(-\|x_i - x_j\|_2^2/\sigma\right), \]
with kernel width parameter σ > 0, is a popular choice. Important geometric information about the data manifold is encoded in graph Laplacian matrices [3,39]. Two such matrices are the unnormalized graph Laplacian matrix L_u = D − W and the normalized graph Laplacian matrix
\[ L = D^{-1/2}(D - W)D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \]
where D is the diagonal degree matrix with entries D_ii = Σ_j W_ij. There are other possible graph Laplacian matrices one could define, but we only consider the normalized graph Laplacian, L. We enforce symmetry in L to ensure that the eigenvalues and eigenvectors of this matrix are real-valued. Graph Laplacian matrices are positive semi-definite, and so to ensure invertibility we consider the perturbation L_τ = L + τ²I with parameter τ > 0. The matrix L_τ is now positive definite and invertible, which also ensures a well-defined Bayesian prior distribution that we present in Subsection 2.2.
Define a continuous-valued function on the nodes of the graph, u : Z → R^{n_c}. This function can be identified with a matrix U ∈ R^{N×n_c}, where the i-th row of U, u_i, corresponds to the inferred classification of node i via a mapping S : R^{n_c} → {1, 2, ..., n_c} that maps the vectors {u_i}_{i∈Z} to the space of possible classes. For example, the mapping S(u_i) = arg max_{c=1,2,...,n_c} [u_i]_c infers the classification of node i as the index of the maximal element of the vector. Other thresholding functions have been explored for providing consistent classifiers [10]. In the special case of binary classification, the matrix U can be represented by a vector u ∈ R^N, with mapping S(u_i) = sgn(u_i) = δ_{u_i ≥ 0}.
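The graph construction described above can be sketched in a few lines of NumPy. This is a minimal dense illustration, not the authors' implementation: the function name `perturbed_normalized_laplacian`, the k-nearest-neighbor sparsification rule (keep an edge if either endpoint selects it), and the kernel form exp(−‖x_i − x_j‖²/σ) are assumptions made for the example.

```python
import numpy as np

def perturbed_normalized_laplacian(X, k=10, sigma=3.0, tau=0.005):
    """Return L_tau = L + tau^2 I for a symmetric kNN Gaussian similarity graph.

    Dense O(N^2) sketch for small N; assumes every node keeps at least one
    neighbor with a nonzero weight (otherwise the degree would be zero).
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / sigma)          # assumed Gaussian kernel form
    np.fill_diagonal(W, 0.0)               # no self-loops
    # Indices of the k nearest neighbors of each node (self excluded).
    nn = np.argsort(sq_dists + np.diag(np.full(N, np.inf)), axis=1)[:, :k]
    keep = np.zeros((N, N), dtype=bool)
    keep[np.arange(N)[:, None], nn] = True
    # Symmetrize: keep an edge if either endpoint selected it.
    W = np.where(keep | keep.T, W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(N) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L + tau**2 * np.eye(N)
```

Because the normalized Laplacian has smallest eigenvalue 0, the smallest eigenvalue of the returned matrix is τ², which is what makes it invertible.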
Graph-based SSL then leverages the graph Laplacian matrix via a quadratic form to regularize the SSL problem in order to encourage similar labels for inputs that lie close to each other on the underlying data manifold [3]. In the binary case, we solve the following optimization problem
\[ \hat{u} = \arg\min_{u \in \mathbb{R}^{N}} J_\ell(u; y), \qquad J_\ell(u; y) := \frac{1}{2}\langle u, L_\tau u\rangle + \sum_{j \in \mathcal{L}} \ell(u_j, y_j), \tag{2.1} \]
and in the multiclass case
\[ \hat{U} = \arg\min_{U \in \mathbb{R}^{N \times n_c}} J_\ell(U; Y), \qquad J_\ell(U; Y) := \frac{1}{2}\langle U, L_\tau U\rangle_F + \sum_{j \in \mathcal{L}} \ell(u_j, y_j), \tag{2.2} \]
where ℓ : R^{n_c} × R^{n_c} → [0, ∞) is a chosen loss function and ⟨·,·⟩_F is the Frobenius inner product. Table 1 lists a family of graph-based SSL models defined by the objective functions of (2.1) and (2.2); the model-change acquisition function of this present work applies to each. Graph-based SSL models historically have been motivated by Gaussian Random Fields (GRF) [41], and the Harmonic Functions (HF) model [42] has been influential in active learning in graph-based SSL. This model can be thought of as using a "hard-constraint" loss function, where ℓ(x, y) = 0 if x = y and +∞ otherwise. Acquisition functions minimizing look-ahead expected risk (MBR [42], TSA [21]), posterior variance (V-Opt [18], Σ-Opt [28]), and other measures of uncertainty [27] come from this model. The authors in [24] even use the HF model to enhance active learning performance within deep learning.

Table 1. Family of graph-based SSL models. We extend the results of MC [31] to be more computationally efficient and allow for multiclass classification.
Cross-Entropy (CE) [23,19] | ℓ(x, y) = −Σ_{c=1}^{n_c} x_c ln(y_c) (see footnote 3) | This work

Footnote 2: Ψ_γ(t) = ∫_{−∞}^{t} ψ_γ(s) ds is the cumulative distribution function (CDF) of a log-concave probability density function (PDF) ψ_γ(s) with parameter γ > 0.
Footnote 3: This loss requires that x, y be discrete probability distributions, which is adapted in Subsection 2.1.1.
2.1.1. Cross-Entropy (CE) Model. As an alternative to regression for the multiclass setting, we show how to incorporate the cross-entropy loss function ℓ(x, y) = −Σ_{c=1}^{n_c} x_c ln(y_c) into the multiclass graph-based SSL framework (2.2). The cross-entropy loss requires that both inputs are probability distribution vectors on the set of possible classes, {1, 2, ..., n_c}. While the observations y_j satisfy this property because of their one-hot form, the rows of an arbitrary U ∈ R^{N×n_c} do not necessarily satisfy this same condition. The entries of U are not even constrained to be non-negative. As such, we apply the "softmax" function on the rows of U to enforce this probability distribution property. Denoting the (j, c)-th entry of U by U_{j,c} and the c-th entry of y_j by [y_j]_c, we have
\[ S(u_j; \gamma) := \frac{1}{\sum_{h=1}^{n_c} e^{U_{j,h}/\gamma}} \left(e^{U_{j,1}/\gamma}, e^{U_{j,2}/\gamma}, \ldots, e^{U_{j,n_c}/\gamma}\right)^T, \qquad \ell\big(y_j, S(u_j; \gamma)\big) = -\sum_{c=1}^{n_c} [y_j]_c \ln\big([S(u_j; \gamma)]_c\big) = -\ln\left(\frac{e^{U_{j,y_j}/\gamma}}{\sum_{h=1}^{n_c} e^{U_{j,h}/\gamma}}\right), \]
recalling that y_j ∈ {1, 2, ..., n_c} is associated with the one-hot vector y_j = e_{y_j} ∈ R^{n_c}. We write the Cross-Entropy model as
\[ \hat{U} = \arg\min_{U \in \mathbb{R}^{N \times n_c}} \frac{1}{2}\langle U, L_\tau U\rangle_F - \sum_{j \in \mathcal{L}} \ln\left(\frac{e^{U_{j,y_j}/\gamma}}{\sum_{h=1}^{n_c} e^{U_{j,h}/\gamma}}\right). \tag{2.3} \]
2.2. Bayesian Interpretation of SSL Problems. These graph-based SSL objective functions lend themselves to a Bayesian probabilistic interpretation, as discussed in prior literature [42,7,15,18,29,32,31]. In the binary case, (2.1) is equivalent to finding the maximum a posteriori (MAP) estimate of a posterior probability distribution whose density P(u|y) relates to the objective function via
\[ P(u \mid y) \propto \exp\big(-J_\ell(u; y)\big) = e^{-\frac{1}{2}\langle u, L_\tau u\rangle}\, e^{-\sum_{j \in \mathcal{L}} \ell(u_j, y_j)} \propto \mu(u)\, e^{-\Phi_\ell(u; y)}, \tag{2.4} \]
where the prior μ(u) is Gaussian, N(0, L_τ^{-1}), and the likelihood, q(y|u) ∝ exp(−Φ_ℓ(u; y)), is defined by the likelihood potential Φ_ℓ(u; y) := Σ_{j∈L} ℓ(u_j, y_j). This prior relates to other graph-based SSL priors proposed in previous literature [15]. The Gaussian prior represents a prior belief over the distribution of functions u on the nodes of the graph, per the geometry of the data captured in the graph Laplacian matrix. The likelihood represents assumptions about the generative model that created the observed labelings y_j from the ground-truth classifications y†_j. Each loss function ℓ for (2.1) defines a different likelihood and consequently a different modeling assumption about the observed labels y (or Y). We note that although the prior μ(u) is Gaussian, the posterior distribution of (2.4) for general loss functions is not necessarily Gaussian.
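As a concrete instance of the MAP interpretation, the sketch below treats the one case in which the posterior is exactly Gaussian: a squared-error loss. It is an illustration under stated assumptions, not the authors' code: the loss scaling ℓ(x, y) = (x − y)²/(2γ²), the function name `gr_map`, and the toy chain graph (which uses an unnormalized Laplacian for brevity) are all choices made for this example.

```python
import numpy as np

def gr_map(L_tau, labeled, y, gamma=0.5):
    """MAP estimate and posterior covariance for the prior N(0, L_tau^{-1})
    with the assumed loss (u_j - y_j)^2 / (2 gamma^2) on labeled indices."""
    N = L_tau.shape[0]
    P = np.zeros((len(labeled), N))
    P[np.arange(len(labeled)), labeled] = 1.0    # row i selects labeled node i
    precision = L_tau + P.T @ P / gamma**2       # Hessian of the objective J
    C = np.linalg.inv(precision)                 # posterior covariance
    u_hat = C @ (P.T @ y) / gamma**2             # posterior mean = minimizer of J
    return u_hat, C

# Toy example: a 5-node chain graph (unnormalized Laplacian, for brevity only).
W = np.diag(np.ones(4), 1)
W = W + W.T
L_tau_toy = np.diag(W.sum(axis=1)) - W + 0.1**2 * np.eye(5)
u_hat_toy, C_toy = gr_map(L_tau_toy, labeled=[0, 4], y=np.array([1.0, -1.0]))
```

For this loss the minimizer of the objective and the mean of the Gaussian posterior coincide, so no approximation is involved; the non-quadratic losses are where a Gaussian approximation of the posterior becomes necessary.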
A key idea of the present work is to approximate the non-Gaussian posterior distributions via suitable Gaussian distributions to exploit the resulting convenient form of what we term the "look-ahead" posterior mean and covariance.This formulation allows us to use loss functions that are arguably more natural for classification than merely the squared-error loss.

2.3. Look-Ahead Model. Recalling the binary graph-based SSL model objective
\[ J_\ell(u; y) = \frac{1}{2}\langle u, L_\tau u\rangle + \sum_{j \in \mathcal{L}} \ell(u_j, y_j), \]
then let the "look-ahead" model objective be
\[ J_\ell^{k,\hat{y}_k}(u; y, \hat{y}_k) := \frac{1}{2}\langle u, L_\tau u\rangle + \sum_{j \in \mathcal{L}} \ell(u_j, y_j) + \ell(u_k, \hat{y}_k) = J_\ell(u; y) + \ell(u_k, \hat{y}_k), \]
where we add the unlabeled index k ∈ U with pseudo-label ŷ_k := sgn(û_k) to the labeled data. Look-ahead models have previously been introduced for designing various acquisition functions (e.g. [22,8,9,42]). We emphasize here that in application, since k ∈ U, we do not have access to the true labeling y†_k, and so this look-ahead model is a hypothetical model. For a given k ∈ U and one-hot encoding ŷ_k ∈ R^{n_c} of the pseudo-label ŷ_k = S(û_k) ∈ {1, 2, ..., n_c}, the multiclass look-ahead model becomes
\[ J_\ell^{k,\hat{y}_k}(U; Y, \hat{y}_k) := \frac{1}{2}\langle U, L_\tau U\rangle_F + \sum_{j \in \mathcal{L}} \ell(u_j, y_j) + \ell(u_k, \hat{y}_k). \]
The present work exploits a key property of the HF, GR, and MGR models: the look-ahead model's posterior mean (i.e. the graph-based SSL model's classifier) and covariance matrix are easily calculated as rank-one updates of the current model's posterior mean and covariance matrix. We discuss this more in Sections 3.2, 3.3, and 3.4.

2.4. Laplace Approximation. Laplace approximation is a popular technique for approximating non-Gaussian distributions with a Gaussian distribution [33]. A given probability distribution, identified by its probability density function (PDF) P(x), can be approximated via the Gaussian distribution
\[ P(x) \approx \mathcal{N}(\hat{x}, \hat{C}), \qquad \hat{x} = \arg\max_{x} P(x), \qquad \hat{C} = \left(-\nabla^2 \ln P(x)\big|_{x = \hat{x}}\right)^{-1}, \]
where x̂ is the maximum a posteriori (MAP) estimator of P and Ĉ is the inverse of the Hessian matrix of the negative log-density of the distribution evaluated at the MAP estimator, x̂. By applying the Laplace approximation to the non-Gaussian posterior distributions of Subsection 2.2, one can approximate look-ahead updates for calculating the model-change acquisition function.
2.5. Batch Active Learning. Choosing the top B maximizers of the current values of {A(k)}_{k∈U} can yield sub-optimal results, as these maximizers are often close to one another in the ambient feature space, in a sense "wasting" precious query budget on redundant information. Some batch active learning methods select query points via a greedy sequential process [18,28,9], wherein the method selects the maximizer k* of the acquisition function one point at a time. Other batch active learning methods restrict the set of indices on which A is evaluated to a smaller candidate set S ⊂ U. These methods select the batch Q ⊂ S to be the top maximizers of the acquisition function [14,24], where S is chosen uniformly at random from U. This has two important and positive consequences. First, evaluating A only on S is computationally faster since |S| ≪ |U|. Second, by selecting S ⊂ U at random, we partially alleviate the problem of "redundant" calculations since the maximizers of A over S are unlikely to all lie close together. We apply this query set selection method to our batch active learning experiments; i.e., we select S ⊂ U uniformly at random and then select the top B maximizers of the acquisition function on S.
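The candidate-set batch selection described above amounts to a random subsample followed by a top-B selection. The sketch below is illustrative only: the function name `select_batch`, the callable `acq_fn` (which maps an array of node indices to acquisition values), and the default candidate fraction are assumptions for the example.

```python
import numpy as np

def select_batch(acq_fn, unlabeled, B=5, frac=0.1, rng=None):
    """Pick a batch Q of size B from a random candidate set S of the unlabeled pool.

    acq_fn: callable taking an array of node indices and returning one
            acquisition value per index (hypothetical interface).
    frac:   fraction of the unlabeled pool placed in the candidate set S.
    """
    rng = np.random.default_rng() if rng is None else rng
    unlabeled = np.asarray(unlabeled)
    # Candidate set size: at least B, at most the whole pool.
    n_cand = min(len(unlabeled), max(B, int(np.ceil(frac * len(unlabeled)))))
    S = rng.choice(unlabeled, size=n_cand, replace=False)
    values = np.asarray(acq_fn(S))          # evaluate A only on S
    top = np.argsort(-values)[:B]           # indices of the B largest values
    return S[top]
```

Evaluating the acquisition function through a callable keeps the cost proportional to |S| rather than |U|, which is the first of the two benefits noted in the text.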
3. Model-Change Acquisition Derivation. We derive the model-change acquisition function for a modified objective function that utilizes only a subset of the eigenvalues and eigenvectors of the graph Laplacian matrix. Subsection 3.1 defines the spectral truncation modification of the family of graph-based SSL objectives. We derive in Subsection 3.2 the model-change acquisition function first for the binary classification models by utilizing the Laplace approximation and a corresponding simple approximate update via Newton's method. We extend this idea to multiclass models in Subsection 3.3 and Subsection 3.4.

3.1. Spectral Truncation. Bayesian-inspired graph-based acquisition functions often require storing a prohibitively large and dense covariance matrix C ∈ R^{N×N} in memory, where N is the size of the dataset. We introduce a modification to the graph-based models to alleviate this burden by using only a subset of the eigenvalues and eigenvectors of the corresponding graph Laplacian matrix L; we refer to this as "spectral truncation".
The eigenvalues of the graph Laplacian matrix L can be ordered 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_N, with the first eigenvalue guaranteed to be 0 by properties of L [39]. The low-lying eigenvalues (i.e. those closer to 0) and their corresponding eigenvectors of L contain important geometric information pertaining to the data. For example, spectral clustering [39] utilizes the eigenvectors corresponding to the first K eigenvalues of L to embed the data into R^K and thereafter perform clustering, often via the K-Means algorithm.
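This work uses the M eigenvalues closest to 0 and their corresponding eigenvectors to obtain approximations of û and Ĉ. Recalling the perturbed graph Laplacian, L_τ = L + τ²I, then by considering only the first M < N eigenvalues of L, define the matrices
\[ \Lambda_\tau := \mathrm{diag}\left(\lambda_1 + \tau^2, \lambda_2 + \tau^2, \ldots, \lambda_M + \tau^2\right) \in \mathbb{R}^{M \times M}, \qquad V := \left[v_1 \; v_2 \; \cdots \; v_M\right] \in \mathbb{R}^{N \times M}, \]
where v_i is the eigenvector corresponding to the i-th eigenvalue λ_i. Replace the graph-based regularization terms of (2.1) and (2.2) to obtain the following:
\[ \tilde{J}_\ell(\alpha; y) := \frac{1}{2}\langle \alpha, \Lambda_\tau \alpha\rangle + \sum_{j \in \mathcal{L}} \ell\left(e_j^T V \alpha, y_j\right), \tag{3.1} \]
\[ \tilde{J}_\ell(A; Y) := \frac{1}{2}\langle A, \Lambda_\tau A\rangle_F + \sum_{j \in \mathcal{L}} \ell\left(A^T V^T e_j, y_j\right). \tag{3.2} \]
The vector α = V^T u (matrix A = V^T U) is the projection of u (U) onto the M corresponding eigenvectors. Since the eigenvectors are orthonormal (V^T V = I ∈ R^{M×M}), α is the vector of coefficients of the representation of u in that eigenbasis for u ∈ span{v_1, ..., v_M}. By restricting to this subspace, we not only speed up model training since the latent space is of (much) smaller dimension (M ≪ N), but we also reduce the spatial complexity of the covariance matrix for the model-change acquisition function calculations.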
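A minimal sketch of the spectral truncation step follows. It uses a dense eigendecomposition for clarity; for large sparse graphs one would presumably use a sparse eigensolver instead (for example `scipy.sparse.linalg.eigsh`). The function name `spectral_truncation` is an assumption for this illustration.

```python
import numpy as np

def spectral_truncation(L, M, tau):
    """Return (Lambda_tau, V) built from the M smallest eigenpairs of a symmetric L.

    Lambda_tau is the M x M diagonal matrix diag(lambda_i + tau^2) and V is the
    N x M matrix whose columns are the corresponding orthonormal eigenvectors.
    """
    evals, evecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    V = evecs[:, :M]
    Lambda_tau = np.diag(evals[:M] + tau**2)
    return Lambda_tau, V

# For any u in span{v_1, ..., v_M}, the coefficients alpha = V^T u satisfy
# u = V alpha and <u, L_tau u> = <alpha, Lambda_tau alpha>.
```

Storing the M × M matrix in place of the N × N one is what reduces the memory footprint of the covariance in the acquisition function calculations.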

3.2. Binary Model.
We first derive the model-change acquisition function on the binary model (3.1) for the GR, Logistic, and Probit models.The derivation for the binary model follows similarly to [31], except now done on this "spectral truncation" modification.

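3.2.1. Laplace Approximation of the Binary Model. The Laplace approximation of the posterior distribution corresponding to (3.1), with respect to the variable α ∈ R^M, gives
\[ P(\alpha \mid y) \approx \mathcal{N}(\hat{\alpha}, \hat{C}_\alpha), \qquad \hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^M} \tilde{J}_\ell(\alpha; y), \qquad \hat{C}_\alpha = \Big(\Lambda_\tau + \sum_{j \in \mathcal{L}} F''\big((v^j)^T \hat{\alpha}, y_j\big)\, v^j (v^j)^T\Big)^{-1}, \]
where v^j := (e_j^T V)^T ∈ R^M is the j-th row of V written as a column vector and F''(x, y) is the second derivative of the loss function ℓ(x, y) with respect to the first variable, following the notation of [15,31]. That is, F'(x, y) := ∂ℓ/∂x (x, y) and F''(x, y) := ∂²ℓ/∂x² (x, y). Ĉ_α is symmetric because Λ_τ is diagonal and the objective function J̃_ℓ(α; y) is differentiable. Recall that the mean of this Gaussian distribution, α̂, is the MAP estimator of the true posterior, which corresponds to the coefficients of the desired graph-based SSL model's classifier û in the eigenbasis represented by V. This Gaussian distribution is now in a form in which one could apply adaptations of acquisition functions like MBR [42], V-Opt [18], and Σ-Opt [28] that were originally defined on Gaussian models (e.g. GR and HF), only now using other convex loss functions besides the squared-error loss.
3.2.2. Look-Ahead Updates. With α̂ denoting the current model's (J̃_ℓ(α; y)) MAP estimator, let α̂^{k,ŷ_k} denote the look-ahead model's (J̃_ℓ^{k,ŷ_k}(α; y, ŷ_k)) MAP estimator. While for most loss functions one cannot directly compute α̂^{k,ŷ_k} as a rank-one update from α̂, we compute an approximation α̃^{k,ŷ_k} ≈ α̂^{k,ŷ_k} by computing a single step of Newton's method on the look-ahead objective J̃_ℓ^{k,ŷ_k}(α; y, ŷ_k), starting with the current MAP estimator α̂. This gives
\[ \begin{aligned} \tilde{\alpha}^{k,\hat{y}_k} &= \hat{\alpha} - \left(\nabla^2 \tilde{J}_\ell^{k,\hat{y}_k}(\hat{\alpha}; y, \hat{y}_k)\right)^{-1} \nabla \tilde{J}_\ell^{k,\hat{y}_k}(\hat{\alpha}; y, \hat{y}_k) \\ &= \hat{\alpha} - F'(\hat{u}_k, \hat{y}_k) \left(\hat{C}_\alpha^{-1} + F''(\hat{u}_k, \hat{y}_k)\, v^k (v^k)^T\right)^{-1} v^k \\ &= \hat{\alpha} - F'(\hat{u}_k, \hat{y}_k) \left(\hat{C}_\alpha - \frac{F''(\hat{u}_k, \hat{y}_k)\, \hat{C}_\alpha v^k (v^k)^T \hat{C}_\alpha}{1 + F''(\hat{u}_k, \hat{y}_k)\, (v^k)^T \hat{C}_\alpha v^k}\right) v^k \\ &= \hat{\alpha} - \frac{F'(\hat{u}_k, \hat{y}_k)}{1 + F''(\hat{u}_k, \hat{y}_k)\, (v^k)^T \hat{C}_\alpha v^k}\, \hat{C}_\alpha v^k, \end{aligned} \tag{3.3} \]
where in the second to last line we have used (A.1), û_k = (v^k)^T α̂, and v^k := (e_k^T V)^T is the k-th row of V as a column vector in R^M. Note that in the second line, α̂ satisfies ∇J̃_ℓ(α̂; y) = Λ_τ α̂ + Σ_{j∈L} F'((v^j)^T α̂, y_j) v^j = 0 since it is the minimizer of J̃_ℓ(α; y).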
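To illustrate the single Newton step, the sketch below fits a MAP estimator in the truncated eigenbasis and then applies the rank-one look-ahead formula. It assumes a logistic loss of the form ℓ(x, y) = ln(1 + e^{−xy/γ}) with y ∈ {−1, +1}; this form, the damped Newton solver, and all function names are assumptions made for the example rather than details taken from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dloss(x, y, gamma):
    """F'(x, y) for the assumed logistic loss ln(1 + exp(-x*y/gamma))."""
    return -(y / gamma) * sigmoid(-x * y / gamma)

def d2loss(x, y, gamma):
    """F''(x, y) for the same loss (uses y^2 = 1)."""
    s = sigmoid(x * y / gamma)
    return s * (1.0 - s) / gamma**2

def objective(alpha, Lam, Vl, y, gamma):
    z = (Vl @ alpha) * y / gamma
    return 0.5 * alpha @ (Lam @ alpha) + np.sum(np.logaddexp(0.0, -z))

def fit_map(Lam, V, labeled, y, gamma, iters=100):
    """Damped Newton iteration for the truncated objective; returns (alpha_hat, C_hat)."""
    Vl = V[labeled]                                   # rows v^j for labeled j
    alpha = np.zeros(V.shape[1])
    for _ in range(iters):
        x = Vl @ alpha
        grad = Lam @ alpha + Vl.T @ dloss(x, y, gamma)
        if np.linalg.norm(grad) < 1e-12:
            break
        H = Lam + Vl.T @ (d2loss(x, y, gamma)[:, None] * Vl)
        step = np.linalg.solve(H, grad)
        t = 1.0                                       # halve the step if the objective rises
        while (objective(alpha - t * step, Lam, Vl, y, gamma)
               > objective(alpha, Lam, Vl, y, gamma) + 1e-12) and t > 1e-8:
            t *= 0.5
        alpha = alpha - t * step
    x = Vl @ alpha
    H = Lam + Vl.T @ (d2loss(x, y, gamma)[:, None] * Vl)  # Hessian at the final iterate
    return alpha, np.linalg.inv(H)

def look_ahead_update(alpha_hat, C, v_k, y_k, gamma):
    """One Newton step on the look-ahead objective, written as a rank-one update."""
    u_k = v_k @ alpha_hat
    Cv = C @ v_k
    scale = dloss(u_k, y_k, gamma) / (1.0 + d2loss(u_k, y_k, gamma) * (v_k @ Cv))
    return alpha_hat - scale * Cv
```

The rank-one form avoids re-solving an M × M system for every candidate k: each candidate costs one matrix-vector product with the stored covariance.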

3.2.3. Model-Change (MC) Acquisition Function. We now define the acquisition function that we term "Model-Change" (MC). This criterion quantifies how much the underlying graph-based SSL classifier changes as a result of adding a node k ∈ U and hypothesized label ŷ_k; that is, we measure how much the model classifier, û = V α̂, would change under the look-ahead model, û^{k,ŷ_k} = V α̂^{k,ŷ_k}. We approximate this model change via the update α̃^{k,ŷ_k} of (3.3). Previous works indicate that calculating the approximate change in a model (classifier) from the addition of an index k and associated pseudo-label ŷ_k is an effective criterion for active learning [31,9,22]. The present work extends the MC method in [31] and resembles the Maxi-Min method of [22], wherein the authors investigate active learning in kernel-based interpolation. While exact look-ahead calculations are possible in [22], the current work allows for a broader range of methods (Table 1) by approximating look-ahead calculations via (3.3).
Employing (3.3), we propose the following MC acquisition function
\[ \mathcal{A}(k) := \left\| V \tilde{\alpha}^{k,\hat{y}_k} - V \hat{\alpha} \right\|_2 = \left\| \tilde{\alpha}^{k,\hat{y}_k} - \hat{\alpha} \right\|_2 = \frac{\left|F'(\hat{u}_k, \hat{y}_k)\right|}{1 + F''(\hat{u}_k, \hat{y}_k)\, (v^k)^T \hat{C}_\alpha v^k} \left\| \hat{C}_\alpha v^k \right\|_2, \]
since the columns of V are orthonormal and û_k = (v^k)^T α̂. Recall that ŷ_k is the current "pseudo-label" for node k as given by the current classifier û. We can write this acquisition function explicitly for each considered binary classification model by substituting the corresponding derivatives F' and F'' of its loss function, where û_k ŷ_k = û_k sgn(û_k) = |û_k|. Note that in the GR model case, this notion of "model-change" as calculated is exact; the value A^{GR}(k) is exactly how much the underlying GR classifier would change by selecting k with the label ŷ_k = sgn(û_k).
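The exactness of the model-change value in the GR case can be checked numerically. The sketch below evaluates the acquisition function for all unlabeled nodes at once and can be compared against explicit retraining. It assumes the squared-error loss ℓ(x, y) = (x − y)²/(2γ²), so that F' = (x − y)/γ² and F'' = 1/γ²; this scaling and the names `gr_fit` and `mc_gr` are assumptions for the illustration.

```python
import numpy as np

def gr_fit(Lam, V, labeled, y, gamma):
    """Closed-form minimizer and covariance of the truncated GR objective."""
    Vl = V[labeled]
    C = np.linalg.inv(Lam + Vl.T @ Vl / gamma**2)
    return C @ (Vl.T @ y) / gamma**2, C

def mc_gr(Lam, V, labeled, y, unlabeled, gamma=0.5):
    """Model-change values for every unlabeled node under the assumed GR loss."""
    alpha, C = gr_fit(Lam, V, labeled, y, gamma)
    Vu = V[unlabeled]                        # rows v^k for unlabeled k
    u = Vu @ alpha                           # current classifier values u_k
    y_pseudo = np.where(u >= 0, 1.0, -1.0)   # pseudo-labels sgn(u_k)
    CV = Vu @ C                              # row k holds (C v^k)^T, since C is symmetric
    quad = np.sum(CV * Vu, axis=1)           # (v^k)^T C v^k
    acq = np.abs(u - y_pseudo) * np.linalg.norm(CV, axis=1) / (gamma**2 + quad)
    return acq, y_pseudo
```

Computing all values from one product `Vu @ C` keeps the cost at O(|U| M²), with no per-candidate retraining.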

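3.3. Multiclass Gaussian Regression. We now apply the derivation of the acquisition function to the MGR case. Recalling (3.2), the gradient and Hessian of the MGR objective are available in closed form, and because the squared-error loss is quadratic in A, the resulting posterior distribution is Gaussian with mean Â and covariance Ĉ_α.
The posterior mean update (which is exact for MGR) is a rank-one update of Â. With the pseudo-label ŷ_k = arg max_{c=1,2,...,n_c} Û_{k,c} and corresponding one-hot encoding ŷ_k ∈ R^{n_c}, the MC acquisition function becomes (using the relation U = VA)
\[ \mathcal{A}^{MGR}(k) := \left\| \hat{U}^{k,\hat{y}_k} - \hat{U} \right\|_F = \left\| \hat{A}^{k,\hat{y}_k} - \hat{A} \right\|_F, \]
where we have used the orthonormality of the columns of V and (A.2).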
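A multiclass analogue of the rank-one computation can be sketched as follows. It assumes the MGR loss ℓ(x, y) = ‖x − y‖²/(2γ²) and measures model change in the Frobenius norm; under those assumptions the change is the outer product of Ĉ_α v^k and (û_k − ŷ_k) scaled by 1/(γ² + (v^k)^T Ĉ_α v^k), whose Frobenius norm factors into a product of two vector norms. The loss scaling and the names `mgr_fit` and `mc_mgr` are assumptions for this illustration.

```python
import numpy as np

def mgr_fit(Lam, V, labeled, Y, gamma):
    """Closed-form minimizer (M x n_c) and covariance of the truncated MGR objective."""
    Vl = V[labeled]
    C = np.linalg.inv(Lam + Vl.T @ Vl / gamma**2)
    return C @ (Vl.T @ Y) / gamma**2, C

def mc_mgr(Lam, V, labeled, Y, unlabeled, gamma=0.1):
    """Model-change values for every unlabeled node under the assumed MGR loss."""
    A_hat, C = mgr_fit(Lam, V, labeled, Y, gamma)
    Vu = V[unlabeled]
    U = Vu @ A_hat                                        # rows are current outputs u_k
    Y_pseudo = np.eye(Y.shape[1])[np.argmax(U, axis=1)]   # one-hot pseudo-labels
    CV = Vu @ C                                           # row k holds (C v^k)^T
    quad = np.sum(CV * Vu, axis=1)                        # (v^k)^T C v^k
    # || C v^k (u_k - y_k)^T ||_F = ||C v^k||_2 * ||u_k - y_k||_2
    acq = (np.linalg.norm(CV, axis=1) * np.linalg.norm(U - Y_pseudo, axis=1)
           / (gamma**2 + quad))
    return acq, Y_pseudo
```

As in the binary GR case, the values can be compared against explicit retraining with the pseudo-labeled node added.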

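3.4. Cross-Entropy Model. The softmax function in the Cross-Entropy (CE) model of Subsection 2.1.1 introduces dependencies between the columns of U. For ease in calculations, let the vector 𝐮 ∈ R^{N n_c} be the concatenation of the columns of U. Likewise, define 𝛂 ∈ R^{M n_c} to be the concatenation of the columns of the matrix A. Further, define the block-diagonal matrices 𝐕 := diag(V, V, ..., V) ∈ R^{N n_c × M n_c} and 𝚲_τ := diag(Λ_τ, Λ_τ, ..., Λ_τ) ∈ R^{M n_c × M n_c}. We also define P_j ∈ R^{n_c × N n_c} to be the projection matrix that picks out the indices in 𝐮 ∈ R^{N n_c} corresponding to node j; i.e., selecting the j-th row of the related matrix U ∈ R^{N×n_c}. Then 𝐮 = 𝐕𝛂, and with ê_i denoting the i-th standard basis vector in R^{N n_c}, the spectral truncation CE objective (2.3) can be written as
\[ \tilde{J}_{CE}(\boldsymbol{\alpha}; Y) := \frac{1}{2}\langle \boldsymbol{\alpha}, \boldsymbol{\Lambda}_\tau \boldsymbol{\alpha}\rangle - \sum_{j \in \mathcal{L}} \ln\left(\frac{e^{\langle \boldsymbol{\alpha}, \mathbf{V}^T \hat{e}_{j + (y_j - 1)N}\rangle/\gamma}}{\sum_{h=1}^{n_c} e^{\langle \boldsymbol{\alpha}, \mathbf{V}^T \hat{e}_{j + (h-1)N}\rangle/\gamma}}\right), \tag{3.4} \]
where 𝐮 = 𝐕𝛂 and 𝐕^T 𝐕 = I_{M n_c} by orthonormality of the eigenvectors. Defining
\[ \pi_c^j := \frac{e^{\langle \boldsymbol{\alpha}, \mathbf{V}^T \hat{e}_{j + (c-1)N}\rangle/\gamma}}{\sum_{h=1}^{n_c} e^{\langle \boldsymbol{\alpha}, \mathbf{V}^T \hat{e}_{j + (h-1)N}\rangle/\gamma}} \]
and π^j := (π^j_1, ..., π^j_{n_c})^T ∈ R^{n_c}, the gradient and Hessian of (3.4) are
\[ \nabla \tilde{J}_{CE}(\boldsymbol{\alpha}; Y) = \boldsymbol{\Lambda}_\tau \boldsymbol{\alpha} + \frac{1}{\gamma} \sum_{j \in \mathcal{L}} \mathbf{V}^T P_j^T \left(\pi^j - y_j\right), \qquad \nabla^2 \tilde{J}_{CE}(\boldsymbol{\alpha}; Y) = \boldsymbol{\Lambda}_\tau + \frac{1}{\gamma^2} \sum_{j \in \mathcal{L}} \mathbf{V}^T P_j^T \left(\mathrm{diag}(\pi^j) - \pi^j (\pi^j)^T\right) P_j \mathbf{V}, \]
where we refer the reader to the supplementary material (Appendix B) for full calculation details. The Laplace approximation for the CE model yields
\[ P(\boldsymbol{\alpha} \mid Y) \approx \mathcal{N}\left(\hat{\boldsymbol{\alpha}}, \hat{C}_{\hat{\boldsymbol{\alpha}}}\right), \qquad \hat{C}_{\hat{\boldsymbol{\alpha}}} = \left(\nabla^2 \tilde{J}_{CE}(\hat{\boldsymbol{\alpha}}; Y)\right)^{-1}, \]
where we emphasize the dependence of C_𝛂 on the variable 𝛂, specifically taking the value C_𝛂 at the MAP estimator 𝛂̂. The inverse, a matrix in R^{M n_c × M n_c}, is not prohibitively costly to compute because of its restricted size. Referring to the calculations in the supplementary material (Appendix B), the look-ahead update is again a single Newton step on the look-ahead objective starting from 𝛂̂. A Cholesky decomposition of the positive semi-definite matrix B_k = T_k^T T_k that arises in this step, together with two applications of (A.1), enables an efficient approximation 𝛂̃^{k,ŷ_k} of the look-ahead MAP estimator. With pseudo-label ŷ_k and corresponding one-hot encoding ŷ_k, the MC acquisition function for the CE spectral truncation modification becomes
\[ \mathcal{A}^{CE}(k) := \left\| \tilde{\boldsymbol{\alpha}}^{k,\hat{y}_k} - \hat{\boldsymbol{\alpha}} \right\|_2. \]
This is efficient to compute because calculating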

4. Experiments & Numerics. This section presents results for the Model-Change (MC) acquisition function compared to other active learning acquisition functions in the various graph-based SSL models of Table 1. We reference each acquisition function in the form [abbreviation of acquisition function]-[abbreviation of underlying model]; for example, MC-GR denotes the MC acquisition function in the Gaussian Regression (GR) model. The acquisition function acronyms are: MC (Model-Change), UNC (Uncertainty [35]), RAND (Random), VOPT (V-Opt [21]), and SOPT (Σ-Opt [28]). The underlying models considered are GR (Gaussian Regression, 3.2.3), MGR (Multiclass Gaussian Regression, 3.3), LOG (Logistic, 3.2.3), P (Probit, 3.2.3), and CE (Cross-Entropy, 3.4), all in the spectral truncation form of 3.1.
Figure 3. Each plot shows the ground-truth classification (red/blue) for the Binary-Clusters dataset as well as the active learning choices (yellow stars) for an assortment of acquisition functions. UNC-LOG selects points that are between only some squares, while VOPT-HF selects points evenly spread out over the whole domain. In either case, overall performance is suboptimal compared to the DB-RKHS and MC-GR methods, which select points located in each of the squares as well as between squares.
We showcase the method on a synthetic dataset (Binary-Clusters) as well as three real-world datasets: MNIST [25] and two hyperspectral imagery (HSI) datasets, Salinas A and Urban. On the binary tests, we include comparisons with the data-based norm acquisition function (DB-RKHS) [22] as well as the original V-Opt [21] and Σ-Opt [28] methods in the Harmonic Functions model [41], annotated as VOPT-HF and SOPT-HF in the plots below. We calculate the average accuracies over five trials according to the underlying SSL classifier of the acquisition function; that is, for the choices from the MC-GR acquisition function, we report the accuracies in the GR model. We straightforwardly adapt the V-Opt and Σ-Opt methods for the MGR model to allow for comparison on the multiclass tests; see Appendix D.
All but one of the tests run in the "batch" mode of active learning, wherein for simplicity we select B = 5 points for the query set Q per active learning iteration.Per the discussion of Subsection 2.5, at each iteration the candidate set S ⊂ U contains 10% of the total points in U, sampled uniformly at random.The query set comprises the top B = 5 maximizers of the active learning acquisition function on S. While an interesting question, this paper does not explore varying the batch size nor the candidate set size; we leave this for future work.

4.1. Graph Construction Settings. We construct similarity graphs using shared parameters across the different datasets. For the non-hyperspectral datasets (Binary-Clusters and MNIST), the similarity graph contains 10 nearest neighbors with weights w_ij given by the Gaussian similarity kernel κ(x_i, x_j) = exp(−‖x_i − x_j‖²_2/σ), with σ = 3. For the hyperspectral datasets (Salinas A and Urban), the similarity graph is constructed using 15 nearest neighbors with weights w_ij given by the cosine similarity κ(x_i, x_j) = ⟨x_i, x_j⟩/(‖x_i‖_2 ‖x_j‖_2). As is common in similarity graph construction, we employ Zelnik-Perona scaling [40], but only for the non-hyperspectral datasets. Due to the sparse nature of these similarity graphs, the M = 50 lowest eigenvalues of the graph's normalized Laplacian matrix are calculated with standard sparse eigensolvers. In the binary experiments, the eigenvalue perturbation τ (Section 3) is set to τ = 0.001, while τ = 0.005 for the multiclass experiments. For the binary experiments (GR, LOG, and P), the loss parameter is set to γ = 0.5. We use the reported settings of h = 0.1 in the RKHS model [22] and δ = 0.01 in the HF model [42] in the corresponding experiments. For the multiclass experiments, γ = 0.1 and γ = 0.5 in the MGR and CE models, respectively.
4.2. Binary-Clusters. Binary-Clusters is a synthetic dataset we created consisting of various clusters of differing sizes, locations, and spreads. Figure 3a shows the ground-truth classification of these clusters. Our code includes a function, create_binary_clusters(), for recreating this particular dataset. We run two tests: one that selects 100 query points sequentially (B = 1 per iteration) and the other that selects 100 query points in batches (B = 5 per iteration). Both tests begin with only 2 initially labeled points, one in each class. Successful active learning on this dataset requires "exploring" the many different regions (squares) as well as "exploiting" the true decision boundaries between adjacent squares efficiently while the underlying classifier improves.

4.3. MNIST. The MNIST [25] dataset contains 70,000 grayscale 28 × 28 pixel images of handwritten digits (0-9). We represent each image as a 784-dimensional vector x_i and normalize the pixel values to range from 0 to 1. Each trial begins with 20 labeled points (i.e. ≈ 0.03% of the total, 2 points per class) and then selects 500 active learning points in batches of size B = 5. We consider the graph containing the full 70,000 points in the MNIST dataset and calculate the accuracy over the unlabeled set at each iteration, not a held-out "testing set". This is more relevant to the SSL framework, as opposed to supervised learning.

4.4. Salinas A Hyperspectral Imagery Dataset. The Salinas A dataset is a common Hyperspectral Imagery (HSI) dataset containing 7,138 total pixels in an 83 × 86 image in d = 224 wavelengths. This is an image of Salinas, USA taken with the AVIRIS sensor; this image contains 6 classes of plant types arranged in a diagonal pattern (see Figure 7a). The goal is to classify the pixels x_i ∈ R^d in the image into material classes based on the samples from the d different wavelengths. The "ground-truth" classification is shown in Figure 7a. Each trial begins with two initially labeled points per class and then selects 500 active learning points in batches of size B = 5.
Figure 4. VOPT-HF and SOPT-HF achieve a quick initial accuracy increase in the sequential test (4a), but level off at a lower accuracy than MC-GR, MC-P, and DB-RKHS; the HF and RKHS acquisition functions are, however, more costly to compute (Figure 8).

4.5. Urban Hyperspectral Imagery Dataset. The Urban dataset is another common HSI dataset which contains 94,249 total pixels in a 307 × 307 image where each pixel represents a 2 × 2 m² area. The original hyperspectral image contains 210 wavelengths ranging from 400 nm to 2,500 nm, resulting in a spectral resolution of 10 nm. As is typical in HSI, atmospheric effects and dense water vapor contaminate a number of the wavelengths, and so we consider only the remaining 162 wavelengths. Pixels belong to one of six categories: asphalt, grass, tree, roof, metal, and dirt. Figure 7b shows the "ground-truth" classification. Each trial begins with two initially labeled points per class and then selects 500 active learning points in batches of size B = 5.
For an acquisition function to be useful in the active learning process, its accuracy curve should both (1) initially increase rapidly compared to other methods and (2) not level off at a lower final accuracy. In the binary classification tests, i.e. the sequential and batch tests on the Binary-Clusters dataset (Figures 4a and 4b), the VOPT and SOPT acquisition functions in the HF model perform well early on, but level off at a lower overall accuracy. By comparison, the MC accuracy curves increase slightly more slowly in the beginning, but achieve a higher overall accuracy than VOPT and SOPT. The DB-RKHS method performs very well in both tests, but we note that this model and its associated acquisition function are more costly to compute than our spectral-truncation models; see Figure 8. All methods perform better than uncertainty sampling (UNC-LOG) in both tests, which is especially slow to increase the accuracy early on in the sequential test.
Figure 3 shows the distribution of active learning choices for a few of the considered methods in the binary sequential test, giving a glimpse of the empirical characteristics of each acquisition function's choices. Note that the VOPT-HF choices (3c) are spread nearly evenly over the whole unit-square domain, while the MC-GR (3d) and DB-RKHS (3e) choices include points in every square, but also contain a higher concentration of choices along the boundaries between the squares. The MC-GR and DB-RKHS methods not only explore the extent of the domain of the dataset, but also exploit classification information by selecting points along the boundaries between the clusters throughout the whole domain. In contrast, the VOPT-HF method selects points evenly spread out over the whole domain, which arguably helps this method achieve a beneficial increase early on in the active learning process, but it does not transition to exploiting known classification information along the decision boundaries. UNC-LOG (3b) in this run chooses points that lie between the long, tall blue cluster on the right and the top-right red cluster, while ignoring various other clusters in the dataset; in a sense, UNC-LOG exploits the known classification information without sufficiently exploring the extent of the dataset's domain.
While the DB-RKHS method is very similar in flavor to our MC acquisition function, it is more computationally expensive both in model training and in acquisition function evaluation (see Figure 8) because of dense similarity kernel computations. Likewise, the HF model requires the inversion of a large submatrix of the graph Laplacian, which is undesirable for scaling to larger problems. By restricting our underlying classifier to the span of only a subset of the eigenvectors of the graph Laplacian L (with their corresponding eigenvalues), we achieve faster model training and acquisition function evaluation. Despite this significant model compression, the present work is competitive with these more costly methods and models.
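The spectral truncation described above can be sketched numerically as follows. This is only an illustrative sketch, not the authors' implementation: it assumes an unnormalized graph Laplacian L = D − W on a toy path graph, uses a dense eigensolver, and the names `truncated_spectrum`, `W`, `N`, and `M` are hypothetical. A large dataset would instead call for a sparse or approximate eigensolver.

```python
import numpy as np

def truncated_spectrum(W, M):
    """Return the M smallest eigenvalues and eigenvectors of L = D - W.

    Assumption: W is a symmetric, non-negative weight matrix and the
    unnormalized Laplacian is used; other normalizations are possible.
    """
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return evals[:M], evecs[:, :M]

# Toy data (hypothetical): a path graph on N = 6 nodes, truncated to M = 3.
N, M = 6, 3
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

lam, V = truncated_spectrum(W, M)
# V has shape (N, M): the classifier is then parameterized by M coefficients
# per class instead of N nodal values.
```

Storing `V` costs N × M numbers rather than the N × N entries of a dense matrix inverse, which is the source of the savings discussed in the text.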
For all the multiclass tests (Figures 5 and 6), the MC-CE acquisition function performs the best at both increasing the underlying model's accuracy early on and obtaining the highest accuracy overall in the active learning process. The CE classifier initially has lower accuracy than the MGR classifier in the MNIST and Urban tests, but quickly achieves a higher accuracy. While one can remedy the low initial accuracy of the CE model in practice by hyperparameter tuning, we set the hyperparameters to be consistent across the shown datasets so as to showcase the efficacy of the acquisition function regardless of hyperparameter tuning. Further, the aim of the active learning process is to iteratively choose subsets of points to improve the underlying classifier and ultimately achieve the highest accuracy under the chosen model, not to design the optimal underlying classifier in the presence of few labeled data.
We conclude this discussion with a note about the scalability of the acquisition function evaluations. Figure 8 shows the average time per active learning iteration to calculate the acquisition function on the candidate set, for increasing dataset size N. Solid lines present timing results for binary models, while dashed lines present timing results for multiclass models. We exclude the SOPT results as they are indistinguishable from the VOPT results.
The family of binary VOPT and MC acquisition functions scale similarly, though the P (Probit) model's MC acquisition function has a significant overhead cost due to the repeated PDF and CDF calculations required for evaluating F and F′ (see Subsection 3.2). In contrast, the DB-RKHS and VOPT-HF acquisition functions scale noticeably worse. The multiclass acquisition functions all scale similarly to each other, though the MC-CE method has greater overhead cost. The size of the posterior covariance matrix (i.e. the inverse Hessian used in look-ahead calculations) for the MGR model ($C_{\hat{A}} \in \mathbb{R}^{M \times M}$) compared to the CE model ($C_{\hat{\alpha}} \in \mathbb{R}^{M n_c \times M n_c}$) explains this disparity. We conclude that the MC acquisition function adapted to the spectral truncation form of graph-based SSL provides both a scalable and effective method for active learning.
Here we present the details of the gradient and Hessian calculations for the spectral truncation Cross-Entropy Model of Subsection 3.4.
With $e_j \in \mathbb{R}^N$, we define […], so that we can compute the Hessian […]. Now turning to the look-ahead objective, we likewise introduce […], and then the look-ahead model's gradient and Hessian become […], where we have defined […]. With the Cholesky decomposition of the positive semi-definite […], where we have applied the Woodbury Identity (A.1) twice.
[…] $e^{\langle \alpha, V^T \hat{e}_{j+(h-1)N} \rangle / \gamma}$, so that by Section B the Hessian is […]. With the positive definiteness of the diagonal eigenvalue matrix $\Lambda_\tau$, we just need to show that the matrix $\nabla^2 \Phi(\alpha; Y) := V^T \left( D_L(\alpha) - \Pi_L(\alpha) \Pi_L^T(\alpha) \right) V$ is positive semi-definite. We show that $\nabla^2 \Phi(\alpha; Y) \in \mathbb{R}^{N n_c \times N n_c}$ is positive semi-definite by showing that it is symmetric, has non-negative diagonal entries, and is diagonally dominant. This matrix is clearly symmetric, so we turn our attention to the other properties.
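The three properties named above (symmetry, non-negative diagonal, diagonal dominance) can be checked numerically on a single softmax block. The following is a sketch under a stated assumption: since the definitions of $D_L$ and $\Pi_L$ are not reproduced here, it uses the standard cross-entropy Hessian block diag(p) − p pᵀ for one labeled point, where p is a softmax probability vector. The function names and the test vector are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def ce_hessian_block(z):
    """Assumed per-point block diag(p) - p p^T of the cross-entropy Hessian."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

H = ce_hessian_block(np.array([0.3, -1.2, 2.0, 0.5]))  # arbitrary logits

is_symmetric = np.allclose(H, H.T)
diag_nonneg = bool(np.all(np.diag(H) >= 0))
# Row i has diagonal p_i (1 - p_i) and off-diagonal absolute sum p_i (1 - p_i),
# so dominance holds with equality; a small tolerance absorbs round-off.
off_diag_sums = np.abs(H).sum(axis=1) - np.abs(np.diag(H))
diag_dominant = bool(np.all(np.diag(H) >= off_diag_sums - 1e-12))
min_eig = np.linalg.eigvalsh(H).min()  # expected to be (numerically) zero
```

A symmetric, diagonally dominant matrix with non-negative diagonal is positive semi-definite, which is consistent with the smallest eigenvalue here being zero up to round-off.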
We briefly explain how we can adapt the V-Opt [18] and Σ-Opt [28] methods to fit into this spectral truncation framework (specifically the GR model), as we use them to compare against the MC acquisition function. Recall that these acquisition functions were originally derived on the Harmonic Functions model, with covariance matrix $C_{HF}$ […], where we note that both are functions of the look-ahead model's covariance matrix. The V-Opt criterion comes from applying the trace ($\mathrm{Tr}[\cdot]$) to the look-ahead covariance $C_{HF}^{+k,\hat{y}_k}$, while the Σ-Opt criterion comes from applying what is called the survey risk ($\langle \mathbf{1}, \cdot\, \mathbf{1} \rangle$) [28] to the look-ahead posterior covariance. Both of these methods are motivated by Bayesian optimal experimental design [35, ?, ?], which in the active learning context reduces to selecting unlabeled points that minimize these functions of the look-ahead covariance matrix. Once simplified, the acquisition functions of (D.1) are then in a form to be maximized.
We apply similar functions to the spectral truncation modification's corresponding look-ahead posterior covariance matrix for the GR model. Recall $\alpha \mid y \sim \mathcal{N}(\hat{\alpha}, C_{\hat{\alpha}})$, where $C_{\hat{\alpha}} = \left( \Lambda_\tau + \frac{1}{\gamma^2} V^T P^T P V \right)^{-1}$ and $\hat{\alpha} = \frac{1}{\gamma^2} C_{\hat{\alpha}} V^T P^T y$, so that we have $u \sim \mathcal{N}(V \hat{\alpha}, V C_{\hat{\alpha}} V^T)$. We compute the V-Opt and Σ-Opt acquisition functions as […], where we have used the orthonormality of the columns of V. Now, as both the V-Opt and Σ-Opt acquisition functions were originally formulated to minimize these functions (respectively $\mathrm{Tr}[\cdot]$ and $\langle \mathbf{1}, \cdot\, \mathbf{1} \rangle$) of the posterior covariance matrix, we can rewrite these modified acquisition functions in the maximizing paradigm (D.2)
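To make the GR posterior and the look-ahead covariance concrete, the following sketch computes $\hat{\alpha}$ and $C_{\hat{\alpha}}$ from the formulas recalled above, then scores candidates by the reduction in trace (V-Opt) and in survey risk (Σ-Opt). The rank-one look-ahead update via the Sherman-Morrison identity and the resulting "reduction" scores are an assumption of this sketch and may differ in form from (D.2). All function names, the random orthonormal `V`, and the values of `lam_tau`, `gamma`, and `y` are hypothetical.

```python
import numpy as np

def gr_posterior(lam_tau, V, labeled, y, gamma):
    """Posterior mean and covariance of alpha for the GR model."""
    VL = V[labeled, :]                                  # P V: labeled rows of V
    precision = np.diag(lam_tau) + VL.T @ VL / gamma**2
    C = np.linalg.inv(precision)
    alpha_hat = C @ VL.T @ y / gamma**2
    return alpha_hat, C

def lookahead_cov(C, v_k, gamma):
    """Covariance after hypothetically labeling the node whose row of V is v_k
    (Sherman-Morrison rank-one update; independent of the label in the GR model)."""
    Cv = C @ v_k
    return C - np.outer(Cv, Cv) / (gamma**2 + v_k @ Cv)

def vopt_sopt_scores(C, V, candidates, gamma):
    """Reduction in Tr[.] and in <1, . 1> of the covariance of u = V alpha."""
    w = V.T @ np.ones(V.shape[0])                       # V^T 1
    vopt, sopt = [], []
    for k in candidates:
        Cv = C @ V[k, :]
        denom = gamma**2 + V[k, :] @ Cv
        vopt.append((Cv @ Cv) / denom)                  # uses V^T V = I
        sopt.append((w @ Cv) ** 2 / denom)
    return np.array(vopt), np.array(sopt)

# Toy problem (hypothetical data): orthonormal columns stand in for eigenvectors.
rng = np.random.default_rng(0)
N, M, gamma = 8, 3, 0.5
V, _ = np.linalg.qr(rng.standard_normal((N, M)))
lam_tau = np.array([0.1, 0.6, 1.3])
labeled, y = [0, 5], np.array([1.0, -1.0])

alpha_hat, C = gr_posterior(lam_tau, V, labeled, y, gamma)
candidates = [k for k in range(N) if k not in labeled]
vopt, sopt = vopt_sopt_scores(C, V, candidates, gamma)
k_vopt = candidates[int(np.argmax(vopt))]               # maximizing paradigm
```

Each score only needs the M × M matrix `C` and one row of `V`, so no N × N matrix is ever formed.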

Figure 1. Active Learning Flowchart. Alternate between (1-red) training the underlying SSL classifier with the current labeled set L with labels $\{y_j\}_{j \in L}$ and (2-grey) selecting query points Q from the unlabeled set (U) that are subsequently labeled according to the oracle and added to the labeled data L. The process repeats with the updated labeled data, retraining the SSL classifier to prepare for another active learning query and update.

2.5. Active Learning Query Set Selection. With acquisition function A(k) in hand, one must select the query set Q ⊂ U from these acquisition values. Sequential active learning selects the maximizer $k^* = \arg\max_{k \in U} A(k)$ of the acquisition function (i.e. |Q| = 1). In batch active learning (|Q| = B > 1), there is an added difficulty of how to best choose this subset.
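The two selection regimes can be sketched as follows. The sequential rule is the argmax stated above; the batch rule shown is only the naive choice of the B largest acquisition values, included as an assumed baseline rather than as the method used in this paper. The function names and the toy acquisition values are hypothetical.

```python
def select_sequential(acq, unlabeled):
    """Return k* = argmax over the unlabeled indices of A(k), i.e. |Q| = 1."""
    return max(unlabeled, key=lambda k: acq[k])

def select_top_batch(acq, unlabeled, B):
    """Naive batch rule (assumption): the B indices with the largest A(k)."""
    return sorted(unlabeled, key=lambda k: acq[k], reverse=True)[:B]

# Toy acquisition values A(k) for four unlabeled indices (hypothetical data).
acq = {3: 0.2, 7: 0.9, 8: 0.5, 11: 0.7}
unlabeled = [3, 7, 8, 11]

k_star = select_sequential(acq, unlabeled)
batch = select_top_batch(acq, unlabeled, B=2)
```

The naive rule ignores redundancy among the chosen points: two near-duplicate points with high acquisition values would both be selected, which is one form of the added difficulty noted for the batch setting.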

[…] $k_1^* = \arg\max_{k \in U} A(k)$ first, and then updates the acquisition function values per the new, hypothetical SSL model with added index $k_1^*$ and associated pseudolabel $\hat{y}_{k_1^*}$. They next select the maximizer of the updated acquisition values k […]

Figure 3. Binary-Clusters Sequential Active Learning Choices. Each plot shows the ground truth classification (red/blue) for the Binary-Clusters dataset as well as the active learning choices (yellow stars) for an assortment of acquisition functions. UNC-LOG selects points that lie between only some of the squares, while VOPT-HF selects points evenly spread out over the whole domain. In either case, overall performance is suboptimal compared to the DB-RKHS and MC-GR methods, which select points located in each of the squares as well as between squares.

Figure 4. Binary-Clusters accuracy plots, with 2 initially labeled points in each experiment. Sequential (4a) performs 200 active learning iterations selecting B = 1 points per iteration, while Batch (4b) performs 100 active learning iterations selecting B = 5 points per iteration. VOPT-HF and SOPT-HF achieve a quick initial accuracy increase in the sequential test (4a), but level off at a lower accuracy than MC-GR, MC-P, and DB-RKHS; the HF and RKHS acquisition functions are however more costly to compute (Figure 8).

Figure 5. Accuracy plots for acquisition functions in the MGR model applied to MNIST, Salinas A, and Urban datasets. For each dataset, two points are selected uniformly at random from each class to initially label and then acquisition functions select 500 points in 100 batches of size B = 5.

Figure 6. Accuracy plots for acquisition functions in the CE model applied to MNIST, Salinas A, and Urban datasets. For each dataset, two points are selected uniformly at random from each class to initially label and then acquisition functions select 500 points in 100 batches of size B = 5.

Figure 8. Average Active Learning Query Iteration Timing Comparison. Each iteration of the active learning process selects uniformly at random a candidate set comprising 10% of the unlabeled dataset on which to evaluate the corresponding acquisition function. Times shown are the average recorded times to evaluate the acquisition function on the candidate set for increasing overall dataset size N. Solid lines present timing results for binary models, while dashed lines present timing results for multiclass models.

5. Conclusion. We present a novel Model-Change (MC) active learning acquisition function along with a general framework that unifies different graph-based semi-supervised learning (SSL) models. Applying the Laplace approximation to the non-Gaussian Bayesian posterior distributions arising from different loss functions in the family of graph-based SSL models of Table 1 admits efficient approximations of how the underlying classifier could change as a result of labeling unlabeled points. This framework and associated active learning acquisition function are made more efficient by introducing the "spectral truncation" modifications, wherein we constrain the graph-based SSL classifiers to only the lower-lying eigenvalues and corresponding eigenvectors, which also diminishes the memory requirements of the model. The MC acquisition function proves to be efficient for active learning compared to other methods natural for the graph-based SSL setting.
Appendix B. Gradient and Hessian Calculations for Cross-Entropy Model.
Appendix C. Strict Convexity of Cross-Entropy (CE) Model. We verify that the CE objective function (3.4) is strictly convex by showing the positive definiteness of the Hessian. The crux is merely showing that the likelihood potential for the CE model is indeed convex, per the properties of its Hessian matrix. Combining this property with the strict convexity of the graph-based regularizer proves the existence of unique minimizers of the graph-based CE objective function. Recall the objective function for the spectral truncation paradigm, $J_{CE}(\alpha; Y) = \frac{1}{2} \langle \alpha, \Lambda_\tau \alpha \rangle + \sum_{j \in L} \Big[ -\frac{1}{\gamma} \langle y_j, P_j V \alpha \rangle + \ln \sum_{h=1}^{n_c} e^{\langle \alpha, V^T \hat{e}_{j+(h-1)N} \rangle / \gamma} \Big]$