1 Introduction

Image segmentation is a fundamental problem in machine learning and computer vision. One classical approach involves partial differential equation (PDE)-based methods, which segment an image by numerically solving a PDE derived from the minimization of an energy functional [9, 12, 23, 34]. More recently, graph-based methods have been developed for both semi-supervised and unsupervised learning in image processing [3, 7, 8, 14, 17, 18, 20, 27, 29, 30]. Another common choice is neural network methods, including convolutional neural networks (CNN) [35] and graph convolutional networks (GCN) [41, 43], whose trainable convolutional filters are optimized by minimizing the difference between predicted and ground-truth labels.

We employ a graph-based learning method in which the feature vectors of the pixels in an image are used to construct a graph whose edge weights are determined by feature-vector similarity [6]. This approach has proven successful in noisy image recovery [29], studies combining LIDAR and optical remotely sensed images [21], and blind hyperspectral unmixing [37]. Graph learning trains a classifier by minimizing a graph-based energy function to directly identify a function on the nodes of the graph. This differs from graph neural networks, which train a convolutional kernel and evolve the corresponding convolutional operator on the graph.

Active learning is a branch of machine learning that judiciously selects a limited number of unlabeled data to query for labels, with the aim of maximally improving the underlying classifier’s performance [39]. An acquisition function is used to quantify which data would be useful to label from the set of available unlabeled data. Active learning can significantly improve classifier performance at very low label rates and minimize the cost of labeling data by domain experts [15, 31, 33, 39].

Traditional active learning selects labeled data sequentially: in each step, only the global maximum of the acquisition function is selected. Batch active learning instead selects a query set of multiple points in each step of the active learning process, which poses new challenges compared to sequential active learning: selecting data points that carry similar information is redundant and does not fully utilize the acquisition function. Some prior methods for batch active learning imitate sequential active learning by selecting the batch through a greedy sequential process [11, 22, 25] or segment the candidate set into several small subsets and assemble the batch from the maximizers of each subset [16, 24]. We introduce a novel batch active learning approach, called LocalMax, that selects a collection of unlabeled data satisfying a graph local-maximum condition in each step of the active learning process. The original version of LocalMax was developed for synthetic aperture radar (SAR) image classification tasks [13]. Compared with other batch active learning approaches, LocalMax is more efficient while attaining nearly identical performance to sequential active learning.

The novelty of this paper lies in the batch graph-based active learning pipeline for image segmentation tasks with very low label rates. This pipeline is specifically designed for the hyperspectral pixel classification task. With fewer than 0.5% of pixels labeled, selected by the active learning process, the graph learning classifier achieves accuracy similar to that obtained with 10% randomly sampled labeled pixels.

Active learning has recently been shown to produce excellent results for hyperspectral image segmentation, especially when combined with similarity graphs and a novel Model-Change acquisition function [31]. However, such methods require sequential updates that involve recalculating the segmentation problem at each active learning step. While these steps are computationally efficient, there is a significant gain in efficiency to be made by developing a batch process for active learning. In the Model-Change method, a strategic choice of data point to be labeled is made at each sequential step, and that choice does not naturally extend to a batch process. In particular, if one uses a sequential acquisition function for a batch process, the points to be labeled will not be optimal (e.g., they could all be chosen from the same area). We propose a novel, efficient batch active learning method based on a LocalMax condition. This was initially developed for SAR image classification [13], and here we further develop it for hyperspectral pixel classification. In this paper, while expanding the application of LocalMax to the image segmentation task, we accelerate the LocalMax algorithm and provide a more detailed analysis of its computational complexity.

2 Background for Graph-Based Active Learning Model

In this section, we review basic graph learning and some active learning techniques applied to graph learning classifiers. We construct a similarity graph via a K-nearest neighbors approach [1]. We apply graph Laplace learning [45] with some labeled nodes to classify unlabeled nodes. The labeled nodes are selected through the active learning process.

2.1 Graph Construction

We generate a graph based on the dataset \(X = \{x_1,x_2,\cdots ,x_N\}\subset {\mathbb {R}}^d\) of d-dimensional feature vectors. X is indexed by the index set \(Z = \{1,2,\cdots ,N\}\). Consider the graph \(G(X, W)\) with the vertex (node) set X and the edge weight matrix \(W \in {\mathbb {R}}^{N \times N}\), where \(W_{ij}\) denotes the edge weight between vertices \(i \ne j\). The weight \(W_{ij}\) is chosen to be proportional to the similarity between the corresponding feature vectors \(x_i\) and \(x_j\). In our model, we choose

$$\begin{aligned} W_{ij} = \exp \left( -\frac{\angle (x_i,x_j)^2}{\sqrt{\tau _i\tau _j}}\right) , \end{aligned}$$
(1)

where \(\angle (x_i,x_j)= \arccos \left( \frac{x_i^\top x_j}{\Vert x_i\Vert \Vert x_j\Vert }\right)\) is the angle between feature vectors \(x_i\) and \(x_j\). The normalization constant \(\tau _i\) is chosen according to the similarity to the \(K{\mathrm {th}}\) nearest neighbor of i (i.e., \(\tau _i = \angle (x_i, x_{i_K})\), where \(x_{i_K}\) is the \(K{\mathrm {th}}\) nearest neighbor to \(x_i\)).

To improve the computational efficiency, we require the \(N\times N\) weight matrix W to be sparse. For each vertex \(x_i\), we only consider edges between \(x_i\) and its K-nearest neighbors (KNN) according to the angle similarity stated above. This can be done by an approximate nearest neighbor search algorithm [1]. Let \(x_{i_k},\ k=1,2,\cdots ,K\) be the K-nearest neighbors of \(x_i\) (including \(x_i\) itself) according to the angle similarity. Define a sparse weight matrix by

$$\begin{aligned} {\bar{W}}_{ij} = {\left\{ \begin{array}{ll} W_{ij}, \ j = i_1,i_2,\cdots ,i_K,\\ 0, \ \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

For practical purposes, K is chosen as small as possible while ensuring the connectivity of the corresponding graph G. The connectivity property is required when calculating the acquisition functions (Sect. 2.3). We symmetrize the sparse weight matrix to obtain our final weight matrix by redefining \(W_{ij}:= ({\bar{W}}_{ij} + {\bar{W}}_{ji})/2\). Note that W is sparse, symmetric, and non-negative (i.e., \(W_{ij} \geqslant 0\)).
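
A minimal Python sketch of this construction, using scikit-learn's exact K-nearest-neighbor search in place of an approximate one and assuming nonzero feature vectors; the function name and parameter defaults are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, K=50):
    """Sparse, symmetric angle-similarity graph of Eqs. (1)-(2).

    X: (N, d) array of nonzero feature vectors. Returns a CSR matrix with
    W[i, j] = exp(-angle(x_i, x_j)^2 / sqrt(tau_i * tau_j)) on KNN edges.
    """
    N = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=K, metric="cosine").fit(X)
    dist, idx = nbrs.kneighbors(X)                 # cosine distance = 1 - cos(angle);
    ang = np.arccos(np.clip(1.0 - dist, -1.0, 1.0))  # each point is its own first neighbor
    tau = ang[:, -1]                               # angle to the K-th nearest neighbor
    rows = np.repeat(np.arange(N), K)
    cols = idx.ravel()
    scale = np.maximum(np.sqrt(tau[rows] * tau[cols]), 1e-12)  # guard against duplicates
    vals = np.exp(-ang.ravel() ** 2 / scale)
    W = sparse.csr_matrix((vals, (rows, cols)), shape=(N, N))
    return (W + W.T) / 2                           # symmetrize as in Sect. 2.1
```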

2.2 Graph Learning

With a graph \(G(X, W)\) constructed as described in the previous section, we now describe a graph-based approach for semi-supervised learning (SSL) and present previous work in this field. Assume we have observations of the ground-truth labels on a subset of vertices \(Z_0\subset Z\). Let \(y^\dag {:}\; Z_0 \rightarrow \{1,2,\cdots ,n_c\}\) be the ground-truth labeling function that maps each index \(j \in Z_0\) to exactly one class label \(y_j^\dag =y^\dag (j)\in \{1,2,\cdots ,n_c\}\). The corresponding one-hot encoding mapping is \({\textbf{y}}^\dag {:} \; Z_0\rightarrow \{e_1,e_2,\cdots ,e_{n_c}\}\) defined by \({\textbf{y}}^\dag (j) = e_{y^\dag (j)}\), where \(e_k\) is the \(k \text {th}\) standard basis vector, with all zeros except a 1 at the \(k\text {th}\) entry. The goal of the semi-supervised learning task is to predict the labels of the unlabeled vertices \(x_i\in X\), \(i\in Z-Z_0\).

Important geometric information about the dataset X is encoded in graph Laplacian matrices [2, 42] defined on G. Define \(d_j = \sum _{k\in Z} W_{jk}\) to be the degree of node j and let D be the diagonal matrix with diagonal entries \(d_1,d_2,\cdots ,d_N\). While there are various graph Laplacians one could define [42], we use the unnormalized graph Laplacian matrix \(L_u = D - W\) in this paper.

The inferred classification of unlabeled vertices comes from thresholding a continuous-valued node function \({\textbf{u}}{:}\; Z\rightarrow {\mathbb {R}}^{n_c}\). In particular, the predicted label of \(x_i\in X\) is \(y_i = {{\,\mathrm{arg\,max}\,}}\{u_1(i),u_2(i),\cdots ,u_{n_c}(i)\}\), where \(u_k(i)\) is the \(k \text {th}\) entry of \({\textbf{u}}(i)\). Consider an \(N\times n_c\) matrix U whose \(i {\text {th}}\) row is \({\textbf{u}}(i)\); that is, each node function \({\textbf{u}}\) can be identified with a matrix U whose \(i\text {th}\) row represents the output of \({\textbf{u}}\) at node i. The graph-based SSL model that we consider obtains an optimal \({\hat{U}}\) (i.e., an optimal node function \({\hat{{\textbf{u}}}}\)) by solving an optimization problem of the form

$$\begin{aligned} \begin{aligned} {\hat{U}}&= \mathop {\mathrm{arg\,min}}\limits _{U} J_\ell (U, {\textbf{y}}^\dag )\\&=\mathop {\mathrm{arg\,min}}\limits _{U\in {\mathbb {R}}^{N\times n_c}} \frac{1}{2}\langle U, LU\rangle _{\mathrm {F}} + \sum _{j\in Z_0}\ell \left( {\textbf{u}}(j),{\textbf{y}}^\dag (j)\right) , \end{aligned} \end{aligned}$$
(3)

where \(\langle \cdot ,\cdot \rangle _{\mathrm {F}}\) is the Frobenius inner product for matrices. The loss function \(\ell{:}\;{\mathbb {R}}^{n_c}\times {\mathbb {R}}^{n_c}\rightarrow {\mathbb {R}}\) measures the difference between the prediction \({\textbf{u}}(i)\) and the ground-truth \({\textbf{y}}^\dag (i)\) for i in the observation set \(Z_0\). While there are several choices for the loss function, we simply apply a hard-constraint penalty

$$\begin{aligned} \ell _h(x,y) = {\left\{ \begin{array}{ll} +\infty , \text { if } x\ne y,\\ 0, \text { if } x= y. \end{array}\right. } \end{aligned}$$
(4)

This hard-constraint penalty function \(\ell _h\) forces the minimizer \({\hat{U}}\) to be exactly the same as the ground-truth \({\textbf{y}}^\dag\) on the observation set \(Z_0\). This SSL scheme was introduced in [45] and we refer to it as Laplace learning. We can reorder the vertices to be able to write \(U = \begin{bmatrix}U_l\\ U_u\end{bmatrix}\), where \(U_l\) corresponds to the submatrix of U whose rows are in the labeled (observed) index set \(Z_0\) and \(U_u\) similarly corresponds to the submatrix of U whose rows are in \(Z - Z_0\) (i.e., the unlabeled index set). Likewise, we can split the weight matrix W and degree matrix D into labeled and unlabeled submatrices as

$$\begin{aligned} W = \begin{bmatrix} W_{ll}&{}W_{lu}\\ W_{ul}&{}W_{uu} \end{bmatrix}, \quad D = \begin{bmatrix} D_{ll}&{}D_{lu}\\ D_{ul}&{}D_{uu} \end{bmatrix}. \end{aligned}$$
(5)

As a result of the hard-constraint labeling of Laplace learning, \({\hat{U}}_l\) is fixed as the one-hot encodings of the observations on the labeled set \(Z_0\). According to [45], the optimizer \({\hat{U}}_u\) of Laplace learning can be calculated explicitly as

$$\begin{aligned} {\hat{U}}_u = (D_{uu}-W_{uu})^{-1}W_{ul}{\hat{U}}_l. \end{aligned}$$
(6)

Laplace learning gives a harmonic solution \({\hat{{\textbf{u}}}}\) on the graph G, which implies the sum-to-one property of the output node function \({\hat{{\textbf{u}}}}\): if the ground-truth labels are given in one-hot form, then for any node \(i \in Z\) we have \({\hat{u}}_k(i)\geqslant 0,\,k=1,2,\cdots ,n_c\) and \(\sum _{k=1}^{n_c}{\hat{u}}_k(i) = 1\), where \({\hat{{\textbf{u}}}}(i) = ({\hat{u}}_1(i),{\hat{u}}_2(i),\cdots ,{\hat{u}}_{n_c}(i))\). With this property, \({\hat{u}}_k(i)\) can be treated as the predicted probability that node i belongs to class k.
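
As an illustration, the following is a minimal sketch of solving Eq. (6) with SciPy's conjugate gradient solver; the function name and interface are ours, not the paper's code, and solver details may differ from the authors' implementation:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

def laplace_learning(W, labeled_idx, Y_l):
    """Solve Eq. (6) for the unlabeled block U_u of the node function.

    W: sparse symmetric weight matrix; labeled_idx: indices in Z_0;
    Y_l: |Z_0| x n_c one-hot labels on Z_0. Returns the full N x n_c matrix U
    with hard-constrained labeled rows and harmonic unlabeled rows.
    """
    W = sparse.csr_matrix(W)
    N = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(N), labeled_idx)
    d = np.asarray(W.sum(axis=1)).ravel()                 # node degrees
    L = (sparse.diags(d) - W).tocsr()                     # unnormalized Laplacian L = D - W
    L_uu = L[unlabeled_idx][:, unlabeled_idx]             # D_uu - W_uu
    W_ul = W[unlabeled_idx][:, labeled_idx]
    rhs = np.asarray(W_ul @ Y_l)                          # right-hand side of Eq. (6)
    # One conjugate-gradient solve per class; L_uu is positive definite when
    # the graph is connected and Z_0 is nonempty.
    U_u = np.column_stack([cg(L_uu, rhs[:, c])[0] for c in range(Y_l.shape[1])])
    U = np.zeros((N, Y_l.shape[1]))
    U[labeled_idx] = Y_l
    U[unlabeled_idx] = U_u
    return U
```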

There are various other graph SSL schemes based on the optimization problem (3). The main difference between them and the Laplace learning scheme is the choice of penalty function \(\ell\). In this paper, we use the multiclass Gaussian regression (MGR) model [4, 26], which applies an \(L_2\)-norm penalty function \(\ell _\gamma (x,y) = \frac{1}{2\gamma ^2}\Vert x-y\Vert _2^2\). The MGR model approximates the graph Laplace learning model in the limit \(\gamma \rightarrow 0^+\).

Denote by \({\mathcal {G}}(N)\) the computational cost of a Laplace learning process on the graph \(G = (X,W)\) with the labeled set \(Z_0\). Assume the graph is constructed with the KNN sparse similarity matrix W (Sect. 2.1) and that the size of the labeled set is much smaller than the number of nodes, i.e., \(|Z_0|\ll |Z|\). Recall that the computational complexity of the conjugate gradient method for solving the linear equation \(Ax = b\) (Chapter 10 of [40]) is \(O(m\sqrt{\kappa })\), where m and \(\kappa\) are the number of non-zero entries and the condition number of the matrix A, respectively. If we solve Eq. (6) by the conjugate gradient method, we have \({\mathcal {G}}(N) = O(KN\sqrt{\kappa _L})\), where \(\kappa _L\) is the condition number of the graph Laplacian L.

2.3 Active Learning: Review of Acquisition Functions

Active learning improves the performance of the underlying SSL method by carefully selecting unlabeled points to hand label via an oracle or human in the loop. The aim of active learning is to identify the unlabeled inputs (\(x_i\in X\) with \(i\in Z-Z_0\)) for which it would be "most helpful" to have a human in the loop observe and provide labels. The core of active learning is the acquisition function \({\mathcal {A}}{:}\, Z - Z_0 \rightarrow {\mathbb {R}}\), which evaluates the benefit of obtaining the label of each unlabeled datapoint. The query set \({\mathcal {Q}} \subset Z - Z_0\) of unlabeled points to be labeled is chosen via the optimization of the acquisition function. Note that in this work, we use Laplace learning [45] as the underlying semi-supervised classifier. Figure 1 shows the flowchart of our active learning process based on the graph Laplace learning classifier. The acquisition functions we introduce are designed for the graph learning classifier, including the Uncertainty (UC) [5, 32, 36], Model-Change (MC) [31, 32], Variance Minimization (VOpt) [22], and Model-Change Variance Optimal (MCVOpt) acquisition functions [33].

Fig. 1
figure 1

Flowchart of the active learning process. The active learning loop is based on a fixed graph. In each step, we apply Laplace learning on the graph and update the labeled set with a query set selected based on the current acquisition function values. Note that a human-in-the-loop process may be needed to obtain the labels of the selected query set in each step of the active learning process

The UC acquisition function \({\mathcal {A}}_\text {UC}\) quantifies the uncertainty of the classifier on each unlabeled node [5, 32, 36] from the current classifier's output \({\hat{{\textbf{u}}}}(i)\) at that node. Uncertainty sampling thus prioritizes querying points that are close to the current classifier's decision boundaries. Various methods can be applied to quantify the uncertainty based on \({\hat{{\textbf{u}}}}\). Here we consider the smallest-margin uncertainty acquisition function:

$$\begin{aligned} {\mathcal {A}}_\text {UC}(i) = 1 - \left( u_{k_0}(i) - \max _{k=1,2,\cdots ,n_c; k\ne k_0}u_k(i)\right) , \end{aligned}$$
(7)

where \(i \in Z\) and \(k_0 = {{\,\mathrm{arg\,max}\,}}_{k=1,2,\cdots ,n_c}u_k(i)\).
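
For illustration, the smallest-margin uncertainty in (7) can be computed for all nodes at once; this is a sketch and the function name is ours:

```python
import numpy as np

def uncertainty_acquisition(U_hat):
    """Smallest-margin uncertainty (7) for every node.

    U_hat: (N, n_c) array whose i-th row is the Laplace-learning output at node i.
    Returns a length-N array; in the active learning loop only the entries on the
    unlabeled set Z - Z_0 are used.
    """
    top2 = np.sort(U_hat, axis=1)[:, -2:]   # runner-up and largest value per row
    margin = top2[:, 1] - top2[:, 0]        # u_{k_0}(i) minus the second largest entry
    return 1.0 - margin
```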

We need some preliminaries before introducing the formulas for the VOpt, MC, and MCVOpt acquisition functions. \(L = L_u\) is a positive semi-definite matrix. By adjusting the number of nearest neighbors K considered at each vertex, we can guarantee that the graph G is connected, so that the corresponding Laplacian matrix L has exactly one zero eigenvalue. We may order the eigenvalues of L as \(0=\lambda _1 < \lambda _2\leqslant \cdots \leqslant \lambda _N\), and then consider the truncated decomposition of L with the smallest \(M < N\) eigenvalues as \({\hat{L}} = V\Lambda V^\top\), where \(\Lambda \in {\mathbb {R}}^{M\times M}\) is a diagonal matrix with diagonal entries \(\lambda _1, \lambda _2, \cdots , \lambda _M\) and \(V = [v^1,v^2,\cdots ,v^M]\in {\mathbb {R}}^{N\times M}\) is the matrix of corresponding eigenvectors. Furthermore, we define \(v_k\) as the \(k \text {th}\) column of \(V^\top\), \(A = V^\top U\in {\mathbb {R}}^{M\times n_c}\), and \({\hat{A}} = V^\top {\hat{U}}\in {\mathbb {R}}^{M\times n_c}\), where \({\hat{U}} = {{\,\mathrm{arg\,min}\,}}_{U\in {\mathbb {R}}^{N\times n_c}} {\tilde{J}}_\ell (A,{\textbf{y}}^\dag )\).

In addition, in the interest of numerical stability and similarly to [33], we replace the hard-constraint penalty function \(\ell _h\) with the MGR penalty function \(\ell _\gamma (x,y) = \frac{1}{2\gamma ^2}\Vert x-y\Vert _2^2\) for the MC, VOpt, and MCVOpt acquisition function calculations (but not for the underlying SSL model). The MGR penalty \(\ell _\gamma\) is a numerically stable perturbation of the hard-constraint penalty function \(\ell _h\): when \(\gamma \rightarrow 0^+\), \(\ell _\gamma \rightarrow \ell _h\). We then define the MGR correlation matrix by

$$\begin{aligned} C_\text {MGR} = \left( \Lambda + V^\top \left( \frac{1}{\gamma ^2}P^\top P\right) V\right) ^{-1}, \end{aligned}$$
(8)

where \(P \in {\mathbb {R}}^{|Z_0|\times N}\) is a projection matrix onto the indices corresponding to the labeled set \(Z_0\). When the graph G is connected, the matrix \(C_\text {MGR}\) is guaranteed to exist, i.e., the matrix \((\Lambda + V^\top (\frac{1}{\gamma ^2}P^\top P)V)\) is invertible.
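
A sketch of computing \(C_\text {MGR}\) from a truncated eigendecomposition of L; the values of M and \(\gamma\) below are illustrative, not the paper's tuned parameters, and L is assumed to have a floating-point dtype:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def mgr_correlation(L, labeled_idx, M=50, gamma=0.1):
    """MGR correlation matrix C_MGR of Eq. (8) via spectral truncation.

    L: sparse unnormalized graph Laplacian; labeled_idx: indices in Z_0.
    Returns (C, V, vals), where V holds the M eigenvectors used.
    """
    # Smallest M eigenpairs of L; for large graphs a shift-invert call,
    # eigsh(L, k=M, sigma=0), is usually faster than which="SM".
    vals, V = eigsh(L, k=M, which="SM")
    Lam = np.diag(vals)
    V_l = V[labeled_idx, :]                        # P V: eigenvector rows on the labeled set
    C = np.linalg.inv(Lam + (V_l.T @ V_l) / gamma**2)
    return C, V, vals
```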

The VOpt acquisition function \({\mathcal {A}}_\text {VOpt}\) is designed to minimize the expected error of the prediction results [22]. If we acquire the label of an unlabeled node \(k\in Z\setminus Z_0\) and use the labels of \(Z_0\cup \{k\}\) in the graph learning, then the expected prediction error on the set \(Z\setminus (Z_0\cup \{k\})\) can be computed as follows:

$$\begin{aligned} {\mathbb {E}}\left( \sum _{i\in Z\setminus (Z_0\cup \{k\})}\Vert {\textbf{u}}(i) - {\textbf{y}}^\dag _i\Vert ^2\right) = \text {Tr}(L_k^{-1}), \end{aligned}$$
(9)

where \(L_k\) is the submatrix of the graph Laplacian L with both row and column indices in \(Z\setminus (Z_0\cup \{k\})\). Approximating \(L_k^{-1}\) by the truncated decomposition, we obtain the VOpt acquisition function

$$\begin{aligned} {\mathcal {A}}_\text {VOpt}(k) = \frac{1}{\gamma ^2 + v_k^\top C_\text {MGR} v_k}\Vert C_\text {MGR}v_k\Vert ^2_2. \end{aligned}$$
(10)

The MC and MCVOpt acquisition functions [31,32,33] were developed based on the look-ahead model with the objective energy

$$\begin{aligned} J_\ell ^{k,{\hat{{\textbf{y}}}}_k}(U,{\textbf{y}}^\dag ; \; {\hat{{\textbf{y}}}}_k) = \frac{1}{2}\langle U, LU\rangle _{\mathrm {F}} + \sum _{i\in Z_0}\ell \left( {\textbf{u}}(i),{\textbf{y}}^\dag (i)\right) + \ell \left( {\textbf{u}}(k),{\hat{{\textbf{y}}}}_k\right) , \end{aligned}$$
(11)

where \({\hat{{\textbf{y}}}}_k\) is the one-hot pseudo-label for the currently unlabeled node \(k \in Z - Z_0\). In practice, \({\hat{{\textbf{y}}}}_k\) is the one-hot thresholding vector of \({\hat{{\textbf{u}}}}(k)\) (the \(k \text {th}\) row of \({\hat{U}}\)). Let \({\hat{U}}^{k,{\hat{{\textbf{y}}}}_k} = {{\,\mathrm{arg\,min}\,}}J_\ell ^{k,{\hat{{\textbf{y}}}}_k}(U,{\textbf{y}}^\dag ; \; {\hat{{\textbf{y}}}}_k)\). With the spectral truncation, the MC acquisition function is given by

$$\begin{aligned} \begin{aligned} {\mathcal {A}}_\text {MC}(k)&= \Vert {\hat{U}}^{k,{\hat{{\textbf{y}}}}_k} - {\hat{U}}\Vert _{\mathrm F}\\&= \frac{1}{\gamma ^2 + v_k^\top C_\text {MGR}v_k}\Vert C_\text {MGR}v_k\Vert _2\Vert {\hat{A}}^\top v_k-{\hat{{\textbf{y}}}}_k\Vert _2. \end{aligned} \end{aligned}$$
(12)

Similarly, the MCVOpt acquisition function can be written as

$$\begin{aligned} {\mathcal {A}}_\text {MCVOpt}(k) = \frac{1}{\gamma ^2 + v_k^\top C_\text {MGR}v_k}\Vert C_\text {MGR}v_k\Vert ^2_2\Vert {\hat{A}}^\top v_k-{\hat{{\textbf{y}}}}_k\Vert _2. \end{aligned}$$
(13)
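
With \(C_\text {MGR}\) and V from the previous sketch, the three spectral acquisition functions (10), (12), and (13) can be evaluated for all nodes at once. The following is a sketch under the same assumptions; in practice only the entries on the unlabeled set are used:

```python
import numpy as np

def spectral_acquisitions(C, V, U_hat, Y_hat, gamma=0.1):
    """VOpt (10), MC (12), and MCVOpt (13) values for every node.

    C: M x M MGR correlation matrix (symmetric); V: N x M truncated eigenvectors,
    so v_k is the k-th row of V; U_hat: N x n_c Laplace-learning output;
    Y_hat: N x n_c one-hot pseudo-labels obtained by thresholding U_hat.
    """
    A_hat = V.T @ U_hat                                  # spectral coefficients, M x n_c
    CV = V @ C                                           # row k equals (C_MGR v_k)^T
    denom = gamma**2 + np.einsum("km,km->k", V, CV)      # gamma^2 + v_k^T C_MGR v_k
    cv_norm = np.linalg.norm(CV, axis=1)                 # ||C_MGR v_k||_2
    resid = np.linalg.norm(V @ A_hat - Y_hat, axis=1)    # ||A_hat^T v_k - y_hat_k||_2
    vopt = cv_norm**2 / denom
    mc = cv_norm * resid / denom
    mcvopt = cv_norm**2 * resid / denom
    return vopt, mc, mcvopt
```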

3 Batch Active Learning Pipeline for Image Segmentation

This section introduces our pipeline with the graph Laplace learning classifier and batch active learning. Given an image for the segmentation task, we extract a feature vector for each pixel and construct a similarity graph based on the cosine similarity between these feature vectors according to Sect. 2.1. Then we apply the graph Laplace learning (Sect. 2.2) with labeled pixels selected by our batch active learning approach, LocalMax. The node classification on the graph gives a segmentation of the given image.

3.1 Image Segmentation Pipeline

We develop an image segmentation pipeline with the graph learning and batch active learning approaches described in Sects. 2 and 3.2. The first step for pixel classification is to associate each pixel with a feature vector. One can simply use the pixel values of all channels as the feature vector, in which case the dimension of the feature vector equals the number of channels of the image. While this simple construction is straightforward, it is useful to also include neighborhood information of each pixel in the feature extraction.

For pixel i, consider a \((2k+1)\times (2k+1)\) neighborhood patch \(P_i\) centered at pixel i. If pixel i is near the boundary of the image, apply reflection padding to expand the image before taking the neighborhood patch. Inspired by the non-local means method [10], we consider a \((2k+1)\times (2k+1)\) discrete Gaussian kernel G with \(\sigma = k/2\). Specifically,

$$G(r,c) = \frac{\alpha }{2\uppi \sigma ^2}\exp {\left( -\frac{(r-k-1)^2 + (c-k-1)^2}{2\sigma ^2}\right) },$$

where \(\alpha\) is a constant such that \(\sum _{r,c=1}^{2k+1}G(r,c) = 1\) and \((r,c)\) indexes the entries of the \((2k+1)\times (2k+1)\) kernel (the index i is reserved for the pixel). The weighted patch is then defined by

$$P_i^w(r,c) = P_i(r,c)G(r,c)$$

for each pair of patch indices \(r,c=1,2,\cdots ,2k+1.\) This feature extraction process is illustrated in Fig. 2.

Fig. 2
figure 2

The feature extraction process for a single pixel. The feature vector is a Gaussian-weighted patch centered on the pixel

We apply this non-local means weighting process to each of the C channels in the image. Flattening these weighted patches and concatenating them together gives the non-local means feature vector for pixel i. The dimension of the resulting non-local means feature vector for a given pixel is \(d=C(2k+1)^2\).
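
A minimal sketch of this feature extraction, written as an explicit (slow) double loop for clarity; the normalization constant \(\alpha\) is absorbed by rescaling the kernel to sum to one, and the function name is ours:

```python
import numpy as np

def nonlocal_means_features(image, k=3):
    """Gaussian-weighted patch features of Sect. 3.1.

    image: (H, W, C) array. Returns an (H*W, C*(2k+1)**2) feature matrix whose row
    for pixel (i, j) concatenates the weighted (2k+1)x(2k+1) patch of every channel.
    Reflection padding handles pixels near the image boundary.
    """
    H, W, C = image.shape
    p, sigma = 2 * k + 1, k / 2.0
    ax = np.arange(p) - k
    G = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma**2))
    G /= G.sum()                                    # normalize the discrete Gaussian kernel
    padded = np.pad(image, ((k, k), (k, k), (0, 0)), mode="reflect")
    feats = np.empty((H * W, C * p * p))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + p, j:j + p, :] * G[:, :, None]   # Gaussian-weighted patch
            feats[i * W + j] = patch.transpose(2, 0, 1).ravel()   # channel-wise concatenation
    return feats
```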

For a given image, we extract each pixel's non-local means feature vector to obtain the feature vector set X (\(|X|= N\)). We then build a similarity graph \(G(X, W)\) with the sparse similarity weight matrix W constructed via the KNN method of Sect. 2.1. For efficiency, we want to select a relatively small KNN parameter K while keeping the generated graph G connected. Throughout all of our experiments, we find that \(K=50\) is a reasonable value to ensure the connectivity and relative sparsity of the resulting similarity graph. In practice, we suggest using a binary search to identify a relatively small value of K that results in graph connectivity.
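
The suggested binary search exploits the fact that adding more neighbors only adds edges, so connectivity is monotone in K. The following is a sketch, assuming the build_knn_graph function from the earlier graph-construction sketch and an illustrative search range:

```python
from scipy.sparse.csgraph import connected_components

def smallest_connecting_K(X, K_min=5, K_max=100):
    """Binary search for the smallest K in [K_min, K_max] whose KNN graph is connected."""
    lo, hi = K_min, K_max
    while lo < hi:
        mid = (lo + hi) // 2
        W = build_knn_graph(X, K=mid)                       # construction from Sect. 2.1
        n_comp, _ = connected_components(W, directed=False)
        if n_comp == 1:
            hi = mid                                        # connected: try a smaller K
        else:
            lo = mid + 1                                    # disconnected: need a larger K
    return lo
```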

On the graph G, we randomly initialize a labeled node set and apply LocalMax batch active learning to select a labeled set of nodes \(Z_0\) according to Sects. 2.3 and 3.2. Finally, we predict the labels of the unlabeled nodes \(Z-Z_0\) on the graph G with the graph Laplace learning classifier based on the selected labeled node set \(Z_0\). This node classification on G gives a segmentation of the given image. The flowchart of our pipeline is shown in Fig. 3.

Fig. 3
figure 3

Our graph-based active learning pipeline for the image segmentation task. Red box: feature extraction (Sect. 3.1); Blue box: Graph Construction (Sect. 2.1); Yellow box: Batch Active Learning (Sects. 2.3 and 3.2); Green box: Graph Learning (Sect. 2.2)

3.2 Batch Method

In the active learning process illustrated by Fig. 1, we select a query set according to a prescribed acquisition function \({\mathcal {A}}\). In the sequential active learning process, the query set \(Q\subset Z-Z_0\) is selected by

$$\begin{aligned} Q = \{k\},\,k=\mathop {\mathrm{arg\,max}}\limits _{k\in Z-Z_0}{\mathcal {A}}(k). \end{aligned}$$
(14)

For batch active learning with batch size B, simply selecting the top-B maximizers of the acquisition function is likely to include nodes that are adjacent in the graph. As an inductive bias, graph Laplace learning produces similar outputs at neighboring nodes, so it is redundant to sample neighboring nodes in the graph.

We propose a batch active learning method named LocalMax, originally developed for the classification of SAR datasets [13]. We define a local maximum of a node function \({\mathcal {A}}{:} \,Z\rightarrow {\mathbb {R}}\) on a KNN-generated graph G according to Definition 1.

Definition 1

(Local max of a graph node function) Consider a KNN-generated graph \(G = (X,W)\), where X is the set of nodes indexed by Z and W is the edge weight matrix. For a graph node function \({\mathcal {A}}{:}\, Z\rightarrow {\mathbb {R}}\), a node \(k\in Z\) is a local maximum if and only if \({\mathcal {A}}(k)\geqslant {\mathcal {A}}(j)\) for every j such that there is an edge between \(x_j\) and \(x_k\). Equivalently, \(k\in Z\) is a local maximum if and only if

$$\begin{aligned} {\mathcal {A}}(k)\geqslant {\mathcal {A}}(j),\,\forall j \text { s.t. }W_{jk}>0. \end{aligned}$$
(15)

The LocalMax batch active learning method selects the batch query set as the top-B local maxima of the acquisition function \({\mathcal {A}}\) on the graph G. Algorithm 1 shows the LocalMax batch active learning procedure. Note that the batch size B cannot be made arbitrarily large, as there might not be enough local maxima of the discrete acquisition function values at a given iteration. Algorithm 1 has a worst-case computational complexity of O(KN), where K is the KNN parameter (Sect. 2.1) and N is the number of nodes in the graph G. Since usually \(K\ll N\), the computational complexity is effectively O(N).

figure a
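
A minimal Python sketch of the LocalMax selection in Algorithm 1, assuming a CSR-format weight matrix W and acquisition values stored in an array; tie-handling and other details may differ from the paper's implementation:

```python
import numpy as np

def local_max_batch(acq, W, unlabeled_idx, B):
    """Select up to B local maxima (Definition 1) of the acquisition values.

    acq: length-N array of acquisition values, with labeled nodes set to -inf;
    W: sparse symmetric KNN weight matrix (CSR); unlabeled_idx: indices in Z - Z_0.
    """
    local_max = []
    for k in unlabeled_idx:
        nbrs = W.indices[W.indptr[k]:W.indptr[k + 1]]       # neighbors of node k
        if acq[k] >= np.max(acq[nbrs], initial=-np.inf):    # local-maximum condition (15)
            local_max.append(k)
    local_max = np.asarray(local_max)
    order = np.argsort(-acq[local_max])                     # rank local maxima by acquisition
    return local_max[order[:B]]                             # top-B local maxima
```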

This method has some important advantages. In practice, a region of high acquisition value contains only a small number of local maxima, so the LocalMax method obtains a batch of nodes drawn from multiple regions of high acquisition. Due to the complex structure of the data, there are often many regions with high acquisition values, so batches can be relatively large. The method also selects what the model predicts to be the most important point from each high-acquisition region.

According to Sect. 2.2, the computational cost of graph Laplace learning is \({\mathcal {G}}(N) = O(KN\sqrt{\kappa _L})\). If we want to sample a total of M query nodes over the whole active learning process, the computational complexity of the sequential active learning process is \(M{\mathcal {G}}(N)\), while LocalMax batch active learning with batch size B has computational complexity \((M/B)\left[ O(N)+{\mathcal {G}}(N)\right]\). This implies that the LocalMax batch active learning process is much more efficient than sequential active learning, roughly by a factor of the batch size B; this is verified by the experiments in Sect. 4.1.
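
Written out, the predicted speedup of LocalMax over sequential sampling is

$$\begin{aligned} \frac{M{\mathcal {G}}(N)}{\frac{M}{B}\left[ O(N)+{\mathcal {G}}(N)\right] } = \frac{B\,{\mathcal {G}}(N)}{O(N)+{\mathcal {G}}(N)}\approx B, \end{aligned}$$

where the approximation holds whenever the graph learning cost \({\mathcal {G}}(N) = O(KN\sqrt{\kappa _L})\) dominates the O(N) cost of the LocalMax selection itself.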

4 Experiments and Results

This section presents the experiments and results of our graph-based active learning pipeline on image segmentation tasks. We run two types of experiments. The first compares LocalMax, the sequential active learning process, and two other straightforward batch sampling approaches for active learning. The second applies our pipeline to image segmentation tasks and compares it with a similar semi-supervised approach proposed in [27]. Finally, we provide some comments about our experimental results.

We consider three datasets, Landsat-7, Urban, and Kennedy Space Center (KSC), in the following experiments. Brief descriptions of these datasets follow.

  i. The Urban dataset was recorded in October 1995 by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) over the urban area of Copperas Cove, TX, USA. Zhu et al. [44] provided the ground-truth labels with four classes: asphalt, grass, tree, and roof. This dataset contains a hyperspectral image with \(307\times 307\) pixels, each covering an area of 2 square meters. The raw image includes 210 channels, but we use the clean version with 162 channels, obtained after removing some channels affected by dense water vapor and atmospheric effects. Figure 4 shows the raw image and ground-truth labels of the Urban dataset.

  ii. The Landsat-7 dataset used in this paper is a multispectral image of the Colville River (Alaska, USA) from the RiverPIXELS dataset [43], which provides paired Landsat and water-and-sediment labeled patches of size \(256\times 256\times 6\), where the 6 multispectral channels correspond to Blue, Green, Red, Near IR, Shortwave IR 1, and Shortwave IR 2. Each pixel in the image covers roughly 900 m\(^2\). RiverPIXELS provides ground-truth labels for this image with three classes: land, water, and bare sediment.

  iii. The Kennedy Space Center dataset is a hyperspectral image of the Kennedy Space Center (KSC) in Florida, acquired by the NASA AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) instrument. This hyperspectral image has size \(512\times 614\). The raw image includes 224 channels; we use the clean version with 176 channels, obtained after removing water-absorption and low-SNR channels. There are 314 368 pixels in this dataset, of which only 5 211 (around \(1.66\%\)) have ground-truth labels. The ground-truth labels comprise 13 classes of different land coverings in this region.

We consider the overall accuracy (OA) to evaluate the performance of different methods. The definition of the OA is

$$\begin{aligned} \text {OA} = \frac{\text {Number of Correctly Classified Pixels}}{\text {Total Number of Pixels}}. \end{aligned}$$
(16)

Note that, according to the KNN parameter selection process described in Sect. 3.1, the number of nearest neighbors used to construct the graph is \(K=50\).

4.1 Comparison between LocalMax and Sequential Active Learning: Accuracy and Efficiency

In this section, we conduct our experiments only on the Urban dataset. We use the hyperspectral pixel values as the feature vectors, i.e., the feature vector of each pixel is its vector of 162 channel values. With four acquisition functions, we sample up to 134 pixels (\(0.15\%\) of all pixels) with different active learning sampling methods. We initialize the labeled set with 10 random pixels in each class (40 pixels in total) and sample an additional 94 pixels according to the active learning methods. For a given acquisition function \({\mathcal {A}}\), we consider four sampling methods, Sequential, Random, Top-Max, and LocalMax, the last three of which are batch active learning methods with a batch size B (a brief sketch of the Random and Top-Max baselines follows the list below).

  • Sequential sampling selects the global maximum node of \({\mathcal {A}}\) to update the current labeled set. The query set \({\mathcal {Q}}=\{k^*\}\) and \(k^*= {{\,\mathrm{arg\,max}\,}}_{k\in Z-Z_0}{\mathcal {A}}(k)\).

  • Random sampling selects a batch of B unlabeled nodes according to the uniform distribution on the unlabeled node set \(Z-Z_0\).

  • Top-Max sampling selects a batch of B unlabeled nodes as the top-B maximum of \({\mathcal {A}}\), i.e., the query set \({\mathcal {Q}} = \{i_1,i_2,\cdots ,i_B\}\subset Z-Z_0\) where \(i_1 = {{\,\mathrm{arg\,max}\,}}_{i\in Z-Z_0}{\mathcal {A}}(i)\) and \(i_b = {{\,\mathrm{arg\,max}\,}}_{i\in Z-Z_0-\{i_1,\cdots ,i_{b-1}\}}{\mathcal {A}}(i)\) for \(b=2,3,\cdots ,B\).

  • LocalMax sampling is the method we proposed in Sect. 3.2.
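
For concreteness, the following are minimal sketches of the two baseline batch samplers; the function names are ours and LocalMax is sketched in Sect. 3.2:

```python
import numpy as np

def random_batch(unlabeled_idx, B, rng=np.random.default_rng()):
    """Uniformly sample a batch of B unlabeled nodes."""
    return rng.choice(unlabeled_idx, size=B, replace=False)

def top_max_batch(acq, unlabeled_idx, B):
    """Select the B unlabeled nodes with the largest acquisition values."""
    order = np.argsort(-acq[unlabeled_idx])
    return unlabeled_idx[order[:B]]
```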

Figure 5 shows the curves of accuracy versus the number of labeled pixels for the four acquisition functions, and Table 1 shows the time consumption and accuracy values for label rates \(0.1\%\) and \(0.15\%\). From these experiments, we conclude the following.

  i. Accuracy Performance: Sequential active learning has the best accuracy according to Fig. 5, which shows its higher accuracy for almost all numbers of labeled pixels. Our batch active learning method LocalMax has the second-best accuracy and performs almost identically to the sequential method, especially for larger numbers of labeled pixels. This is also verified by Table 1, which shows the top-2 accuracy values in bold. LocalMax consistently attains accuracies in the top 2 and sometimes performs better than Sequential.

  ii. Efficiency Performance: According to the timings in Table 1, LocalMax takes approximately the same time as the Random and Top-Max sampling methods, while sequential active learning takes around eight times longer. The observed time multiplier of 8 is close to the theoretical multiplier of 10 (the batch size B) predicted in Sect. 3.2.

In summary, LocalMax batch active learning is much more efficient than Sequential active learning without significantly sacrificing accuracy.

Table 1 Comparison between different active learning sampling methods
Fig. 4
figure 4

Urban Dataset. a shows the raw hyperspectral image we used for experiments. b shows the ground-truth labels. Label information: asphalt (navy blue), grass (light blue), trees (yellow), roof (red)

Fig. 5
figure 5

Comparison between batch active learning methods and sequential active learning for four acquisition functions. Each panel includes four curves; the X-axis is the number of labeled pixels and the Y-axis is the accuracy. The blue, yellow, green, and red curves correspond to the Random, Top-Max, LocalMax, and Sequential sampling methods, respectively, for the active learning process. More details on accuracy values and time consumption are shown in Table 1. Descriptions of each sampling method are in Sect. 4.1

4.2 Semi-supervised Image Segmentation with Low Label Rates

Here we perform segmentation experiments on all three datasets: Landsat-7, Urban, and KSC. Performance is evaluated by the OA as a function of the amount of labeled data used in training. Our method is compared with the graph-based semi-supervised method (abbreviated GL-SSL) proposed in [27]. There are some differences between our graph learning method and GL-SSL, the foremost being that our method uses the KNN approach to build a sparse similarity graph, while GL-SSL is based on a fully connected graph and uses the Nyström extension method to approximate the graph Laplacian matrix. In addition, our approach finds the optimal node function by minimizing the energy function of graph Laplace learning, while GL-SSL minimizes a regularized Ginzburg-Landau functional [17, 28, 29]. We have reimplemented the GL-SSL code in Python, and the corresponding results are based on our implementation. The parameters selected for GL-SSL follow the authors' recommendations with some fine-tuning. To keep the presentation concise, we only report results of GL-SSL with randomly selected training data. We also tested the performance of GL-SSL with the training set selected by our active learning approach; since the active learning process introduced in this paper is designed for graph Laplace learning, the OA of GL-SSL with an active-learning training set is almost the same as with a random training set. The OA of our method or GL-SSL with a randomly sampled training set is averaged over 15 random samplings. Each random sampling starts with 1 random pixel in each class and then chooses the remaining labeled pixels uniformly at random.

Results for the Landsat-7 dataset are shown in Table 2 and Fig. 6. We use each pixel's non-local means feature vector to build the similarity graph. The neighborhood patch size is \(7\times 7\), which leads to a 294-dimensional feature vector for each pixel. We sample up to 200 pixels (\(0.3\%\) of all pixels), starting from a random initialization of one pixel in each class (three pixels in total), via the LocalMax batch active learning approach with a batch size of 20. In addition, we sample up to 3 300 labeled pixels (around \(5\%\) of all pixels), starting from a random initialization of 10 labeled pixels in each class (30 in total), with a batch size of 100. According to the OA values shown in Table 2, the UC and MCVOpt acquisition functions reach better accuracy with \(0.3\%\) labeled pixels than randomly selecting \(5\%\) labeled pixels. Figure 6 shows the segmentation result of our graph-based batch active learning method with the UC acquisition function.

Table 2 The OA of a Landsat-7 multispectral image (65 536 pixels)
Fig. 6
figure 6

The ground-truth and segmentation result of the Landsat-7 multispectral image from the RiverPIXELS dataset. a ground-truth labels; b segmentation result with \(0.3\%\) labeled pixels sampled according to LocalMax with a batch size of 20 and the UC acquisition function. The similarity graph is based on non-local means feature vectors with a neighborhood patch of size \(7\times 7\)

Results for the Urban dataset are shown in Table 3 and Fig. 7. For these experiments, we use the hyperspectral pixel values as the feature vector for each pixel. We sample up to 286 pixels (\(0.3\%\) of all pixels), starting from a random initialization of one pixel in each class (four pixels in total), with LocalMax with a batch size of 10. In addition, we sample 4 700 pixels (around \(5\%\) of all pixels) with LocalMax with a batch size of 100, starting from a random initialization of 10 pixels in each class (40 in total). LocalMax batch active learning with the UC acquisition function attains an accuracy of \(97.30\%\) with only \(0.3\%\) labeled pixels, which is close to the accuracy of \(97.76\%\) attained with \(10\%\) randomly selected labeled pixels.

Table 3 The OA of the Urban dataset (94 249 pixels)
Fig. 7
figure 7

The segmentation result of the Urban dataset with \(0.3\%\) labeled pixels sampled according to LocalMax batch active learning with a batch size of 10. a UC acquisition function; b MCVOpt acquisition function. Label information: asphalt (navy blue), grass (light blue), trees (yellow), and roof (red). The ground-truth labels are in Fig. 4b

On the KSC dataset, we consider and calculate our results only on the 5 211 labeled pixels (around \(1.66\%\)) out of 314 368 total pixels, visualized in Fig. 8a. We segment this hyperspectral image into 13 classes according to the ground-truth labels. We sample up to 325 pixels (\(6\%\) of all pixels with ground-truth labels), starting from a random initialization of 1 pixel in each class (13 pixels in total). Table 4 shows the OA on the KSC dataset. Our LocalMax batch active learning with the UC acquisition function performs best. Figure 8b shows the segmentation result of our graph-based batch active learning method with the UC acquisition function.

Table 4 The OA of the KSC dataset (5 211 pixels with ground-truth labels)
Fig. 8
figure 8

The ground-truth and segmentation result of the KSC dataset. a ground-truth labels of 5 211 pixels, comprising 13 classes; b segmentation result with \(6\%\) labeled pixels sampled according to LocalMax batch active learning with a batch size of 10 and the UC acquisition function

In summary, with a very low label rate (less than \(0.5\%\)), our method achieves relatively good OA while GL-SSL does not perform well. Our graph-based active learning approach can significantly reduce the number of labeled pixels required for the semi-supervised image segmentation task. According to Tables 2 and 3, graph Laplace learning with \(0.3\%\) labeled pixels selected by active learning achieves accuracy similar to that with 5% to 10% randomly selected labeled pixels.

4.3 Comments About Our Experiments

In this part, we comment on the experiments in Sects. 4.1 and 4.2. With a relatively small percentage of training data, the MCVOpt acquisition function performs similarly to or even better than the UC acquisition function. This is evident in Table 2: with 0.1% labeled data, UC attains only 25% accuracy, while the other acquisition functions attain more than 90%. When the percentage of training data increases, the UC acquisition function achieves superior OA. A reason for this phenomenon may lie in the different tendencies of the acquisition functions toward exploration or exploitation in the active learning process. Roughly speaking, the UC acquisition function focuses on exploiting the classifier's current decision boundaries, determined by the current labeled set, by querying data points along those boundaries between different classes. The other three acquisition functions, MC, VOpt, and MCVOpt, focus more on exploration than exploitation; they are designed to explore the geometric structure of the entire dataset. At the beginning of the active learning process, it is more important to explore the inherent geometric or clustering structure of the entire dataset. When the labeled percentage or the current accuracy is relatively high, exploiting the decision boundary can be more helpful for increasing the overall classification accuracy. As an empirical summary, we recommend using the UC acquisition function, but we caution that it is not necessarily always the best choice due to its tendency to exploit during the active learning process.

According to Fig. 5, there is a gap between the accuracies of LocalMax and sequential active learning for small numbers of labeled nodes. The main reason for this initial gap is that a newly labeled node can significantly change the acquisition function's values when the amount of labeled data is relatively small. In such circumstances, LocalMax may not perform well, since the next global maximum of the acquisition function might not be close to any local maximum of the current acquisition function. This gap does not necessarily persist throughout the active learning process, and the LocalMax batch sampling method better approximates sequential sampling as the amount of labeled data increases.

5 Conclusion

We propose a graph-based batch active learning pipeline for multi- and hyperspectral image segmentation. Our method shows excellent image segmentation performance using very low percentages of the available training data. Compared with a similar graph-based image segmentation method proposed in [27], our method requires fewer labeled pixels to achieve better OA. This suggests that careful selection of the points to label through active learning is beneficial for this application of graph-based semi-supervised classification. In addition, we introduced a batch active learning approach, LocalMax, to select a batch of pixels in each step of the active learning process. According to our experiments, LocalMax batch active learning not only accelerates the process of sampling pixels but also retains accuracy similar to sequential active learning using the same acquisition function.