1 Introduction

In recent years, Graph Neural Networks (GNNs) and their variants have achieved great success on the semi-supervised node classification task [5, 13, 19, 20, 44]. Given training sets of the same size, different choices of labeled nodes lead to different GNN performance [4]. Clearly, more labeled nodes help improve classification accuracy. However, labeling instances may require expert knowledge or be time-consuming [31], which makes labels expensive and impractical to collect on a large scale in real scenarios. Furthermore, unlike traditional active learning approaches that focus on independent and identically distributed instances, graph topology provides additional structural information for node selection. Consequently, given a labeling budget B, selecting the most representative B nodes in the graph for GNN training is both important and non-trivial. Owing to its importance, a line of research has studied active learning on GNNs [4, 11, 53, 54]. Nevertheless, these models either perform unsatisfactorily or cannot scale to large networks without trading off accuracy. Therefore, designing an effective and efficient GNN active learning framework that scales to large networks remains challenging.

Our newly defined node selection criterion is based on the observation that personalized PageRank (PPR) [28] values can be used to evaluate node importance during the feature propagation process of GNNs. As pinpointed in [22, 48], the recursive message-passing framework [12] encourages similar predictions between linked nodes, severely degrading model performance. To tackle this issue, PPNP [20] and PPRGo [2] suggest decoupling the feature propagation from non-linear operations. They first calculate the approximate PPR matrix and subsequently propagate features on the PPR matrix. Following their success, many GNNs [6, 8, 21, 24, 35, 57] primarily propagate features using an approximation or an adaptation of the PPR matrix. Similarly, state-of-the-art node embedding approaches [10, 47, 49, 51, 56] select PPR as the proximity measure and decompose the PPR matrix to generate node representations. These studies have showcased the effectiveness of incorporating PPR values to capture crucial graph information during the feature propagation process. Furthermore, we notice that the standard feature propagation process \(\varvec{{\hat{A}}}\varvec{H}\) [19, 44] can be viewed as a random walk-based feature (information) propagation approach. However, a recent work, PPREI [55], reveals that PPR-based embeddings preserve more accurate topological information than random walk-based embeddings. This finding explains, from a topological perspective, the superior performance of PPR-based graph representation learning approaches over random walk-based alternatives. Specifically, if the PPR value of node v with respect to node u is above a given threshold, then the feature information propagated from node u to v is highly likely to impact its final embedding. Motivated by these observations, in this paper, we measure the importance of a node from the perspective of importance and diversity during the feature propagation process of GNNs, with a connection to PPR values.

Inspired by the above observation, in this paper, we present FICOM,Footnote 1 an effective and scalable GNN active learning framework. Our node selection criterion consists of two key components. The first component aims to select the most influential nodes during feature propagation. PPR and its adaptations have been shown to be favorable choices for incorporating multi-hop information during feature propagation. One crucial observation is that, in such GNNs, if there exist more nodes whose PPR values with respect to node u are above a tunable threshold \(\theta \) (otherwise, the value will be negligible), then node u is more influential. If \(\varvec{\pi }_u(v) > \theta \), we say node v is covered by node u. Then, we define the PPR coverage of u (resp. set S) as the total number of nodes that are covered by node u (resp. at least one node in set S). We further formulate a PPR coverage maximization problem that aims to find a set of B nodes with the maximum PPR coverage, providing a new perspective to elucidate the connection between PPR values and node importance. To further incorporate embedding information into our framework, the second component of our node selection criterion includes a diversity measure that aims to select influential nodes in the embedding space. Specifically, if \(dist(u,v) \le d\), we say nodes u and v are d-reachable to each other. Then, we define the embedding diversity of u (resp. set S) as the total number of nodes that are d-reachable to node u (resp. at least one node in set S). Our final objective function is the PPR-based diversified coverage, which considers the PPR coverage and the embedding space diversity simultaneously. The goal is to select a set of B nodes that maximizes the objective function score. We show that the proposed objective function is submodular, and this property admits a \((1-1/e)\)-approximate algorithm [27] that greedily selects the node with the maximum marginal gain.

Although combining node importance and diversity is not a new idea and several studies exist on this topic, e.g., GRAIN [54], our solution advances existing solutions in two aspects. (i) First, we introduce a novel perspective on the node importance criterion by considering a node's influence during the GNN feature propagation process and the diversity it brings in the embedding space. This perspective establishes a connection between PPR values and the node selection strategy. Through extensive experiments, we show that our node selection strategy significantly improves downstream GNN performance. (ii) Second, existing solutions either struggle to scale to million-node graph datasets or require computationally expensive pre-processing steps, such as pre-computing the distances between almost all node pairs, which is impractical. In contrast, we propose an adaptive pruning strategy that addresses the efficiency challenges of the greedy selection algorithm on large graphs. This strategy allows us to overcome scalability issues and efficiently handle large graph datasets.

Specifically, the above greedy selection process needs to find the node that brings the largest marginal gain to the objective function score in each iteration, which incurs a prohibitive cost. To explain, computing the PPR coverage and the diversity takes \(O(nm\log n)\) and \(O(n^2F)\) time, respectively. This makes a brute-force greedy solution limited to small graphs and unsuitable for million-node graphs. Intuitively, if we only need to select the element with the maximum value among all elements, we do not have to compute all exact values in advance, which leaves us room for speedup.

Our key contribution is a scalable framework that tackles the efficiency issue of the greedy selection algorithm. In particular, to overcome the above limitations, we propose a novel algorithm that first estimates the upper- and lower-bound of the objective score for each node. Then, in each iteration, we prune the less important nodes and derive a small subset of candidate nodes that could bring the maximum marginal gain. Next, we calculate the exact objective value of each candidate node and select the node that brings the maximum marginal gain, which tremendously reduces the computational cost while still providing the \((1-1/e)\)-approximation guarantee.

Notice that in the first component, we need to compute the PPR coverage of each node. To speed up the PPR coverage computation, we propose to use approximate PPR scores to obtain the upper- and lower-bound of PPR coverage for each node. Then, we prune a large number of nodes with small coverage values. In the second component, we need to compute the distances between each node pair in the embedding space so that for each node v, we can derive the number of nodes whose embedding vectors have a distance to v’s embedding vector no larger than d. This incurs prohibitive running costs and memory costs on large datasets. Therefore, speeding up the computation of all-pair distances is even more challenging than that of the PPR coverage.

To reduce the distance computation cost, previous works like GRAIN [54] assume that the distances are pre-computed and given.Footnote 2 In [54], they also mention that it is possible to pre-process by pruning a vast number of nodes with smaller degrees. However, as shown in our experiments, such a choice may result in GNN performance degradation. We may turn to approximate solutions, where we only derive a theoretically bounded estimate of the number of nodes falling within the given distance threshold d, using popular techniques like the bottom-k sketch, which will be explained in detail in Sect. 4. However, the issue with such a solution is that the initialization of the bottom-k sketch already incurs quadratic time, which is no better than a brute-force computation strategy. To tackle this challenging issue, we propose to adaptively construct the bottom-k sketch so that we can still derive the upper- and lower-bound of the diversity for each node even with a partially constructed bottom-k sketch. With these bounds, we can further adaptively prune a vast number of nodes with low diversity scores, making the computation scalable to large networks. In this way, FICOM enables us to select nodes to label from all nodes, avoiding a huge number of unnecessary computations while preserving high effectiveness. As a result, FICOM achieves high effectiveness, efficiency, and scalability.

In our experiments, we compare FICOM against alternative GNN active learning approaches on the semi-supervised node classification task. We conduct experiments on six benchmark networks. Four well-known message-passing GNNs, GCN, SGC, APPNP, and GCNII, are selected to evaluate the generalization ability of FICOM. Extensive experiments show that FICOM consistently outperforms other competitors on semi-supervised node classification tasks on all datasets.

Our contributions can be summarized as follows:

  • We formulate the node selection process as an optimization problem, where the importance of a node is measured according to its influence during the feature propagation and the diversity it brings in the embedding space with a connection to PPR.

  • We present an efficient, effective, and scalable node selection framework via adaptive pruning strategies with a \((1-1/e)\)-approximation guarantee.

  • Extensive experimental results on six benchmark datasets using four GNNs reveal the efficiency and effectiveness of FICOM.

2 Preliminaries

2.1 Problem definition

Following previous work [4, 11], in this paper, we consider a pool-based active learning setting, where there exists a large pool of unlabeled instances \({\mathcal {U}}\) and a small labeled training set \({\mathcal {L}}\). Given a learning model \({\mathcal {M}}\), we aim to propose a query strategy that incrementally selects the most representative nodes from the unlabeled set \({\mathcal {U}}\) and adds them into the labeled set \({\mathcal {L}}\). The goal is to train model \({\mathcal {M}}\) on the labeled set \({\mathcal {L}}\) and achieve the best possible performance on the semi-supervised node classification task. Table 1 lists the notations that are frequently used in this paper.

Definition 1

(Active Learning for GNN) Given an undirected graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) with \(|{\mathcal {V}}| = n\) nodes and \(|{\mathcal {E}}| = m\) edges, where \({\mathcal {V}}\) is partitioned into three disjoint subsets \(({\mathcal {V}}_{train}, {\mathcal {V}}_{val}, {\mathcal {V}}_{test})\), an adjacency matrix \(\varvec{A} \in {\mathbb {R}}^{n \times n}\), a feature matrix \(\varvec{X} \in {\mathbb {R}}^{n \times F}\), a label matrix \({\mathcal {Y}} \in \{0,1\}^{n \times c}\), a GNN model \({\mathcal {M}}\), a budget B, and a metric \(\varPhi \), the goal of active learning for GNN on the semi-supervised node classification task is to select a subset of nodes \({\mathcal {V}}_{{\mathcal {L}}}^* \subseteq {\mathcal {V}}_{train}\) that yields the best performance on the given GNN model \({\mathcal {M}}\), i.e.,

$$\begin{aligned} {\mathcal {V}}_{{\mathcal {L}}}^* = \mathop {\arg \max }\limits _{{|{\mathcal {V}}_{{\mathcal {L}}}|=B}} \varPhi ( {\mathcal {Y}},{\mathcal {M}}( \varvec{A},\varvec{X},{\mathcal {V}}_{{\mathcal {L}}})). \end{aligned}$$
Table 1 Frequently used notations

2.2 Personalized pagerank

Given an undirected graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) with \(|{\mathcal {V}}| = n\) nodes and \(|{\mathcal {E}}| = m\) edges, let \(\varvec{A} \in {\mathbb {R}}^{n \times n}\) denote the adjacency matrix, \(\varvec{D} \in {\mathbb {R}}^{n \times n}\) denote the degree matrix, and \(\varvec{P} = \varvec{D}^{-1} \varvec{A}\) denote the transition matrix. Given a source node s, Page et al. [28] first define Personalized PageRank (PPR) as follows:

$$\begin{aligned} \varvec{\pi }_s = (1-\alpha )\varvec{\pi }_s\cdot \varvec{P} + \alpha \varvec{e}_s, \end{aligned}$$
(1)

where \(\varvec{\pi }_s\) is the PPR vector with respect to source s. Specifically, an entry \(\varvec{\pi }_s(v)\) in PPR vector \(\varvec{\pi }_s\) indicates the PPR score of v with respect to s, where \(\alpha \) is the teleport probability and \(\varvec{e}_s\) is a one-hot vector with only \(\varvec{e}_s(s)=1\). The PPR vector \(\varvec{\pi }_s\) can be obtained by recursively applying Eq. 1 until convergence, as shown in [28].
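To make Eq. 1 concrete, the following minimal NumPy sketch (illustrative only, not part of FICOM's implementation; the function name, iteration count, and default \(\alpha \) are hypothetical choices) computes \(\varvec{\pi }_s\) by fixed-point iteration on a dense transition matrix.

```python
import numpy as np

def ppr_vector(P, s, alpha=0.15, iters=100):
    """Compute the PPR vector pi_s of Eq. 1 by fixed-point (power) iteration."""
    n = P.shape[0]
    e_s = np.zeros(n)
    e_s[s] = 1.0                                 # one-hot indicator of the source node
    pi = e_s.copy()
    for _ in range(iters):
        pi = (1 - alpha) * pi @ P + alpha * e_s  # pi_s = (1 - alpha) * pi_s * P + alpha * e_s
    return pi
```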

2.3 Graph neural networks

Graph Neural Networks (GNNs) generalize convolutional neural networks (CNNs) to graph data. Most existing GNN models follow a message-passing architecture [12], which iteratively updates the representation of a node by aggregating the representations of its neighbors. After k iterations, the representation of each node captures the information of its k-hop neighborhood, and the representation of node v at the k-th layer, denoted as \(\varvec{h}_v^{(k)}\), can be defined in an iterative manner:

$$\begin{aligned} \varvec{h}_v^{(k)} = COM\left( \varvec{h}_v^{(k-1)},AGG(\{\varvec{h}_u^{(k-1)}:u \in {\mathcal {N}}(v)\})\right) , \end{aligned}$$

where AGG denotes the aggregation function and COM denotes the combination function. Next, we briefly review several well-known GNN models that will be used to evaluate different GNN active learning methods.

GCN: The vanilla GCN [19] adds self-loops to the original adjacency matrix and utilizes the normalized transition matrix \(\varvec{{\hat{A}}}\) for feature propagation, where \(\varvec{{\hat{A}}} = (\varvec{D}+\varvec{I})^{-1/2}(\varvec{A}+\varvec{I})(\varvec{D}+\varvec{I})^{-1/2}\) is the symmetric transition matrix with self-loops and \(\varvec{D}_{ii}=\varSigma _{j}\varvec{A}_{ij}\) is the diagonal node degree matrix. The node representations are updated according to the following layer-wise propagation rule:

$$\begin{aligned} \varvec{H}^{(k+1)} = \sigma (\varvec{{\hat{A}}}\varvec{H}^{(k)}\varvec{W}^{(k)}), \end{aligned}$$

where \(\sigma \) is the ReLU activation function, i.e., \(\sigma (\varvec{M})_{ij} = \max \{0,\varvec{M}_{ij}\}\); \(\varvec{H}^{(0)}\) is initialized to be \(\varvec{X}\), i.e., the original feature matrix.
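As a small illustration (not the paper's implementation; the function name is hypothetical), one GCN layer of the above rule is a single matrix expression:

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN layer: H' = ReLU(A_hat @ H @ W), where A_hat is the normalized adjacency with self-loops."""
    return np.maximum(A_hat @ H @ W, 0.0)   # element-wise ReLU of the propagated features
```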

SGC: Message-passing GNNs smooth the representations of a node and its neighbors. Therefore, they are likely to make similar predictions for linked nodes, resulting in degraded performance. This phenomenon is called over-smoothing [22]. SGC [44] alleviates this problem by removing the non-linear operations and collapsing the weight matrices. The final k-layer classifier is:

$$\begin{aligned} \varvec{Z} = softmax(\varvec{{\hat{A}}}^{k}\varvec{X}\varvec{W}), \end{aligned}$$

where \({softmax}(\varvec{M})_{ij} = \exp (\varvec{M}_{ij})/(\sum _k \exp (\varvec{M}_{ik}))\).

APPNP: Decoupling feature propagation from the non-linear transformation allows each node to leverage a larger neighborhood. APPNP [20] utilizes a PPR-based propagation scheme. The propagation process can be expressed as a power iteration, i.e.,

$$\begin{aligned} \varvec{H}^{(k+1)} = (1-\alpha )\varvec{{\hat{A}}}\varvec{H}^{(k)} + \alpha \varvec{H^{(0)}}, \end{aligned}$$

where \(\varvec{H}^{(0)}=f_{\varvec{W}}(\varvec{X})\), and \(f_{\varvec{W}}(\cdot )\) denotes a simple fully connected neural network parameterized by \(\varvec{W}\).

GCNII: Formally, the k-th layer of GCNII [5] is:

$$\begin{aligned} \begin{aligned} \varvec{H}^{(k+1)} =&\left( (1-\alpha )\varvec{{\hat{A}}}\varvec{H}^{(k)} + \alpha \varvec{H^{(0)}}\right) \\&\left( (1-\beta _k)\varvec{I}_n + \beta _k \varvec{W}^{(k)}\right) , \\ \varvec{H}^{(k+1)} =&\sigma \left( \varvec{H}^{(k+1)} \right) , \end{aligned} \end{aligned}$$
(2)

where \(\varvec{H}^{(0)}=f_{\varvec{W}}(\varvec{X})\), and \(\beta _k\) is a hyper-parameter. Firstly, it employs a residual connection that ensures the representations retain information from the input, i.e., the \(\alpha \varvec{H}^{(0)}\) term in the first multiplier in Eq. 2. Secondly, it adds an identity matrix \(\varvec{I}_n\) to the weight matrix \(\varvec{W}\), as shown in the second multiplier in Eq. 2. These techniques enable GCNII to express a polynomial filter with arbitrary coefficients and allow GCNII to adopt deeper architectures, e.g., 64 layers, without harming the performance, significantly relieving the over-smoothing problem in GNNs.

2.4 Existing solutions

There are roughly four categories of GNN active learning methods: combination methods, clustering methods, reinforcement learning (RL) methods, and influence maximization (IM) methods. Next, we briefly review the existing works.

Combination methods: The methods in this category generally follow the standard active learning settings. AGE [4] and ANRMAB [11] define uncertainty [32] and representativeness [34] to evaluate the importance of each node. Intuitively, a larger information entropy brings more uncertainty, and a larger information density or graph centrality indicates a more representative node in the embedding space. In their objective functions, AGE applies time-sensitive parameters for different measures, while ANRMAB adopts a multi-armed bandit method for the node selection strategy. ALG [53] further proposes a receptive field maximization problem and utilizes a greedy approach to obtain an approximate solution under the balanced cluster constraint. However, all these methods still provide sub-optimal model performance or cannot scale to large networks.

Clustering methods: Another idea in GNN active learning is to perform clustering on node embeddings. FeatProp [46] first computes pair-wise distances of node embedding vectors that are propagated by SGC. Then, it applies K-Means clustering and selects nodes that are closest to these cluster centers.

RL methods: There also exist methods that combine active learning with Reinforcement learning (RL). GPA [16] formalizes the selection process as a Markov decision process, learning the selection strategy with RL algorithms. We note that GPA needs to train the policy network on several fully labeled source graphs before generalizing to a new query graph, which is unsuitable for our setting.

IM methods: One of the recent methods, GRAIN [54], connects social influence maximization with GNN active learning. It combines the feature influence model proposed in [36] with a diversity function to measure the representativeness of each node in \({\mathcal {V}}_{train}\), presenting a GNN active learning approach from the perspective of influence maximization. However, the influence of a node u on v defined in their problem focuses more on the impact of the input feature of u on the aggregated feature of v. In particular, given the same change of the input feature of a node u, the larger the change of the aggregated feature of v is, the higher the influence of u on v is. Such a computation is rather expensive, and it is required for all node pairs in \({\mathcal {V}}_{train}\). Besides, when considering diversity, GRAIN adopts a complicated formulation that is difficult to compute and even challenging to estimate in an unbiased manner, limiting its scalability on large networks. In contrast, our FICOM focuses on the influence during the feature propagation from the perspective of personalized PageRank, and we can support highly efficient algorithms that derive the scores on large-scale networks. Besides, to compute the diversity, we derive an unbiased estimation for each node and obtain upper- and lower-bounds for each node. Then, during the greedy selection, we adaptively prune nodes with small diversity scores. Experimental results further show that FICOM is more effective than GRAIN on all datasets with all GNN models, and gains up to a \(2.41\%\) lead in terms of accuracy on GCN, which is a significant improvement.

3 Diversified coverage maximization

As discussed in [22, 48], recursive message-passing [12] operations lead to similar predictions between linked nodes, which severely degrades the model performance. To tackle this issue, PPNP [20] and PPRGo [2] suggest decoupling the feature propagation from non-linear transformation operations. They first calculate the approximate PPR matrix and subsequently propagate features on the PPR matrix. Formally, their models can be defined as follows:

$$\begin{aligned} \varvec{Z}=softmax(\varvec{{\hat{\varPi }}}\varvec{H}), \varvec{H}=f_{\varvec{W}}(\varvec{X}), \end{aligned}$$

where \(\varvec{{\hat{\varPi }}}\) is the approximate PPR matrix, \(f_{\varvec{W}}(\cdot )\) denotes a fully connected neural network which is parameterized by the weight matrix \(\varvec{W}\), and \(\varvec{Z}\) is the final embedding matrix.

As a matter of fact, PPR can be used to estimate the importance of each node during the feature propagation process. In particular, if more non-zero PPR values exist in the PPR matrix from the perspective of node u, then node u will have an impact on more nodes during the feature propagation process. In our framework, we focus on the PPR values that are sufficiently large, i.e., larger than some pre-defined threshold \(\theta \) (otherwise the values will be negligible), in which case the feature propagation on such node pairs is highly likely to affect the final embeddings. However, this node selection strategy only considers the structural information of the graph and overlooks the valuable feature information. This limitation might lead to sub-optimal performance when applied to downstream GNNs. To address this problem, we introduce a diversity function that aims to achieve the maximum coverage of the selected nodes in the embedding space. Similar to the PPR coverage, we focus on the node pairs whose embeddings have Euclidean distance smaller than some pre-defined threshold d, in which case the embeddings of such node pairs are sufficiently similar to each other.

By integrating these two components, in this section, we elaborate on the objective function of FICOM, which aims to solve a novel PPR-based diversified coverage maximization problem.

3.1 PPR coverage

Given an input graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) with \(|{\mathcal {V}}|=n\) nodes and \(|{\mathcal {E}}|=m\) edges, we first assume that the PPR values are given and defer computation details to Sect. 4. Given a threshold \(\theta \), if the PPR value of node v with respect to node u is above \(\theta \), i.e., \(\varvec{\pi }_u(v) > \theta \), then we say v is covered by node u. Denote the set of nodes covered by u as \(\phi (u)\). We further define the PPR coverage of node u as \(|\phi (u)|\), i.e., the number of nodes covered by u. Formally,

$$\begin{aligned} \phi (u)=\{v|\varvec{\pi }_u(v) > \theta , \forall v \in {\mathcal {V}}\}. \end{aligned}$$

Similarly, the nodes covered by set S, denoted as \(\phi (S)\), is the set of nodes that are covered by at least one node in set S. Then, the PPR coverage of set S is denoted as \(|\phi (S)|\). More formally,

$$\begin{aligned} \phi (S)=\bigcup _{u\in S} \{v|\varvec{\pi }_u(v) > \theta , \forall v \in {\mathcal {V}}\}. \end{aligned}$$
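The following short Python sketch (illustrative only; it assumes the PPR scores are available as a nested dictionary, which is not how FICOM stores them) makes the definitions of \(\phi (u)\) and \(\phi (S)\) concrete.

```python
def ppr_coverage_set(pi, theta, S):
    """phi(S): nodes whose PPR value w.r.t. at least one node in S exceeds theta.

    pi: dict of dicts with pi[u][v] = PPR score of v w.r.t. u (negligible entries may be omitted).
    """
    return {v for u in S for v, score in pi[u].items() if score > theta}

# |phi(u)| for a single node u is simply len(ppr_coverage_set(pi, theta, {u})).
```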

Definition 2

(PPR Coverage Maximization Problem) Given a graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\), an integer B, the PPR coverage maximization problem asks for a size B set \(S^*\) with the largest PPR coverage, i.e.,

$$\begin{aligned} S^*=\mathop {\arg \max }\limits _{S':|S'|=B}|\phi (S')|. \end{aligned}$$

Remark

Specifically, if a node t has PPR scores above the threshold with respect to more than one node in a set S, node t is counted only once in the coverage of S. This definition aims to find the set S that can propagate enough information (at least \(\theta \)) to the maximum number of nodes. In contrast, if we defined the coverage as the sum of PPR scores, the same information could be counted multiple times, and the selected nodes might propagate a large amount of information to only a small set of nodes, which may hamper the training effectiveness.

The following theorem shows the hardness of the above-mentioned problem.

Theorem 1

The PPR coverage maximization problem is NP-hard.

Proof

We prove that the PPR coverage maximization problem is NP-hard by a reduction from the maximum coverage problem, which is defined as follows: given a collection of sets \(S=\{S_1, S_2,..., S_m\}\) over a universe \(U=\{u_1, u_2,..., u_n\}\) (the sets may share elements) and a number \({\mathcal {B}}\), find a subset \(S' \subseteq S\) of sets such that \(|S'|\le {\mathcal {B}}\) and the total number of covered elements \(|\bigcup _{S_i \in S'} S_i|\) is maximized.

Given an arbitrary instance \({\mathcal {I}}\) of the maximum coverage problem, we construct a bipartite graph \({\mathcal {G}}=({\mathcal {V}},{\mathcal {W}},{\mathcal {E}})\) with \((2m+n)\) nodes, where \(|{\mathcal {V}}|=m\) and \(|{\mathcal {W}}|=n+m\). We divide the node set \({\mathcal {W}}\) into two parts: \(|{\mathcal {W}}_1|=n\) and \(|{\mathcal {W}}_2|=m\). Each node \(i \in {\mathcal {V}}\) corresponds to a set \(S_i\) and each node \(j \in {\mathcal {W}}_1\) corresponds to an element \(u_j\). Since \(|{\mathcal {V}}|=|{\mathcal {W}}_2|=m\), there always exists a bijection \(f:{\mathcal {V}} \rightarrow {\mathcal {W}}_2\) such that for each \(i \in {\mathcal {V}}\), we can find a unique \(q \in {\mathcal {W}}_2\) corresponding to i. For each \(p \in {\mathcal {W}}_1\), there exists a directed edge \({\mathcal {E}}(i,p)\) if and only if \(u_p \in S_i\); for each \(q \in {\mathcal {W}}_2\), there exists a directed edge \({\mathcal {E}}(i,q)\) if and only if \(f(i)=q\). Therefore, the maximum coverage problem is equivalent to choosing \({\mathcal {B}}\) nodes in \({\mathcal {V}}\) that maximize the total number of neighbors in \({\mathcal {W}}\). For each \(i \in {\mathcal {V}}\) we have \(|{\mathcal {N}}(i)| = |S_i|+1\ge 1\), where \(|{\mathcal {N}}(i)|\) is the number of neighbors of node i in \({\mathcal {W}}\). Thus, for the instance \({\mathcal {I}}\) we have defined, given a fixed number \(0< \alpha < 1\), we can always find a positive number \(\theta \) such that \(\theta < \frac{\alpha (1-\alpha )}{\mathop {\max }\limits _{i \in {\mathcal {V}}}|{\mathcal {N}}(i)|}\). Then, we can easily transform it into an instance \({\mathcal {J}}\) of the PPR coverage maximization problem on graph \({\mathcal {G}}\) with random walk stopping probability \(\alpha \) and PPR threshold \(\theta \). Thus, if we can obtain the optimal solution of the PPR coverage maximization problem instance \({\mathcal {J}}\), then we can solve the maximum coverage problem instance \({\mathcal {I}}\). Since the maximum coverage problem is NP-hard [9], we can infer that the PPR coverage maximization problem is NP-hard. \(\square \)

A key property of the PPR coverage maximization problem is submodularity, whose definition is as follows.

Definition 3

(Submodular Function [18]) Given an arbitrary function \(f(\cdot )\) that maps subsets of a finite ground set U to non-negative real numbers, we say f is submodular if the marginal gain from adding an element to a set S is at least as high as the marginal gain from adding the same element to a superset of S. Formally, a submodular function satisfies:

$$\begin{aligned} f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T), \end{aligned}$$

for all elements v and all pairs of sets \(S \subseteq T\).

Theorem 2 shows the submodularity of the PPR coverage maximization problem.

Theorem 2

The PPR coverage maximization problem is non-negative, non-decreasing, and submodular.

3.2 Embedding diversity

Given the PPR coverage of each node, we can select B representative nodes by straightforwardly solving the above PPR coverage maximization problem. However, this solution only considers the structural information in the feature propagation process and does not incorporate feature information into the node selection strategy, which might lead to sub-optimal performance on downstream GNNs. To tackle this issue, we present a diversity function that aims to achieve the maximum coverage of the selected nodes in the embedding space.

Recent works [20, 44] have empirically revealed that GNNs mainly benefit from feature propagation rather than from non-linear activation. Inspired by this observation, we decouple the feature propagation from the GNN architecture. Therefore, we can efficiently generate embeddings for each node based on its own features with a parameter-free model, eliminating the time-consuming training process. Simultaneously, this decoupling enlarges the neighborhood of each node by taking more propagation steps. Formally, at each layer, the features are propagated between neighbors by the PPR diffusion matrix, which can also be regarded as the combination of the smoothed node representations and residual connections (cf. Sect. 2.3):

$$\begin{aligned} \varvec{H}^{(k+1)} = (1-\alpha )\varvec{{\hat{A}}}\varvec{H}^{(k)} + \alpha \varvec{H^{(0)}}, \end{aligned}$$
(3)

where \(\varvec{H^{(0)}}\) equals to the original feature matrix \(\varvec{X}\). Notice that in Eq. 3, we remove all the non-linear operations during the feature propagation, facilitating efficient embedding generation.
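For concreteness, the following is a minimal sketch of this parameter-free propagation (illustrative only, using SciPy sparse matrices; the function name and default parameters are hypothetical choices rather than the paper's settings).

```python
import numpy as np
import scipy.sparse as sp

def ppr_propagate(A, X, alpha=0.1, K=10):
    """Parameter-free PPR feature propagation of Eq. 3: H^{(k+1)} = (1-alpha) A_hat H^{(k)} + alpha X."""
    n = A.shape[0]
    A_hat = A + sp.eye(n)                                             # add self-loops
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.asarray(A_hat.sum(axis=1)).ravel()))
    A_hat = d_inv_sqrt @ A_hat @ d_inv_sqrt                           # symmetric normalization
    H = X.copy()
    for _ in range(K):
        H = (1 - alpha) * (A_hat @ H) + alpha * X                     # power iteration of Eq. 3
    return H                                                          # rows are the embeddings used for dist(u, v)
```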

Intuitively, if a pair of nodes are close in the embedding space, they are likely to receive the same label at inference time. Given the node embeddings computed by Eq. 3 after \({\mathcal {K}}\) iterations, we select the Euclidean distance as the distance measure:

$$\begin{aligned} dist(u,v)=\Vert \varvec{H}_u^{({\mathcal {K}})} - \varvec{H}_v^{({\mathcal {K}})}\Vert _2, \end{aligned}$$

where \(\varvec{H}_u^{({\mathcal {K}})}\) and \(\varvec{H}_v^{({\mathcal {K}})}\) denote the embedding vectors of node u and v, respectively.

Given a pre-defined distance d, if \(dist(u,v) \le d\), we say nodes u and v are d-reachable to each other. If more nodes are d-reachable to node u in the embedding space than to node v, then node u is more representative than node v. Motivated by this, let D(u) denote the set of nodes that are d-reachable to u; the embedding diversity of node u, denoted as |D(u)|, is then defined as the number of nodes that are d-reachable to u. Similarly, D(S) denotes the set of nodes that are d-reachable to at least one node in S. Naturally, the diversity of set S, denoted as |D(S)|, is the total number of nodes that are d-reachable to at least one node in S. More formally,

$$\begin{aligned} D(S)= \bigcup _{u \in S} \{ v | dist(u,v) \le d, \forall v \in {\mathcal {V}}\}. \end{aligned}$$
(4)
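A direct, quadratic-time sketch of Eq. 4 (shown only to make the definition concrete; Sect. 4.2 replaces this brute-force computation) is:

```python
import numpy as np

def d_reachable_set(H, S, d):
    """D(S): nodes within Euclidean distance d of at least one node in S (H is the n x F embedding matrix)."""
    covered = set()
    for u in S:
        dist = np.linalg.norm(H - H[u], axis=1)     # distances from u to all nodes
        covered.update(np.flatnonzero(dist <= d).tolist())
    return covered
```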

Theorem 3 shows the submodularity of the embedding diversity function.

Theorem 3

The embedding diversity function is non-negative, non-decreasing, and submodular.

3.3 Objective function

In this paper, we propose an objective function that combines the PPR coverage and embedding diversity in our node selection framework. Formally, the objective function of PPR-based diversified coverage maximization is:

$$\begin{aligned} \mathop {\max }\limits _{|S|=B} f(S)=(1-\gamma )\frac{|\phi (S)|}{{\hat{\phi }}}+\gamma \frac{|D(S)|}{{\hat{D}}}, \end{aligned}$$
(5)

where \({\hat{\phi }}=\max _{u\in {\mathcal {V}}_{train}}|\phi (u)|\) and \({\hat{D}}=\max _{u\in {\mathcal {V}}_{train}}|D(u)|\). Here, \(\gamma \) is a hyper-parameter that balances the two components. In our experiments, we directly set \(\gamma = 0.5\), so that both components are of equal importance.

Remark

There exist many factors that may influence the accuracy of downstream GNNs. In this paper, we focus on a diversified subset selection problem, which is an important direction in the active learning literature. For instance, Brinker et al. [3] incorporate a diversity measure to maximize the angles within a set of hyperplanes. Yang et al. [50] explicitly impose a diversity constraint on the objective. Tang et al. [33] first introduce the diversified social influence maximization problem, primarily focusing on the traditional influence maximization problem [18] with diversity based on the category distribution of each node. In GRAIN [54], the authors employ a complicated influence definition and do not consider the influence from the graph topology perspective, for which it is hard to derive an unbiased estimation. In contrast, FICOM introduces a novel perspective on the node importance criterion by considering a node's influence during the GNN feature propagation process and the diversity it contributes in the embedding space. Moreover, in the diversity definition of [33], the diversity of set S considers all nodes in \(\phi (S)\), i.e., all nodes that are influenced by set S. This complicated definition again makes it difficult to estimate the coverage score in an unbiased manner. In our FICOM, to enable more efficient calculation, we utilize a relaxed diversity that only focuses on set S rather than \(\phi (S)\). The intuition behind this relaxation is that if set S is diverse, then set \(\phi (S)\) tends to be diverse as well.

Since in Theorems 2 and 3, \(|\phi (S)|\) and |D(S)| are proved to be non-negative, monotone (non-decreasing), and submodular functions, and the positive linear combination of submodular functions is also submodular [27], we can easily derive the following theorem.

Theorem 4

The objective function of FICOM is non-negative, monotone, and submodular.

Node selection: Submodularity is one of the important properties of our objective function, enabling a greedy algorithm with \((1-1/e)\)-approximation guarantee based on the following theorem.

Theorem 5

( [18, 27]) For a non-negative, monotone submodular function f, let S be a set of size B obtained by selecting elements one by one, each time choosing an element that provides the largest marginal increase in the function value. Let \(S^*\) be a set that maximizes the value of f over all sets of size B. Then,

$$\begin{aligned} f(S) \ge (1-1/e)\cdot f(S^*). \end{aligned}$$
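As a baseline illustration of this greedy scheme before any pruning, the following sketch (not FICOM itself; phi_of and D_of are hypothetical callables returning \(\phi (\{u\})\) and \(D(\{u\})\), and phi_max, d_max stand for \({\hat{\phi }}\), \({\hat{D}}\)) selects B nodes by maximum marginal gain of Eq. 5.

```python
def greedy_select(nodes, B, phi_of, D_of, phi_max, d_max, gamma=0.5):
    """Plain greedy for Eq. 5: repeatedly add the node with the largest marginal gain."""
    S, cov_phi, cov_D = [], set(), set()
    for _ in range(B):
        def gain(u):  # marginal gain f(S + {u}) - f(S)
            return ((1 - gamma) * len(phi_of(u) - cov_phi) / phi_max
                    + gamma * len(D_of(u) - cov_D) / d_max)
        u_star = max((u for u in nodes if u not in S), key=gain)
        S.append(u_star)
        cov_phi |= phi_of(u_star)   # update the covered nodes
        cov_D |= D_of(u_star)       # update the d-reachable nodes
    return S
```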

However, it is rather expensive to do the greedy selection since it requires knowledge of the objective function score for each node. This needs the computation of all-pair distances and the computation of all PPR scores, which is not scalable to large networks. To tackle this issue, we present an approximate solution to scale our FICOM to large networks.

4 FICOM framework

Algorithm 1
figure a

Approx-PPR-Coverage

To solve the optimization problem in Eq. 5, we first need to calculate the maximum PPR coverage \({\hat{\phi }}\) and the maximum diversity \({\hat{D}}\). Then, a standard greedy algorithm [27] requires finding the node with the largest marginal gain among O(n) nodes. Specifically, exactly computing the PPR coverage and the diversity of a node u takes \(O(m\log n)\) and O(nF) time, respectively. Therefore, selecting B nodes with maximum marginal gain already takes \(O(Bn(m\log n+nF))\), even if we omit the updating cost. This quadratic cost is prohibitive on large graphs. To tackle this issue, we present an efficient solution that takes linear time with respect to the graph size while still providing a \((1-1/e)\)-approximation guarantee. The main idea of FICOM is as follows: we first adaptively maintain the lower- and upper-bound of the objective value for each node. Next, we prune the less important nodes based on the estimated values. Then, we only need to compute the exact values for a small subset of candidate nodes and select the node among them. We provide details of the \({\hat{\phi }}\) computation in Sect. 4.1, the \({\hat{D}}\) computation in Sect. 4.2, and the node selection strategy in Sect. 4.3.

Algorithm 2
figure b

Candidate-Selection

4.1 Maximum PPR coverage computation

Recall from Sect. 1 that the PPR value \(\varvec{\pi }_u(v)\) is the probability that an \(\alpha \)-discount random walk starting from node u stops at node v, where such a random walk either stops at the current node with probability \(\alpha \) or randomly moves to one of its out-neighbors with probability \((1-\alpha )\). Recent studies [14, 15, 23, 25, 38–40, 42, 43, 45] present approximate algorithms that combine different techniques with theoretical guarantees for high-precision PPR computation. However, the objective of FICOM requires computing the PPR values for each node pair to find the node with the largest PPR coverage score and derive \({\hat{\phi }}\). To tackle this issue, we introduce a novel algorithm that first prunes the nodes with small PPR coverage values by estimating the upper- and lower-bound of \(\phi (u)\) for each u.

Approximate PPR coverage: Following the existing studies on approximate PPR estimation [23, 25, 39, 41, 42, 45], we adopt the classic forward push + random walk framework to estimate the PPR scores for each node u. However, all existing work initializes the forward push and then simulates the random walks step by step. In our solution, we mix these two parts and initialize \(\omega =\frac{3\log {(1/p_f)}}{\epsilon ^2\delta }\) random walks for each node s. Then, it either pushes or simulates random walks to derive the estimation with improved efficiency.

Algorithm 1 shows the pseudo-code of the approximate PPR coverage algorithm. Given a source node s, we sample \(\omega \) \(\alpha \)-discount random walks starting from node s. A stopped vector \(\varvec{s_t}\) is maintained to store the portion of random walks that have stopped at each node. Besides, an additional sample vector \(\varvec{s_a}\) is maintained, indicating the portion of \(\alpha \)-discount random walks that currently stay at each node but have not stopped yet. Thus, once all entries of \(\varvec{s_a}\) are zero, \(\varvec{s_t} / \omega \) gives the portion of random walks that finally stop at each node.

Initially, all entries in \(\varvec{s_a}\) are set to zero except for \(\varvec{s_a}(s)=\omega \) (Lines 1–2), indicating that all \(\alpha \)-discount random walks initially stay at the source node s and have not stopped yet. If the source node s has no neighbor, all random walks immediately stop (Lines 3–4); otherwise, node s is pushed into queue Q (Line 6). Then, while Q is not empty, the first node v in Q is removed (Line 8). If the current number of unstopped random walks at node v is above its degree, a forward push operation is invoked on node v (Lines 9–15). In particular, it first converts \(\alpha \cdot \varvec{s_a}(v)\) to \(\varvec{s_t}(v)\) (Line 10). To explain, an \(\alpha \) portion of the random walks \(\varvec{s_a}(v)\) stop at the current node v. Next, the remaining \((1-\alpha )\) portion of the random walks \(\varvec{s_a}(v)\) move equally to the out-neighbors of the current node v (Lines 11–14). Thus, for each u that is an out-neighbor of v, the sample value \(\varvec{s_a}(u)\) is increased by \((1-\alpha )\cdot \varvec{s_a}(v)/|{\mathcal {N}}(v)|\). After the forward push operation, the number of unstopped samples at v, i.e., \(\varvec{s_a}(v)\), is set to zero (Line 15). This process terminates when Q is empty. Then, for the random walks that have not stopped yet at each node \(v \in {\mathcal {V}}\), all these \(\alpha \)-discount random walks are restarted from node v until each random walk stops (Lines 16–19). Specifically, for each destination node u, it adds \(\frac{\varvec{s_a}(v)}{\lceil \varvec{s_a}(v) \rceil }\) to \(\varvec{s_t}(u)\) (Line 19). Finally, if the stopped random-walk mass at node v satisfies \(\varvec{s_t}(v) > \omega \theta (1-\epsilon )\), v is added into \(\phi ^+(s)\); if it is above \(\omega \theta (1+\epsilon )\), v is further added into \(\phi ^-(s)\) (Lines 20–24). The portion of \(\alpha \)-discount random walks that stop at node v is \(\frac{\varvec{s_t}(v)}{\omega }\). Thus, it can be regarded as an approximation of the true PPR value \(\varvec{\pi }_s(v)\). With the approximate PPR coverage algorithm, for any \(v \in {\mathcal {V}}\) with \(\varvec{\pi }_s(v)>\delta \), \(\frac{\varvec{s_t}(v)}{\omega }\) achieves \(\epsilon \) relative error with respect to \(\varvec{\pi }_s(v)\) with high probability, as summarized in the following theorem rephrased from [23, 42].
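The sketch below is a Python rendering of the description above (not the authors' implementation; walks that reach a node without out-neighbors simply stop there). It combines the push step with explicit walk simulation and returns the two coverage bound sets.

```python
import math
import random
from collections import deque

def approx_ppr_coverage(adj, s, alpha, theta, eps, omega):
    """Estimate the PPR coverage bounds of source s via a hybrid push + random-walk scheme.

    adj: dict node -> list of neighbors; omega: number of alpha-discount walks started at s.
    Returns (phi_minus, phi_plus): nodes whose estimated PPR w.r.t. s is clearly / possibly above theta.
    """
    s_a = {s: float(omega)}                 # mass of walks that are still moving
    s_t = {}                                # mass of walks that have stopped
    if not adj.get(s):
        s_t[s] = float(omega)               # no neighbors: every walk stops at s
    else:
        Q, in_q = deque([s]), {s}
        while Q:
            v = Q.popleft(); in_q.discard(v)
            deg = len(adj[v])
            if s_a.get(v, 0.0) >= deg:      # enough mass at v: deterministic push
                s_t[v] = s_t.get(v, 0.0) + alpha * s_a[v]
                share = (1 - alpha) * s_a[v] / deg
                for u in adj[v]:
                    s_a[u] = s_a.get(u, 0.0) + share
                    if adj.get(u) and s_a[u] >= len(adj[u]) and u not in in_q:
                        Q.append(u); in_q.add(u)
                s_a[v] = 0.0
        for v, mass in s_a.items():         # finish the residual mass by sampling walks
            if mass <= 0.0:
                continue
            walks = math.ceil(mass)
            w = mass / walks                # each walk carries a fractional weight
            for _ in range(walks):
                cur = v
                while random.random() > alpha and adj.get(cur):
                    cur = random.choice(adj[cur])
                s_t[cur] = s_t.get(cur, 0.0) + w
    phi_minus = {v for v, x in s_t.items() if x > omega * theta * (1 + eps)}
    phi_plus = {v for v, x in s_t.items() if x > omega * theta * (1 - eps)}
    return phi_minus, phi_plus
```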

Theorem 6

( [42]) By sampling \(\omega =\frac{3\log (1/p_f)}{\epsilon ^2\delta }\) \(\alpha \)-discount random walks from a source node s, for any v with \(\varvec{\pi }_s(v)>\delta \), the approximate PPR coverage algorithm guarantees that \(|\varvec{{\hat{\pi }}}_s(v)-\varvec{\pi }_s(v)|\le \epsilon \varvec{\pi }_s(v)\) holds with \((1-p_f)\) probability in \(O\left( \frac{\log (1/p_f)}{\epsilon ^2\delta }\right) \) time.

By Theorem 6, we can derive the upper- and lower-bound of \(\phi (s)\) as follows. (i) For a node v with \(\varvec{{\hat{\pi }}}_s(v)>\theta (1+\epsilon )\), we have \(\varvec{\pi }_s(v)>\theta \) with \((1-p_f)\) probability, i.e., \(v \in \phi (s)\) with \((1-p_f)\) probability. Then, node v is added into \(\phi ^-(s)\), a subset of nodes whose PPR scores are no smaller than \(\theta \) with high probability. Thus, \(|\phi ^-(s)|\) is a lower bound of the PPR coverage of node s. (ii) For a node u that satisfies \(\varvec{{\hat{\pi }}}_s(u)<\theta (1-\epsilon )\), we have \(\varvec{\pi }_s(u)<\theta \) with \((1-p_f)\) probability, i.e., \(\phi (s)\) does not contain u with \((1-p_f)\) probability. For the nodes with \(\omega \theta (1+\epsilon ) \ge \varvec{s_t}(u)\ge \omega \theta (1-\epsilon )\), node u might or might not be in \(\phi (s)\). In this case, we add all such nodes into \(\phi ^+(s)\). Furthermore, we add all nodes with \(\varvec{s_t}(u) \ge \omega \theta (1+\epsilon )\) to \(\phi ^+(s)\). Then, \(|\phi ^+(s)|\) provides an upper bound on the PPR coverage of s.

Algorithm 3
figure c

Exact-PPR-Coverage(\(US, \theta , s\))

Candidate selection: The approximation error \(\epsilon \) in Algorithm 1 can be set to control the estimation of the PPR coverage for each node u in a given searching pool S. The goal of the candidate selection is to eliminate the less important nodes with small objective values, which can be either PPR coverage or diversity.

Algorithm 2 shows the process of candidate selection. Given vector \(\varvec{g}^-\) that stores the lower-bound estimation of the set size (e.g., PPR coverage or diversity) for each node \(u \in S\), and vector \(\varvec{g}^+\) that stores the upper-bound estimation of the set size for each node \(u \in S\), it first finds the maximum lower bound \(g_{\max }\) of \(\varvec{g}^-(u)\) (Lines 2–4). Then, it adds node u into the candidate set C if \(\varvec{g}^+(u)\) is larger than \(g_{\max }\) (Lines 5–7). For example, to compute the maximum PPR coverage score \({\hat{\phi }}\), we feed \(\varvec{g}^-\) (resp. \(\varvec{g}^+\)) as the vector of the lower-bound (resp. upper-bound) of the PPR coverage for each node. Then, we can use Algorithm 2 to find a candidate set C of nodes whose PPR coverage scores could potentially be the maximum score.
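A compact sketch of this pruning step follows (illustrative only; keeping ties with >= is our choice so that the node attaining the best lower bound always survives).

```python
def candidate_selection(S, g_lower, g_upper):
    """Algorithm 2 sketch: keep nodes whose upper bound can still beat the best lower bound."""
    g_max = max(g_lower[u] for u in S)              # largest lower bound over the pool
    return {u for u in S if g_upper[u] >= g_max}    # the remaining nodes can never be the maximizer
```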

Exact PPR coverage: The candidate selection algorithm significantly reduces the number of candidate nodes. After the selection, we only calculate the exact PPR coverage for each node in the candidate set C.

The pseudo-code of the exact PPR coverage algorithm can be found in Algorithm 3. It first adopts an exact PPR computation algorithm, e.g., the classic Power-Iteration [28] or the recently proposed PowerPush algorithm [45], to calculate single-source PPR values for a given source node s (Line 2). Then, it counts the total number of nodes in a given subset US (US is equal to \({\mathcal {V}}\) in Algorithm 4) whose \(\hat{\varvec{\pi }}_s(v) > \theta \) for the given source node s. Finally, it returns \(\phi (s)\) as the exact PPR coverage set of node s.

Algorithm 4
figure d

Maximum-PPR-Coverage

Maximum PPR coverage computation: After calculating the exact PPR coverage of the nodes in the candidate set C, we directly take the largest PPR coverage value among the nodes in C as the maximum PPR coverage value \({\hat{\phi }}\) (recall the definition in Eq. 5).

Algorithm 4 shows the pseudo-code of the maximum PPR coverage \({\hat{\phi }}\) computation. In the beginning, we set the candidate set C to be the whole training set \({\mathcal {V}}_{train}\) (Line 1). Next, while the size of the current candidate set C is above a given threshold \(\theta _c\), the candidate selection process proceeds (Lines 2–7). In particular, the error parameter \(\epsilon \) of Algorithm 1 is first halved. Then, for each node u, we obtain the upper-bound \(\phi ^+(u)\) and the lower-bound \(\phi ^-(u)\) of its coverage set (Lines 4–6). Two size vectors \(\varvec{\varphi }^+\) and \(\varvec{\varphi }^-\) are maintained to store the upper- and lower-bound size estimations, respectively. After that, the candidate selection algorithm (Algorithm 2) is invoked to prune the nodes with small PPR coverage values. This process terminates when the size of C is no larger than \(\theta _c\). Finally, Algorithm 3 is invoked for each node v in the candidate set C to obtain its exact PPR coverage score (Lines 8–11). The largest PPR coverage score in the candidate set C is returned as \({\hat{\phi }}\).
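The following sketch condenses this loop (illustrative only; approx_bounds and exact_coverage are hypothetical callables standing in for Algorithms 1 and 3, and the lower cutoff on \(\epsilon \) is an added safety guard).

```python
def maximum_ppr_coverage(train_nodes, theta_c, approx_bounds, exact_coverage, eps0=0.5):
    """Shrink the candidate set by halving eps until it is small, then compute exact coverage.

    approx_bounds(u, eps) -> (phi_minus, phi_plus); exact_coverage(u) -> the exact set phi(u).
    """
    C, eps = set(train_nodes), eps0
    while len(C) > theta_c and eps > 1e-6:
        eps /= 2.0                                            # tighten the PPR estimates
        lo = {u: len(approx_bounds(u, eps)[0]) for u in C}    # |phi^-(u)|
        hi = {u: len(approx_bounds(u, eps)[1]) for u in C}    # |phi^+(u)|
        g_max = max(lo.values())
        C = {u for u in C if hi[u] >= g_max}                  # Algorithm 2 applied to the bounds
    return max(len(exact_coverage(u)) for u in C)             # hat{phi}
```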

Theorem 7

It takes \(O\left( \frac{n\log {(1/p_f)}}{\epsilon ^2 \delta }+ \theta _c \cdot c_{exact-ppr}\right) \) cost to return \({\hat{\phi }}\) in Algorithm 4, where \(c_{exact-ppr}\) is the running cost of the exact PPR algorithm, which is generally bounded by \(O\left( m\cdot \log {n}\right) \).

It is easy to derive the above result since it takes \(O\left( \frac{\log {(1/p_f)}}{\epsilon ^2 \delta }\right) \) cost for each node to compute the approximate PPR coverage. Then, it computes the exact PPR score for each node \(v\in C\), where the size of C is bounded by \(\theta _c\) according to Algorithm 4.

Algorithm 5
figure e

BKS-Insertion-Deletion

4.2 Maximum diversity computation

A straightforward solution for the computation of the second component in our objective function, i.e., diversity, requires calculating all-pair distances of the embeddings. However, the time complexity is \(O(n^2F)\), which is too time-consuming in practice. To reduce the time cost, we utilize the Bottom-k sketch (BKS) for diversity estimation. With careful initialization of the bottom-k sketch, a vast number of nodes with small diversity values can be pruned in advance, which significantly reduces the running cost and scales our FICOM to large graphs. We first revisit the definition of the bottom-k sketch. Given a finite universe U, let h be a hash function that maps each element \(u \in U\) to a random value in [0, 1]. For any subset \(S \subseteq U\), we define \(b_i(S)\) to be the i-th smallest hash value among the hash values of all elements in S. Formally, the BKS [7] of a subset \(S \subseteq U\) is defined as follows:

Definition 4

(Bottom-k sketch [7]) Let n be the size of U. Given a single hash function h that maps each element to a uniform random value in [0, 1], the bottom-k sketch of a subset \(S \subseteq U\), i.e., \(BKS_S\), is a list of entries \(b_i(S)\), for \(i\le k\).

Notice that according to the above definition, the size of the bottom-k sketch might be smaller than k if |S| is smaller than k. In our setting, the universe is \({\mathcal {V}}\). Given an arbitrary node v, we consider the d-reachable set D(v) of node v and aim to use the bottom-k sketch of the d-reachable set D(v) to estimate the size |D(v)|. The BKS is one of the state-of-the-art schemes for cardinality estimation. We have the following theorem for the approximation guarantee [7].

Theorem 8

For a given set S, an estimate \({\hat{s}}\) of the size |S| with \(\epsilon \) relative error guarantee, i.e., \(|{\hat{s}}-|S||\le \epsilon \cdot |S|\), can be derived by \(BKS_S\) as follows:

$$\begin{aligned} {\hat{s}} = {\left\{ \begin{array}{ll} |BKS_S|, &{} \text {if } |BKS_S| < k, \\ \frac{k-1}{b_k(S)}, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(6)

with \((1-p_f)\) probability if we set \(k=\lceil \frac{1+\epsilon }{\epsilon ^2p_f} \rceil \).
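A tiny self-contained demonstration of this estimator (illustrative only; the universe size, set size, and k are arbitrary choices) is given below.

```python
import random

def bottomk_estimate(S, h, k):
    """Eq. 6: return |S| exactly when |S| < k, otherwise (k-1)/b_k(S)."""
    sketch = sorted(h[x] for x in S)[:k]       # the k smallest hash values of S
    if len(sketch) < k:
        return len(sketch)
    return (k - 1) / sketch[-1]

h = {x: random.random() for x in range(100_000)}     # one shared uniform hash per element
S = set(random.sample(range(100_000), 10_000))
print(bottomk_estimate(S, h, k=200))                 # close to 10,000 in expectation
```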

Algorithm 6
figure f

BKS-Estimation

Efficient adaptive BKS computation: As previously mentioned, we adopt BKS to estimate |D(v)| in our FICOM. However, obstacles remain if we want to apply this technique to the d-reachable set size estimation for each node v. To explain, computing the BKS of D(v) requires D(v) itself, which already takes O(n) time. If we compute the BKS in a brute-force manner, it is no better than the naive method that computes the distances for all node pairs.

To tackle this issue, our key observation is that only the k nodes with the smallest hash values affect the estimation of |D(v)|. Motivated by this observation, we first sort the nodes in \({\mathcal {V}}\) into an ascending-order array L according to their hash values. Then, we compute the bottom-k sketch along the array L, which helps efficiently prune nodes with small diversity values, as will be shown shortly. The construction is as follows. Given that we are at position idx of the array L, we compute the d-reachable set D(L[idx]) of L[idx] and insert L[idx] into BKS(u)Footnote 3 for each \(u\in D(L[idx])\). Next, we move from idx to \(idx+1\) and repeat the same steps. Notice that even though we only need the k smallest hash values to derive the estimation of D(v), we maintain all the hash values in increasing order so as to support efficient deletion.

In our design, we take advantage of the fact that the nodes are inserted in increasing hash-value order. We call this the monotone-increasing property. It allows us to insert with O(1) cost and delete with amortized O(1) cost. The deletion operation supports later updates: when a node v is already d-reachable to a selected node w, we need to remove v from BKS(u) for every node u in the d-reachable set of v in order to re-estimate the d-reachable set size of u. Another advantage of the monotone-increasing property in the bottom-k sketch construction is that we can always underestimate the value \(b_k(D(u))\) even if \(|BKS(u)|< k\), and thus we can derive an upper bound of the size |D(u)| for any \(u\in {\mathcal {V}}\). In particular, if BKS(u) already contains k entries, we can already derive its size estimation. If \(|BKS(u)|<k\), then \(b_k(D(u))\) will be no smaller than \(h(L[idx+k-|BKS(u)|-1])\). To explain, even if the next \((k-|BKS(u)|)\) nodes in the array are d-reachable to u, the size of BKS(u) just reaches k. Thus, we can derive an upper bound of the estimated size by assuming that \(b_k(D(u))\) equals \(h(L[idx+k-|BKS(u)|-1])\). Besides, we can derive a trivial lower bound of |D(u)| for each node u by taking the current size of BKS(u). We will use these upper- and lower-bounds to effectively prune nodes. Below, we summarize the operations supported in our BKS for an arbitrary node u.

  • insert(v): insert node v into BKS(u);

  • delete(v): delete node v from BKS(u);

  • upper(idx): estimate the upper-bound of |D(u)| according to node \(L[idx+k-|BKS(u)|-1]\), whose hash value is an underestimate of \(b_k(D(u))\);

  • lower(): estimate the lower-bound of |D(u)| according to Eq. 6;

Algorithm 7
figure g

Exact-Diversity(\(US, s, d\))

To fully exploit the monotone-increasing property and support O(1) insertion and amortized O(1) deletion, we need the following variables listed in \(init(\epsilon , p_f)\) (Algorithm 5 Lines 2–3) to store the BKS information. The size variable stores |BKS(u)|. A variable \({\underline{max}}\) is maintained to store the node with the min(k, size)-th smallest hash value in BKS(u), since it is the only hash value useful for estimation. The linked list pool is utilized for temporarily storing the excess nodes, i.e., the nodes with hash values larger than \(b_k(D(u))\). The set invalid is used for lazy deletion. To explain, when we do not have to delete a hash value immediately, we just put it in invalid; when we have to delete a hash value explicitly, we delete all the hash values that are in invalid as a side-product. We will explain shortly when the deletion is done explicitly.

Algorithm 5 shows the pseudo-code of the basic operations of the BKS. For insertion operation, if the current size of BKS(u) is less than k, it simply updates the max node from the previous node to the newly inserted node v (Lines 5–6); otherwise, it adds v into the linked list pool (Lines 7–9). Finally, we increase the size by 1 (Line 10). For deletion operation, if the current size is no larger than k, we do not have to update max, since max is used for estimation only when \(|BKS(u)|\ge k\). If the current size is larger than k, i.e., pool is not empty, it needs to compare h(v) with h(max) (Lines 12–19). In particular, if \(h(v) > h(max)\), it only adds v into the invalid set (Lines 13–14); otherwise, the old max node has to be updated, which means we have to do deletion explicitly (Lines 15–19). To explain, we scan across the linked list pool, delete the nodes that are in invalid and stop when we first meet a node that is not in invalid. Then, we set the node pool.head we stopped at to be the new max node, and move pool.head to its next position. Finally, we decrease the size by 1 (Line 20). The following theorem shows that with monotone-increasing property, BKS achieves O(1) insertion cost and amortized O(1) deletion cost.

Algorithm 8
figure h

Maximum-Diversity

Theorem 9

The BKS achieves O(1) insertion time and amortized O(1) deletion time complexity.

The above theorem is easy to verify since each insertion into BKS(u) clearly takes O(1) time. For deletion, if the node to be deleted is in the pool list, the cost is simply O(1) to mark it as invalid. When a deletion forces the max node to be updated, the nodes previously marked as invalid are explicitly removed during the scan. Each deletion operation is thus charged at most twice, so the amortized deletion cost is O(1).

The main purpose of maintaining BKS(u) is to estimate the size of D(u). Algorithm 6 shows the pseudo-code of the two estimation functions that are used for the BKS. BKS(u).lower() returns the lower-bound estimation of |D(u)| and BKS(u).upper(idx) returns the upper-bound estimation of |D(u)|. For lower-bound estimation, if the current size of BKS(u) is smaller than k, BKS(u).size will be directly returned as the estimation (Lines 2–3); otherwise, it will return a lower-bound of the size (Lines 4–5), which is the estimation using Eq. 6 divided by \((1+\epsilon )\). For upper-bound estimation, if the current size of BKS(u) is no less than k, which means h(max) is exactly \(b_k(D(u))\), it will return an upper-bound of the size (Lines 9–10), i.e., the estimation using Eq. 6 divided by \((1-\epsilon )\); otherwise, we underestimate the still unknown \(b_k\) by \(h(L[idx+k-size-1])\) and hence overestimate the size of D(u) (Lines 7–8). The following theorem shows that when the size of BKS(u) is no less than k, we provide a lower- and upper-bound for the size of D(u) with high probability.
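Putting the insertion, lazy deletion, and estimation pieces together, the following class is a minimal Python sketch of the adaptive BKS (a rendering of the description above rather than the authors' implementation; it assumes nodes arrive in increasing hash order, member names follow the prose, and the exact placement of the \((1\pm \epsilon )\) corrections reflects our reading of Algorithm 6).

```python
from collections import deque

class BKS:
    """Adaptive bottom-k sketch of D(u) with O(1) insert and amortized O(1) lazy delete."""

    def __init__(self, k, h, eps):
        self.k, self.h, self.eps = k, h, eps   # sketch size, hash table, relative error
        self.size = 0                          # current |BKS(u)|
        self.max_node = None                   # node holding the min(k, size)-th smallest hash
        self.pool = deque()                    # excess nodes, in insertion (= hash) order
        self.invalid = set()                   # lazily deleted pool nodes

    def insert(self, v):                       # O(1): v has the largest hash seen so far
        if self.size < self.k:
            self.max_node = v
        else:
            self.pool.append(v)
        self.size += 1

    def delete(self, v):                       # amortized O(1) via lazy deletion
        if self.size > self.k:
            if self.h[v] > self.h[self.max_node]:
                self.invalid.add(v)            # v sits in the pool: mark it, remove later
            else:
                # v leaves the top-k window: promote the first valid pool node to max
                while self.pool and self.pool[0] in self.invalid:
                    self.invalid.discard(self.pool.popleft())
                if self.pool:
                    self.max_node = self.pool.popleft()
        self.size -= 1

    def lower(self):                           # lower-bound estimate of |D(u)|
        if self.size < self.k:
            return self.size
        return (self.k - 1) / self.h[self.max_node] / (1 + self.eps)

    def upper(self, L, idx):                   # upper-bound estimate; L is the hash-sorted node array
        if self.size >= self.k:
            b_k = self.h[self.max_node]        # exact b_k(D(u))
        else:
            # underestimate the still unknown b_k by the hash at position idx + k - size - 1
            pos = min(idx + self.k - self.size - 1, len(L) - 1)
            b_k = self.h[L[pos]]
        return (self.k - 1) / b_k / (1 - self.eps)
```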

Theorem 10

Let div(u) be the estimated diversity of node u derived by BKS(u) according to Eq. 6. With \(k=\lceil \frac{1+\epsilon }{\epsilon ^2p_f} \rceil \), when the size of the BKS(u) is no less than k, it returns an upper-bound estimation \(div(u)/(1-\epsilon )\) and a lower-bound estimation \(div(u)/(1+\epsilon )\) of |D(u)| with \((1-p_f)\) probability.

Algorithm 9
figure i

Greedy-Selection

Maximum diversity computation: Recall that the objective function in Eq. 5 needs the maximum diversity score \({\hat{D}}\). Algorithm 8 shows the pseudo-code of how to derive \({\hat{D}}\). It first initializes the candidate set C to be the whole training set \({\mathcal {V}}_{train}\) (Line 1). While the size of the current candidate set C is above a given threshold \(\theta _c\), it updates the BKS of each node and thus updates C (Lines 2–12). Specifically, it computes the d-reachable sets of B nodes, and for each computed node u, it inserts u into BKS(v) for each node \(v \in D(u)\). Then, for each node \(u \in C\), we obtain the upper-bound (resp. lower-bound) of its diversity, \(\varvec{div}^+(u)\) (resp. \(\varvec{div}^-(u)\)), using the upper(idx) (resp. lower()) estimation operation of the BKS (Lines 10–11). After that, the Candidate Selection algorithm (Algorithm 2) is invoked to prune the nodes with small diversity values (Line 12). When there are still more than \(\theta _c\) nodes in the candidate set C, we sample B more nodes from L, i.e., those with the next B smallest hash values in \({\mathcal {V}}\), and repeat the above process. This process terminates when the size of C is no larger than \(\theta _c\). Finally, it calculates the exact diversity for each node \(u \in C\) to get the exact size of D(u) (Line 13). The largest |D(u)| is returned as \({\hat{D}}\) (Line 14). Let \(B_{f}\) denote the total number of sampled nodes and F the dimension of the feature vector. The following theorem shows the cost of Algorithm 8.

Theorem 11

It takes \(O\left( n\cdot F\cdot (B_f+\theta _c)\right) \) time to return \({\hat{D}}\) in Algorithm 8.
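
The loop structure of Algorithm 8 can be summarized by the sketch below; the sketch objects are assumed to expose the insert/lower()/upper(idx) interface of Algorithm 6, and the pruning rule shown (keep u only while \(\varvec{div}^+(u)\) can still match the best \(\varvec{div}^-\)) is one plausible reading of the candidate selection step, not a verbatim transcription of Algorithm 2.

```python
def estimate_max_diversity(V_train, L, hash_of, B, theta_c, bks, d_reachable):
    """Illustrative sketch of Algorithm 8's pruning loop (not the exact code).

    L: all nodes sorted by increasing hash value; hash_of: node -> hash value;
    bks: node -> sketch object; d_reachable: node -> set of d-reachable nodes.
    """
    C = set(V_train)
    idx = 0                                    # next position to sample from L
    while len(C) > theta_c and idx < len(L):
        for u in L[idx: idx + B]:              # sample B more nodes by hash order
            for v in d_reachable(u):           # u contributes to BKS(v) for v in D(u)
                bks[v].insert(u, hash_of(u))
        idx += B
        div_lo = {u: bks[u].lower() for u in C}
        div_hi = {u: bks[u].upper(idx) for u in C}
        best_lo = max(div_lo.values())
        C = {u for u in C if div_hi[u] >= best_lo}   # candidate selection / pruning
    # Exact diversity of the survivors; the largest value is returned as D-hat.
    return max(len(d_reachable(u)) for u in C)
```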

4.3 Node selection

Finally, we turn to node selection. The main idea is again to adaptively derive progressively tighter upper and lower bounds for each node so that we can identify the node with the largest marginal gain. Notice that we do not need to derive the upper and lower bounds of the PPR coverage and embedding diversity values from scratch. Instead, we reuse the bounds of the PPR coverage (resp. diversity score) derived for each node in Sect. 4.1 (resp. Sect. 4.2). Next, we explain the details of our greedy algorithm.

Algorithm 10 Update

Greedy algorithm: The greedy selection process consists of two components: selection and update. The idea behind greedy selection is the same as in the maximum PPR coverage and maximum diversity computations. We first use estimated objective values to select a small subset of nodes and add them to the candidate set. Then, we iteratively select one node from C and update the affected sets and values for the next round of selection.

Algorithm 9 shows the pseudo-code of the greedy selection algorithm. It reuses variables and results from Algorithms 4 and 8 (Line 1). Then, a three-phase selection process is carried out to select each of the B nodes. In Phase 1, it utilizes the estimated \(\phi \) and D of each node \(u \in {\mathcal {V}}_{train} \) to obtain the upper and lower bounds of f(u) (Lines 4–5), where f(u) is the marginal gain of u given the selected set S. Next, a candidate selection process is invoked on \(\varvec{f}^-\) and \(\varvec{f}^+\) for prescreening (Line 6). After that, while the size of the candidate set is larger than \(\theta _c\), it repeatedly tightens the upper and lower bounds of \(\phi \) (as in Algorithm 4 Lines 3–6), D (as in Algorithm 8 Lines 3–11), and the objective value f, and updates the candidate set C accordingly (Lines 7–11). In Phase 2, it calculates the exact d-reachable set D(u) of each node \(u \in C\) (Lines 12–13). Then, it utilizes the estimated PPR coverage and the exact diversity of each node \(u \in C\) to obtain tighter upper and lower bounds of f(u) (Line 14). After that, the candidate selection algorithm is invoked again to obtain the new candidate set C (Line 15). In Phase 3, it calculates the exact PPR coverage of the remaining nodes in C and then computes the final objective values of these nodes (Lines 16–18). The node \(u^*\) with the maximum \(\varvec{f}(u)\) is added to the node set S (Lines 19–20).
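
A condensed, illustrative view of one greedy round is sketched below; the callables that tighten the bounds and the exact pruning rule are placeholders for the corresponding steps of Algorithms 4, 8, and 2, so this is a sketch of the three-phase structure rather than the paper's pseudo-code.

```python
def select_next_node(C0, theta_c, f_lo, f_hi, refine_estimates,
                     compute_exact_diversity, compute_exact_coverage, f_exact):
    """Illustrative three-phase selection of one node (marginal-gain greedy).

    f_lo / f_hi return the current lower / upper bounds on the marginal gain
    f(u); the compute_* callables tighten those bounds as described above.
    """
    def prune(C):
        best_lo = max(f_lo(u) for u in C)
        return {u for u in C if f_hi(u) >= best_lo}

    # Phase 1: prune using the estimated coverage/diversity bounds, refining
    # them until at most theta_c candidates remain.
    C = prune(set(C0))
    while len(C) > theta_c:
        refine_estimates(C)              # tighten phi and D bounds (Algs. 4 and 8)
        C = prune(C)
    # Phase 2: exact d-reachable sets for the candidates, coverage still estimated.
    for u in C:
        compute_exact_diversity(u)
    C = prune(C)
    # Phase 3: exact PPR coverage for the few survivors, then pick the best node.
    for u in C:
        compute_exact_coverage(u)
    return max(C, key=f_exact)
```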

The update function invoked in Algorithm 9 Line 21 ensures that, in the next round, the algorithm selects the node with the correct maximum marginal gain in f. Thus, we should update the PPR coverage set (resp. d-reachable set) of each node to eliminate the nodes that are already covered by (resp. d-reachable to) S. Algorithm 10 shows the update process after each round of greedy selection. For each node v that is covered by the selected node \(u^*\), it removes v from the uncovered set \(US_\phi \) and updates the lower-bound estimations \(\phi ^-\), the upper-bound estimations \(\phi ^+\), and the exact \(\phi \) of the other nodes in \({\mathcal {V}}_{train}\) (Lines 1–5). Then, for each node v that is d-reachable to the selected node \(u^*\), it removes v from the unreachable set \(US_D\) and updates the lower-bound estimations \(\varvec{div}^-\), the upper-bound estimations \(\varvec{div}^+\), and the exact D of the other nodes in \({\mathcal {V}}_{train}\) (Lines 6–13). In particular, \(u^*\) is removed from each BKS that contains it. Thus, we obtain the updated values of \(\varvec{div}^-(q)\) and \(\varvec{div}^+(q)\) for each q.
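
One plausible way to organize this bookkeeping is sketched below; the inverted indexes covered_by and reached_by (mapping a node to the candidates whose coverage or d-reachable sets contain it) are our own assumption for illustration, and the bound updates follow the description above rather than the exact pseudo-code of Algorithm 10.

```python
def update_after_selection(u_star, cover, reach, covered_by, reached_by,
                           US_phi, US_D, phi_lo, phi_hi, bks):
    """Illustrative per-round update (a sketch, not the paper's exact code)."""
    # Coverage side: nodes newly covered by u* stop contributing to the
    # marginal PPR coverage (and its bounds) of every other candidate.
    for v in cover[u_star]:
        if v in US_phi:
            US_phi.discard(v)
            for q in covered_by[v]:
                phi_lo[q] -= 1
                phi_hi[q] -= 1
    # Diversity side: nodes newly d-reachable from u* (including u* itself)
    # are removed from every BKS containing them, which in turn refreshes the
    # div- and div+ estimates returned by the sketches.
    for v in reach[u_star]:
        if v in US_D:
            US_D.discard(v)
            for q in reached_by[v]:
                bks[q].delete(v)
```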

Node selection: The complete node selection process of FICOM is shown in Algorithm 11. First, Algorithms 4 and 8 are invoked to obtain the maximum PPR coverage \({\hat{\phi }}\) and the maximum diversity \({\hat{D}}\). Then, the greedy selection algorithm selects B nodes to produce the output node set of FICOM. Note that BKS(v), the coverage bounds \(\phi ^+(v),\phi ^-(v)\), and the diversity bounds \(\varvec{div}^+(v), \varvec{div}^-(v)\) of each node v are shared across the different algorithms.

Algorithm 11 FICOM\((\epsilon ,\theta ,p_f,{\mathcal {V}}_{train},\theta _c,d,B,h)\)
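
Putting the pieces together, a condensed driver for the overall pipeline might look as follows; the four callables stand in for Algorithms 4, 8, 9, and 10, and state bundles the shared sketches and bounds mentioned above.

```python
def ficom(state, B, max_coverage, max_diversity, select_next, update):
    """Illustrative driver mirroring Algorithm 11 (a sketch, not the release code)."""
    phi_hat = max_coverage(state)        # Algorithm 4: maximum PPR coverage
    d_hat = max_diversity(state)         # Algorithm 8: maximum diversity
    S = []
    for _ in range(B):
        u_star = select_next(state, S, phi_hat, d_hat)   # Algorithm 9
        S.append(u_star)
        update(state, u_star)                            # Algorithm 10
    return S
```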

Complexity analysis: Assume that when the algorithm terminates, the total number of sampled nodes for computing the diversity is \(B_f\). Then, the greedy selection algorithm takes \(O(n F (B_f+B \theta _c) + \theta _c B\cdot c_{exact-ppr})\) time. To explain, computing the bottom-k sketches for \(B_f\) nodes takes \(O(n F B_f)\) cost. Besides, we keep \(\theta _c\) nodes in the candidate set in each iteration and run B iterations; in each iteration, we compute the exact diversity set and the exact PPR coverage of every node in the candidate set, which takes \(O\left( \theta _c \cdot B(nF+c_{exact-ppr})\right) \) cost. Adding these costs together, we have the following theorem.

Theorem 12

FICOM has a running cost of:

$$\begin{aligned} O\left( \frac{n\log (1/p_f)}{\epsilon ^2\delta } + n F (B_f+B \theta _c) + \theta _c B\cdot c_{exact-ppr}\right) . \end{aligned}$$

5 Experiments

Table 2 Dataset statistics

We compare our FICOM against alternative solutions on the node classification task. All experiments are conducted on a Linux machine with 384GB memory and an NVIDIA Quadro RTX A6000 48GB GPU.

5.1 Experimental settings

Datasets: We test FICOM and the competitors on six public datasets: four citation networks, Cora, Citeseer [30], Pubmed [26], and Paper100M [17, 37]; one social network, Reddit; and one product co-purchasing network, Products [1]. Reddit and Products are used for inductive node classification tasks [17, 52]. The statistics of the datasets are listed in Table 2.

Specifically, in the four citation datasets Cora, Citeseer, Pubmed, and Paper100M, each node represents a document and each edge denotes a citation link. For Cora and Citeseer, the node feature is the bag-of-words representation of the abstract of each document and the label indicates the topic. For Pubmed, node features are TF/IDF-weighted word frequencies and labels correspond to the types of diabetes studied in the publications. In the Paper100M dataset, each paper has a 128-dimensional feature vector obtained by averaging the word embeddings of its title and abstract. In the Reddit dataset, nodes represent Reddit posts, two posts are linked if the same user comments on both, and node features include word embeddings of post titles, comments, and scores. In the Products dataset, each node is a product sold on Amazon, and each edge denotes that two products are purchased together. The node features are generated by applying PCA to the bag-of-words vectors.

Competitors: We evaluate FICOM against six node selection strategies, including two heuristics and four GNN active learning methods. The two heuristics, Random and Degree, select B nodes uniformly at random and the B nodes with the largest degrees, respectively. The four GNN active learning competitors are AGE [4], FeatProp [46], ALG [53], and GRAIN [54], which are discussed in Sect. 2.4.

Table 3 Hyper-parameters of GNNs on citation networks

Parameter settings: We set the stopping probability \(\alpha = 0.1\), the threshold \(\theta = 0.001\), and the balancing hyper-parameter \(\gamma =0.5\). For the diversity computation, we set the reachability parameter \(d=0.5\), and \(\epsilon =0.05, p_f=0.125\) for the BKS. We use grid search to choose the propagation step \({\mathcal {K}}\) from \(\{0, 4, 8\}\). We conduct experiments on four well-known GNNs, GCN, SGC, APPNP, and GCNII, to evaluate the effectiveness and generalization ability of the different GNN active learning approaches. We obtain the source code of all competitors and GNN models from the authors' GitHub repositories and use their default settings unless otherwise specified.
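
For quick reference, the default configuration above can be collected into a small dictionary; the key names are our own shorthand and not identifiers from the released code.

```python
# Default FICOM hyper-parameters as stated above (key names are illustrative).
FICOM_DEFAULTS = {
    "alpha": 0.1,        # PPR stopping probability
    "theta": 0.001,      # PPR coverage threshold
    "gamma": 0.5,        # balance between coverage and diversity
    "d": 0.5,            # embedding-space reachability parameter
    "eps": 0.05,         # BKS approximation parameter
    "p_f": 0.125,        # BKS failure probability
    "K_grid": [0, 4, 8], # candidate propagation steps for grid search
}
```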

Table 3 shows the hyper-parameters of each GNN on the citation networks. For GCN on the Reddit and Products datasets, we use the same hyper-parameters as [54].

5.2 Semi-supervised node classification

In this section, we compare our FICOM against the other competitors on semi-supervised node classification tasks. For the citation networks, we apply the standard fixed validation/testing split [19], i.e., 500 nodes for validation and 1000 nodes for testing; for the Reddit and Products datasets, we apply the standard fixed validation/testing split used in [13, 17, 52]. Let c denote the number of classes; we follow the label rate of [5, 13, 19, 20, 44] and set the training budget \(B=20c\) [4, 53], e.g., \(B=140\) for Cora, which has 7 classes. For each method, only nodes in the training set are visible for labeling. All GNN models are implemented in PyTorch [29], and all results are averaged over 10 trials.

Results on GCN: Table 4 reports the mean classification accuracy with the standard deviation of the different GNN active learning approaches on GCN. We make the following observations. Firstly, all the GNN active learning approaches outperform the straightforward heuristics, i.e., selecting B nodes at random or selecting the B nodes with the largest degrees, demonstrating the necessity of designing dedicated GNN active learning approaches. Secondly, our FICOM achieves the best classification accuracy on all three benchmark datasets. In particular, compared with the second-best GNN active learning approach, our FICOM leads by \(1.91\%\) on Cora, \(2.25\%\) on Citeseer, and \(0.91\%\) on Pubmed. Moreover, our FICOM leads every method other than AGE by more than \(1.70\%\) on Pubmed, while leading AGE, the second-best method on Pubmed, by more than \(3.5\%\) on both Cora and Citeseer. These observations demonstrate the effectiveness of our proposed FICOM.

Results on different GNNs: Since GCN is regarded as a shallow GNN model, to evaluate the generalization ability of different GNN active learning methods, we further conduct experiments on three GNN models, including two GNN models that decouple the feature propagation from non-linear transformation, i.e., SGC and APPNP, and one deep GNN model, GCNII.

Table 4 Node classification accuracy (%) on GCN
Table 5 Node classification accuracy (%) on SGC

Tables 5, 6 and 7 show the mean classification accuracy with the standard deviation of the different GNN active learning methods on the three GNN models. As we can observe, our FICOM consistently achieves the best performance on all datasets. Specifically, our FICOM takes the lead by \(1.16\%\)–\(1.86\%\) on SGC, \(1.47\%\)–\(1.93\%\) on APPNP, and \(0.19\%\)–\(3.52\%\) on GCNII. When APPNP is used as the backbone GNN model, our FICOM achieves \(85.90\%\) classification accuracy on Cora with the standard validation/testing split, which is even better than the current state-of-the-art result. Besides, the competitors become unstable when the deeper GNN model, GCNII, is used for the downstream classification task. These experimental results empirically demonstrate the superior generalization ability of our FICOM.

Inductive node classification on GCN: To evaluate the efficiency of our FICOM, we conduct experiments on two larger graphs, Reddit with 11 million edges and Products with 61 million edges. Table 8 reports the average classification accuracy and the average running cost of selecting one node for each method. We omit a competitor if it cannot finish the selection within 24 h or runs out of memory. Notice that we have to remove a vast number of low-degree nodes, as suggested by [54], so that FeatProp and GRAIN can finish the node selection process within 24 h without running out of memory, whereas our FICOM can still select nodes from the whole training set \({\mathcal {V}}_{train}\).

As we can observe, our FICOM consistently outperforms the other competitors on these two large graphs in terms of accuracy. The aggressive removal of nodes significantly degrades the accuracy of FeatProp and GRAIN, making them no better than a random selection strategy. Besides, FICOM is 5.34x faster than GRAIN and 10.21x faster than FeatProp on the Reddit dataset, and 5.58x faster than GRAIN and 3.39x faster than FeatProp on the Products dataset, demonstrating the effectiveness and efficiency of our FICOM.

Table 6 Node classification accuracy (%) on APPNP
Table 7 Node classification accuracy (%) on GCNII
Table 8 Node classification accuracy (%) and the average selection time per node (in seconds)

One may argue that random selection is already quite effective on large-scale networks. We further demonstrate the effectiveness of our solution by varying the labeling budget from 5c to 20c. Figure 1 shows the classification results of random selection and FICOM with varying labeling budget B. We omit the other baselines as they already show inferior performance and are extremely slow. Compared with the random selection strategy, our FICOM boosts the accuracy significantly, from around 85% (resp. 50%) to more than 88% (resp. 57%) on the Reddit (resp. Products) dataset with a budget of 5c. In particular, our FICOM achieves results comparable to random selection while requiring only half of the labeling budget. Compared with the results reported in [53, 54], our FICOM still achieves results comparable to the state-of-the-art GNN active learning methods.

Fig. 1 Node classification accuracy: varying budget B

Table 9 Node classification accuracy (%) and the average selection time per node (in seconds)
Fig. 2 Node classification accuracy: varying budget B

Evaluation on large dataset: To further validate the scalability of our FICOM, we conduct experiments on the Paper100M dataset, which consists of over 111 million nodes. Since GCN cannot scale to large graphs, in this set of experiments, we use SGC as the backbone GNN model. The experimental results are presented in Table 9 and Fig. 2.

Table 9 reports the average classification accuracy and the average running cost of selecting one node for each method. As we can observe, our FICOM consistently outperforms the other competitors on the large Paper100M graph in terms of node classification accuracy. Note that the publicly available implementation of GRAIN requires pruning a substantial number of nodes during the pre-processing step for this dataset. In contrast, our FICOM can select nodes from the entire training set \({\mathcal {V}}_{train}\), which is two orders of magnitude larger than that of GRAIN. The aggressive removal of nodes significantly degrades the accuracy of GRAIN, rendering it no better than a random selection strategy. Furthermore, FICOM is 8.17x faster than GRAIN on the Paper100M dataset.

Figure 2 illustrates the classification results of random selection and FICOM with varying labeling budget B on the Paper100M dataset. We omit the other baselines due to their inferior performance. The results clearly demonstrate that, compared with the random selection strategy, our FICOM boosts the accuracy significantly, from around 40% to over 43%, when given a budget of 5c.

These empirical evaluations again validate the effectiveness and efficiency of our FICOM.

5.3 Analysis

Balanced labeling set: Our FICOM consistently achieves the best classification results compared with the other GNN active learning methods and achieves state-of-the-art performance on almost all datasets. Interestingly, we observe a slight performance degradation on the Cora dataset with GCNII compared to the standard split and label selection method, which selects a balanced number of labels for each class. To verify whether our FICOM can achieve state-of-the-art performance when the labels are selected in a class-balanced manner, we conduct another set of experiments as follows. We assume that we can select an equal number of labels for each class, i.e., 20 nodes per class, as the standard method does. Let FICOM-B denote the FICOM node selection method with the balanced label constraint. In particular, we greedily select nodes with our FICOM algorithm and discard a node v if the number of labeled nodes with the same class as v has already reached 20.
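
The balanced variant reduces to a simple per-class quota on top of the greedy ranking, as in the following sketch; greedy_order and label_of are assumed inputs (the ranking produced by FICOM and a label oracle, which FICOM-B presumes is available), and re-computing marginal gains after a discard is omitted for brevity.

```python
from collections import Counter

def ficom_balanced(greedy_order, label_of, per_class=20, budget=None):
    """Illustrative FICOM-B: keep greedily ranked nodes, skipping full classes."""
    counts = Counter()
    selected = []
    for v in greedy_order:
        c = label_of(v)
        if counts[c] >= per_class:
            continue                 # discard v: its class already has 20 labels
        selected.append(v)
        counts[c] += 1
        if budget is not None and len(selected) >= budget:
            break
    return selected
```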

Tables 10 and 11 show the node classification results on Cora. As we can observe, FICOM-B achieves \(86.27\%\) accuracy on APPNP and \(85.56\%\) accuracy on GCNII, boosting the model performance and achieving the best accuracy. Notice that the balanced information is only beneficial for deep models, while shallow models like GCN and SGC achieve their best performance without the balanced constraint, which demonstrates the effectiveness of our FICOM. In addition, the assumption that we can obtain a balanced number of labeled nodes for each class is impractical, since we cannot know the label information in advance. One could discard labeled nodes that exceed the per-class quota; however, such a strategy is wasteful, as these labeled nodes could otherwise be used to train the model and improve performance.

Besides, Fig. 3 illustrates the number of nodes for each class selected by both FICOM and FICOM-B on the Cora dataset. We can observe that the labels selected by FICOM are still diversified and relatively balanced among different classes. This demonstrates the effectiveness of our node selection strategy.

Fig. 3 Number of selected nodes per class on Cora

Table 10 Balanced labeling: Node classification accuracy (%) on Cora with different GNNs
Table 11 Balanced labeling: Node classification accuracy (%) on Cora with varying depth of GCNII
Table 12 Ablation study on GCN
Fig. 4 Node classification accuracy: varying d

Parameter sensitivity: We conduct experiments to analyze the effect of the reachability parameter d and the stopping probability \(\alpha \). Figures 4 and 5 show the node classification accuracy on Citeseer and Reddit as we vary d from 0.25 to 1.0 and \(\alpha \) from 0.1 to 0.5, respectively. The reachability parameter d controls the embedding diversity: as d approaches 0.25, each node has fewer d-reachable neighbors, while as d approaches 1.0, almost all node pairs become d-reachable. Both extremes result in indistinguishable diversity scores and thus performance degradation. Meanwhile, as shown in Fig. 5, FICOM is not sensitive to the stopping probability \(\alpha \).

Fig. 5 Node classification accuracy: varying \(\alpha \)

Ablation study: Next, we show the importance of considering both PPR coverage and embedding diversity. We use FICOM-C (resp. FICOM-D) to denote the variant that only considers PPR coverage (resp. diversity). Table 12 lists the results of the three approaches on the three citation datasets using GCN. As we can observe, FICOM-C and FICOM-D show sub-optimal results in most scenarios, while our FICOM, which combines both components, achieves the best performance. This shows the effectiveness of our framework.

6 Conclusion

In this paper, we present FICOM, an effective active learning framework for GNNs. FICOM formulates node selection as a novel diversified PPR coverage maximization problem. To solve this problem, we present an efficient approximation algorithm that scales our solution to large networks. Extensive experiments on six benchmark datasets using different GNNs demonstrate the efficiency and effectiveness of our FICOM.