1 Introduction

Graph Neural Networks (GNNs) have emerged as powerful tools across a wide range of domains, such as social network analysis (Li et al. 2017; Wu et al. 2018; Yu et al. 2020), recommendation systems (Ying et al. 2018; Wu et al. 2022; Fan et al. 2019), and drug discovery (Shi et al. 2020; Bongini et al. 2021). Their ability to capture intricate relationships within graph-structured data has driven significant advancements in the fields of machine learning and artificial intelligence (Li et al. 2022). As the application of GNNs continues to expand, the need for effective testing and evaluation methods becomes increasingly critical.

As with traditional deep neural networks (DNNs), testing GNNs is challenging due to the lack of automated testing oracles (Dang et al. 2023). As a result, labeling GNN test inputs relies heavily on manual annotation, which can be expensive and time-consuming, especially when dealing with large and intricate graphs. Furthermore, in specialized domains such as molecular property prediction (Duvenaud et al. 2015), where nodes represent atoms and edges represent covalent bonds, the labeling process can depend heavily on domain-specific knowledge, further increasing the cost.

In the literature (Ma et al. 2021; Chen et al. 2020; Wang et al. 2021), a promising approach for mitigating labeling costs is test selection. It focuses on the selection and labeling of a subset of data from the entire test set. Within the field of DNN testing, various test selection techniques have emerged. These techniques can be broadly classified into two categories: 1) test selection for rapid detection of potentially misclassified tests (Ma et al. 2021; Feng et al. 2020) and 2) test selection for precise accuracy estimation (Chen et al. 2020; Li et al. 2019). For simplicity, we refer to these approaches as misclassification detection approaches and accuracy estimation approaches, respectively.

Misclassification detection approaches are designed to identify test inputs that are most likely to be misclassified by the DNN model. These selected inputs serve two primary purposes: facilitating the debugging of DNN-based software and retraining the original DNN model to enhance its accuracy (Feng et al. 2020). In the literature, there are three main methods for misclassification detection: 1) Coverage-Based Methods (Ma et al. 2018; Pei et al. 2017): These methods assess the coverage of DNN neurons to identify potentially misclassified test inputs; 2) Surprise Adequacy-Based Methods (Kim et al. 2019): These techniques select test inputs using metrics related to surprise adequacy and activation traces within DNNs; 3) Confidence-Based Approaches (Feng et al. 2020; Ma et al. 2021; Weiss and Tonella 2022): These methods select tests based on the model’s prediction confidence; test inputs about which the DNN model is more uncertain are selected. Notably, confidence-based metrics have proven to be more effective and efficient than both surprise adequacy-based and coverage-based approaches, with runtime typically taking less than 1 second in most cases (Feng et al. 2020).

Accuracy estimation approaches aim to select a small set of test inputs to precisely estimate the accuracy of the whole testing set. By only labeling the selected representative tests, it becomes feasible to reduce the labeling expenses. However, existing approaches designed for DNNs, like CES (Li et al. 2019) and PACE (Chen et al. 2020), are not suitable for GNNs due to their design not aligning with graph datasets.

GNNs fundamentally belong to the family of DNN algorithms. They inherit several core concepts from DNNs, such as deep architectures, nonlinear activation functions, and backpropagation. Therefore, several existing DNN test selection approaches can, in principle, be applied to GNNs. However, there is a significant gap in adapting DNN test selection methods for GNNs: unlike DNNs, where each sample in the test set is treated independently, GNNs exhibit interdependencies among their test inputs (nodes) (Wu et al. 2020). Consequently, it remains unclear whether test selection approaches designed for DNNs can be effectively applied to GNNs, making it crucial to investigate their effectiveness in this context. To fill this gap, we conduct an empirical study that evaluates the effectiveness of test selection methods when applied to GNNs. Our research focuses on four key aspects:

  • Test Selection for Misclassification Detection As previously mentioned, confidence-based metrics have demonstrated higher effectiveness and efficiency compared to other existing test selection approaches (Feng et al. 2020). Therefore, we specifically evaluate the effectiveness of various confidence-based test selection approaches for selecting potentially misclassified GNN test inputs.

  • Test Selection for Accuracy Estimation We investigate the effectiveness of various clustering methods for GNN test selection. We extend the concept of model confidence to accuracy estimation by having the clustering approaches use the model’s prediction probability vector for each test (which reflects model confidence) as the input features for clustering.

  • Test Selection for Performance Enhancement (using confidence-based approaches) We investigate the effectiveness of various confidence-based test selection methods, encompassing both approaches for misclassification detection and accuracy estimation, in selecting retraining inputs to enhance the accuracy of GNNs.

  • Test Selection for Performance Enhancement (using approaches based on node importance) We investigate the effectiveness of node importance-based test selection methods in selecting retraining inputs to improve GNN accuracy. This exploration is motivated by three factors: 1) Nodes with high importance typically encapsulate critical information and exert a more pronounced influence over the entire graph. Therefore, these nodes are more likely to capture essential information crucial for enhancing model performance (Park et al. 2019); 2) Unimportant nodes can contain noise or irrelevant data that can introduce interference during retraining, thereby diminishing model performance; 3) Node importance is a unique data feature in GNNs that can be leveraged for selecting crucial tests. To the best of our knowledge, there has been limited or no study investigating whether node importance can be effectively used for selecting retraining inputs, highlighting the necessity of conducting relevant research.

Building upon these four critical aspects, we perform an empirical study that encompasses 7 graph datasets and 8 GNN models, systematically evaluating the performance of 22 test selection approaches. To offer a more comprehensive evaluation, we incorporate not only node classification datasets (Yang et al. 2016) but also graph classification datasets (Riesen and Bunke 2008; Bianchi et al. 2021; Neumann et al. 2016) in our analysis. Our empirical findings reveal that while certain test selection methods demonstrate efficacy in the context of DNNs (Ma et al. 2021), they do not translate to the same level of effectiveness when applied to GNNs. We delve into the underlying reasons for this disparity in the experimental section. To provide a concise summary, we present the following key conclusions.

  • Test Selection for Misclassification Detection In the context of GNNs, confidence-based test selection methods do not exhibit the same level of effectiveness as observed in DNNs.

  • Test Selection for Accuracy Estimation In most cases, clustering-based test selection methods that utilize the model’s confidence vector perform better than random selection. However, their improvements compared to random selection are slight.

  • Test Selection for Performance Enhancement (using confidence-based approaches) Both confidence-based and clustering-based test selection methods yield only slight improvements over random selection when selecting retraining inputs to improve GNN accuracy, even though some of these methods have been shown to perform well for DNNs (Hu et al. 2021).

  • Test Selection for Performance Enhancement (using node importance-based approaches) Node importance-based test selection methods are not suitable for selecting retraining data to improve GNN accuracy, and in many cases, they even perform worse than random selection.

Our empirical study provides valuable insights for engineers seeking to apply test selection metrics in GNN contexts. We highlight the limitations of current test selection approaches for GNNs, thereby providing guidance for future research to develop new approaches tailored to GNNs. Our datasets, results, and tools are accessible to the community on GitHub.

In summary, we make the following contributions in this paper:

  • We conduct an empirical study to assess the effectiveness of confidence-based test selection methods in identifying potentially misclassified test inputs for GNNs. Our study reveals that confidence-based test selection methods, which perform well in DNNs, do not demonstrate the same level of effectiveness.

  • We empirically investigate the effectiveness of clustering approaches that utilize model confidence vectors in estimating GNN accuracy. We demonstrate that clustering-based methods, while consistently performing better than random selection, provide only slight improvements.

  • We investigate the effectiveness of misclassification detection approaches and accuracy estimation approaches in selecting retraining inputs to improve GNN accuracy. We find that test selection methods, such as confidence-based and clustering-based test selection methods, demonstrate only slight effectiveness.

  • We investigate the effectiveness of test selection methods based on node importance in selecting retraining inputs to improve the GNN accuracy. The results show that node importance-based test selection methods are not suitable, and in many cases, they even perform worse than random selection.

2 Background

In this section, we present the fundamental domain concepts central to our research. These encompass Graph Neural Networks, Test Selection in DNN Testing, and Active Learning.

2.1 Graph Neural Networks

Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in addressing machine learning challenges associated with graph-structured data (Zhou et al. 2020; Fan et al. 2019; Sun et al. 2019). These challenges span a variety of domains, including social networks (Li et al. 2017; Wu et al. 2018; Yu et al. 2020), recommendation systems (Ying et al. 2018; Wu et al. 2022; Fan et al. 2019) and bioinformatics (Zhang et al. 2021; Long et al. 2022; Réau et al. 2023). In Fig. 1, we present a general pipeline for GNN models, which includes four main parts: 1) The GNN model receives graph-structured inputs, which can contain nodes and edges (representing the connections between nodes). 2) GNN layers then process this graph-structured data. 3) After multiple layers of processing, the GNN model can generate node/edge/graph embedding vectors. These are low-dimensional vector representations of node/edge/graph, facilitating efficient processing and analysis by GNN models. 4) Utilizing these embedding vectors, the GNN model can address tasks at the node-level, edge-level, or graph-level correspondingly.

Fig. 1 The general pipeline for GNN models

In the following, we introduce some fundamental concepts related to GNNs and graph datasets.

Graphs A graph can be formally represented as \(G = (V, E)\), with V representing the set of nodes and E denoting the set of edges that establish connections between these nodes (Dwivedi et al. 2020). Graphs are widely used in various domains (Liu et al. 2020; Jin et al. 2020; Zhou et al. 2020). For instance, in citation networks (Veličković et al. 2017), papers can be represented as nodes linked by citations (edges) and grouped into different categories. In chemistry (Wieder et al. 2020), molecules can be viewed as graphs with atoms as nodes and covalent bonds as edges, simplifying the representation of their 3D structures.

Graph Analytics Tasks GNNs can leverage graph structure and node features to perform various analytics tasks. 1) Node-level classification (Xiao et al. 2022; Zhao et al. 2021), such as categorizing nodes into distinct classes, utilizes individual node predictions. Prevalent datasets for such tasks include Cora (Sen et al. 2008), CiteSeer (Sen et al. 2008), and PubMed (Sen et al. 2008). 2) Graph-level classification aims to determine attributes of an entire graph, such as predicting molecular properties in a chemical graph. Datasets for these tasks include Mutagenicity (Riesen and Bunke 2008), NCI1 (Shervashidze et al. 2011), and MSRC21 (Neumann et al. 2016). 3) Edge-level classification focuses on classifying the type of the edge between two given nodes. For example, in biological networks, GNNs can use the information of a protein and a small molecule to predict their binding affinity, with such interactions represented as edges within a graph. Datasets for edge classification include DrugBank (Wishart et al. 2018) and BindingDB (Liu et al. 2007).

Graph Embeddings (Cai et al. 2018) offer an approach to diminish the dimensions of nodes, edges, and their related attributes while preserving vital structural information and graph characteristics (Fu et al. 2020). In graph embedding, each node or edge is mapped to a vector, typically in low dimensions. This low-dimensional representation effectively captures the relationships and similarities between nodes or edges, enabling more efficient computation and analysis within the vector space.

Message Passing The fundamental concept behind Message Passing in GNNs is to enhance the representation of individual nodes by propagating and aggregating information among neighboring nodes, as described in Wu et al. (2020). For example, computing the representation of a node N at time step k involves two steps: 1) Gathering information from neighbors: compute the sum of the messages from all neighboring nodes of node N. 2) Updating the node representation: combine the aggregated messages with the representation of node N at time step \((k-1)\) to obtain the representation of node N at time step k.
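
To make this concrete, the following minimal sketch (our own illustration, not code from the study) implements one round of message passing with sum aggregation and a simple nonlinear update; the function name, weight matrices, and the use of tanh are illustrative assumptions.

```python
import numpy as np

def message_passing_step(h, neighbors, W_self, W_neigh):
    """One round of message passing with sum aggregation (illustrative sketch).

    h:         (num_nodes, dim) node representations at time step k-1
    neighbors: dict mapping each node index to a list of its neighbor indices
    W_self, W_neigh: (dim, dim) weight matrices of a simple update function
    """
    h_new = np.zeros_like(h)
    for v in range(h.shape[0]):
        # 1) Gather information: sum the messages sent by v's neighbors.
        msg = np.zeros(h.shape[1])
        for u in neighbors[v]:
            msg += h[u]
        # 2) Update: combine the aggregated message with v's state at step k-1.
        h_new[v] = np.tanh(h[v] @ W_self + msg @ W_neigh)
    return h_new
```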

Applying GNNs in Software Engineering GNNs can be applied to various aspects of the field of software engineering. One prevalent application lies in software vulnerability detection (Cheng et al. 2021, 2022). Cheng et al. (2021) proposed DeepWukong, a novel approach for software vulnerability detection, which utilizes GNNs to encode code fragments into a concise low-dimensional representation. Initially, DeepWukong extracts program slices from code fragments, labeling a slice (or an XFG) as vulnerable if it contains a vulnerable statement. Subsequently, a neural network model is trained using both safe and vulnerable program slices. Both the unstructured and structured code information of a program are incorporated when constructing the neural networks, with both types of information fed into the GNNs to generate a compact code representation in the latent feature space. By leveraging recent advancements in GNNs to learn from vulnerable and safe program slices, DeepWukong enables more precise bug prediction. Cheng et al. (2022) proposed ContraFlow, which overcomes limitations of previous GNN-based software vulnerability detection methods by focusing on preserving value flow paths rather than the entire graph. By employing contrastive learning, ContraFlow efficiently selects feasible value-flow paths in the embedding space to represent a code fragment accurately. ContraFlow can identify potential error paths based on path-sensitive representations and interpret crucial value flow paths causing vulnerabilities.

2.2 Test Selection in DNN Testing

In the context of DNN testing (Haq et al. 2021; Panichella et al. 2017; Dang et al. 2024; Li et al. 2023), test selection (Ma et al. 2021; Hu et al. 2021) focuses on addressing a practical concern: while collecting unlabelled data is easy and cost-effective, labeling all of it demands substantial effort and specialized domain knowledge. This challenge is typically exacerbated by three key factors: 1) Large-Scale Test Sets: Test sets can be extensive, increasing the workload associated with labeling. 2) Manual Analysis as the Primary Labeling Method: The primary method of labeling involves manual analysis, typically requiring the involvement of multiple individuals to ensure accurate labeling. 3) Dependency on Domain-Specific Knowledge: Labeling frequently necessitates domain-specific expertise, resulting in higher costs associated with employing professionals for the task.

Test selection has emerged as a practical solution for dealing with the labelling cost issue. It involves carefully selecting a subset of unlabeled test data to serve two main objectives: testing DNNs and improving the performance of pre-trained DNNs through retraining. Test selection can be broadly categorized into two main aspects:

  • Misclassification Detection (Ma et al. 2021; Feng et al. 2020) This aspect focuses on selecting test inputs that are more likely to be misclassified by the DNN model. These tests are more likely to reveal errors in the DNN model and are therefore referred to as “bug-revealing test inputs”. Labeling only these test data can lead to reduced overall labeling costs. Furthermore, in active learning contexts, this test data can then be utilized to enhance the model through retraining (Hu et al. 2021).

  • Accuracy Estimation (Chen et al. 2020; Li et al. 2019) This aspect involves selecting a small set of representative test inputs capable of precisely estimating the accuracy of the entire testing dataset. By labeling only these representative tests, it becomes possible to estimate the accuracy of the entire test set, thus reducing labeling costs.

2.3 Active Learning

Active learning is a well-established concept within both the software engineering (SE) and machine learning (ML) communities (Hu et al. 2021). The fundamental idea behind active learning is to employ machine learning techniques to identify data samples that are relatively challenging to classify (Ren et al. 2021). These samples are then presented for human annotation. The annotated data is subsequently used to further train the target ML models to improve the model’s performance. The primary objective of active learning is to determine which samples should be prioritized for manual labelling, enabling the model to actively select the informative data to train the model (Ranganathan et al. 2017). Existing work (Weiss and Tonella 2022) has demonstrated that test selection methods can be employed for active learning. Weiss and Tonella (2022) empirically investigated the effectiveness of various DNN test selection techniques (e.g., DeepGini and Entropy) in identifying inputs potentially useful for active learning. Their study shows that DeepGini, along with several uncertainty-based methods, can effectively select informative inputs in the context of active learning.

3 Approach

In our study, we assessed a total of 22 approaches, comprising 9 test selection methods for misclassification detection, 5 test selection approaches for accuracy estimation, 7 node importance metrics, and one baseline method (i.e., random selection). We selected these approaches for the following reasons: 1) These approaches are adaptable to the corresponding GNN test selection task. For example, DeepGini, as highlighted in its original paper (Feng et al. 2020), can be used to identify potentially misclassified test inputs; 2) The selected approaches have demonstrated their effectiveness in the context of DNNs (Feng et al. 2020; Ma et al. 2021); 3) The authors of these approaches have made their implementations publicly available. Below, we provide a detailed explanation of the basic logic behind each test selection method.

3.1 Misclassification Detection Approaches

We employed a total of 10 test selection methods that can be used to detect potentially misclassified GNN tests. One of the classic methods is DeepGini (Feng et al. 2020). Moreover, our empirical study also evaluated several active learning-based test selection strategies (Wang and Shang 2014), including Margin Sampling, Least Confidence, and Entropy. Active learning (Hu et al. 2021) focuses on maximizing model performance gains with minimal sample labeling. Specifically, it aims to select the most valuable samples within an unlabeled dataset and hand them over to the oracle (e.g., a human annotator) for labeling, thereby reducing labeling costs while maintaining model performance. Below, we provide a detailed introduction to the test selection approaches we evaluated; a short code sketch of the confidence-based scores follows the list.

  • DeepGini (Feng et al. 2020) DeepGini quantifies the uncertainty in a model’s prediction for a given test by calculating the Gini score of this test. This score is derived from the model’s prediction probability vector for the test. A higher Gini value indicates that the model is more uncertain on the specific test. Therefore, the test is considered more likely to be misclassified. The computation of the Gini score is illustrated in Formula (1).

    $$\begin{aligned} G(t)=1-\Sigma _{i=1}^N p_{t, i}^2 \end{aligned}$$
    (1)

    where N represents the number of prediction classes, and \(p_{t, i}\) represents the probability that the model will classify the test t into class i.

  • Margin Sampling (Wang and Shang 2014) Margin sampling is an uncertainty-based active learning strategy. Its core idea is to select samples that the model finds most challenging to classify for labeling. Margin Sampling focuses on the difference in the model’s predicted probabilities for the two most confident classes. The smaller this probability gap, the more uncertain the model is about the classification of that sample. The uncertainty score of Margin Sampling is calculated by Formula (2).

    $$\begin{aligned} Margin(t)=p_{k}(t)-p_{j}(t) \end{aligned}$$
    (2)

    where \(p_{k}(t)\) refers to the model’s predicted probability for the most confident class, and \(p_{j}(t)\) refers to the model’s predicted probability for the second most confident class.

  • Least Confidence (LC) (Wang and Shang 2014) Least Confidence is an active learning strategy based on model uncertainty. Specifically, it selects samples for which the model’s prediction is the least confident for labeling. In a classification task, if a model has a low maximum predicted probability value for a specific unlabeled sample, it indicates that the model is highly uncertain about the classification of that sample. The Least Confidence strategy selects such samples for labelling. The score of Least Confidence is computed using Formula (3).

    $$\begin{aligned} L(t) = 1- \max _{i=1: n} p_{i}(t) \end{aligned}$$
    (3)

    where \(p_i(t)\) represents the probability of test input t being classified into category i. Hence, \(\max _{i=1: n} p_{i}(t)\) represents the model’s predicted probability for the most confident classification.

  • Least Confidence-variant (LC-variant) (Wang and Shang 2014) In contrast to the Least Confidence metric, which ranks tests based on the model’s most confident prediction, the Least Confidence-variant assesses uncertainty by focusing on the model’s least confident prediction category. This variant considers that when the model’s prediction probability for the least confident classification is far from 0, the model is more uncertain about this test, and the test is more likely to be misclassified. The formula for this variant is provided in Formula (4). The rationale behind this variant is rooted in the concept of uncertainty, as discussed in previous studies (Feng et al. 2020). Specifically, considering a classifier M capable of classifying test inputs into N categories, when the prediction probability vector of M for a test t is (\(\frac{1}{N}\), \(\frac{1}{N}\), ..., \(\frac{1}{N}\)), classifier M is the most uncertain about this test t. Since the highest value that the model’s prediction probability can reach for its least confident classification is \(\frac{1}{N}\), a higher prediction probability for the least confident category means that the model’s confidence for this category approaches \(\frac{1}{N}\). This suggests that the model exhibits greater uncertainty when predicting this test case, and the test is therefore considered more likely to be misclassified.

    $$\begin{aligned} L(t) = \min _{i=1: n} p_{i}(t) - 0 \end{aligned}$$
    (4)

    where \(p_i(t)\) represents the probability of test input t being classified into category i. Hence, \(\min _{i=1: n} p_{i}(t)\) represents the model’s predicted probability for the least confident classification.

  • Entropy (Weiss and Tonella 2022) Entropy is a commonly used method in active learning. It can measure the uncertainty of a model’s predictions for a given sample. The entropy method selects samples by calculating the entropy value of the model’s predictions for each unlabeled sample. For a given sample, a high entropy value indicates that the model is highly uncertain about the classification of that sample. Therefore, this strategy tends to select samples with high entropy values for labeling, with the aim of improving the model’s performance by adding information from these highly uncertain samples.

  • Multiple-Boundary Clustering and Prioritization (MCP) (Shen et al. 2020) MCP is an extension of Margin Sampling. It begins by dividing the data into distinct “boundary areas” based on the top-2 predicted classes. Then, MCP selects data points from each area based on the Margin. The selected data points are considered to be tests for which the model exhibits a higher degree of uncertainty. These tests are considered more likely to be misclassified.

  • Variance (Ma et al. 2021) For a given test case, Variance quantifies the uncertainty in the model’s predictions by computing the variance of the model’s prediction probabilities for that specific test. A smaller variance suggests that the model exhibits greater uncertainty regarding this test, and this test is considered more likely to be misclassified. The formula for Variance is provided in Formula (5).

    $$\begin{aligned} {\text {Var}}(t)=\frac{1}{N} \sum _{i=1}^N \left( p_i(t)-\bar{p}(t)\right) ^2 \end{aligned}$$
    (5)

    where N represents the number of prediction classes, \(p_i(t)\) represents the probability of test t being classified into category i, and \(\bar{p}(t)\) denotes the mean of these probabilities.

  • ATS (Gao et al. 2022) ATS is the first adaptive test selection method designed for DNNs, which utilizes differences in model outputs to measure the diversity of behaviors of DNN test inputs. The objective of ATS is to select more diverse tests from the candidate set, as these tests can reveal more different faults in the DNN-driven software.

  • GraphPrior (Dang et al. 2023) GraphPrior is a test prioritization method specifically designed for GNNs. It utilizes mutation testing to prioritize potentially misclassified test inputs. Specifically, given a test set and a GNN model under testing, GraphPrior generates mutated models based on the original GNN model. GraphPrior assumes that a test input is more likely to be misclassified if it can “kill” many mutated models. Based on this assumption, it identifies and prioritizes possibly misclassified tests.

  • Random selection (Elbaum et al. 2002) Through the baseline random selection, tests are selected randomly from the test set.
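
To make the confidence-based scores above concrete, the sketch below computes them from a model’s prediction probability (softmax) vectors with NumPy. The function name and the dictionary of scores are our own illustrative choices, not part of the study’s released artifacts; note that for Margin and Variance, smaller values indicate higher uncertainty, whereas for the other scores larger values do.

```python
import numpy as np

def uncertainty_scores(probs):
    """Compute confidence-based uncertainty scores for each test input.

    probs: array of shape (num_tests, num_classes), each row a softmax
           prediction probability vector.
    """
    sorted_p = np.sort(probs, axis=1)[:, ::-1]                 # per-row descending
    gini = 1.0 - np.sum(probs ** 2, axis=1)                    # DeepGini, Formula (1)
    margin = sorted_p[:, 0] - sorted_p[:, 1]                   # Margin, Formula (2); smaller = more uncertain
    least_conf = 1.0 - sorted_p[:, 0]                          # Least Confidence, Formula (3)
    lc_variant = probs.min(axis=1)                             # LC-variant, Formula (4)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # Entropy
    variance = probs.var(axis=1)                               # Variance, Formula (5); smaller = more uncertain
    return {"gini": gini, "margin": margin, "least_confidence": least_conf,
            "lc_variant": lc_variant, "entropy": entropy, "variance": variance}

# Example: rank tests so that the most uncertain ones come first.
probs = np.array([[0.34, 0.33, 0.33],   # highly uncertain prediction
                  [0.90, 0.05, 0.05]])  # confident prediction
scores = uncertainty_scores(probs)
ranking = np.argsort(-scores["gini"])   # descending Gini: most uncertain first
print(ranking)                          # -> [0 1]
```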

3.2 Accuracy Estimation Approaches

The aforementioned confidence-based methods rely on the model’s prediction probability vector to assess whether a test is prone to being incorrectly predicted. These methods are efficient and consume minimal time since they only use the model’s final prediction probability vector and mathematical approaches for estimating uncertainty. Based on existing research (Chen et al. 2020), clustering is a practical approach for test selection to estimate the accuracy of a test set. Clustering groups similar data points together, allowing for the extraction of representative points from each cluster, which can effectively represent the entire test set. Therefore, we empirically explore the combination of clustering methods with prediction probability vectors for test selection in the context of GNNs to estimate the accuracy of the test set. Below, we introduce all the clustering methods used in our study.

  • K-Means (Ahmed et al. 2020) K-Means is an unsupervised clustering algorithm. The algorithm initially divides the data into K groups and randomly selects K objects as the initial cluster centers. It then computes distances between each point and all the cluster centers, assigning each point to the closest center. Subsequently, the algorithm recalculates the centroid of each cluster. This process continues to iterate until a specific termination condition is met.

  • K-Means Plus (Arthur and Vassilvitskii 2007) K-Means Plus is an extension of the K-Means algorithm, primarily enhancing the way initial cluster centers are chosen. In the traditional K-Means algorithm, initial cluster centers are randomly chosen, which can lead to different results in different runs and can affect the algorithm’s convergence speed and clustering quality. K-Means++ addresses this issue by intelligently selecting the initial cluster centers, aiming to enhance the algorithm’s performance.

  • MiniBatch K-means (Sculley 2010) MiniBatch K-means is an optimized variant of the K-Means algorithm designed for efficiently handling large-scale data, reducing computational time. It utilizes mini-batches, which are small, random, fixed-size data subsets, to manage data in memory. During each iteration, the algorithm gathers a random sample of the data and employs it to update the clusters.

  • Gaussian Mixture Model (GMM) (Patel and Kushwaha 2020) The Gaussian Mixture Model is a probabilistic model that posits that all data points are generated by a mixture of finite Gaussian distributions with unknown parameters. It can be thought of as an extension of K-means clustering that incorporates information about the data’s covariance structure and potential Gaussian distribution centers.

  • Hierarchical Clustering (Kaushik and Mathur 2014) Hierarchical Clustering is a versatile clustering algorithm that iteratively combines or divides clusters to create nested structures. The hierarchical organization in Hierarchical Clustering is visualized as a tree, with the root representing the cluster containing all samples and the leaves representing clusters with only one sample each.

3.3 Node Importance Metrics

In RQ4, we employed seven approaches to measure node importance in order to perform test selection. These methods were extracted from existing studies (Hu et al. 2015; Qiong and Dongxia 2016; Yang et al. 2019; Ando et al. 2021). A short code sketch showing how these metrics can be computed is provided after the list.

  • Degree Degree measures the importance of a node based on the number of edges surrounding the node. Nodes with a higher number of edges are considered more important.

  • Eccentricity Eccentricity quantifies a node’s importance as the longest shortest-path distance from that node to any other node in the graph. Nodes with small eccentricity values are deemed more crucial, as they play a pivotal role in connecting various components and influencing information dissemination.

  • Center The Center approach assesses the importance of a node by calculating its distance from the network center. Nodes closer to the center are considered more important. Center posits that nodes closer to the center have a greater influence and significance in terms of network connectivity and information propagation.

  • Betweenness Centrality (BC) BC assesses the importance of a node by evaluating its role as an intermediary within the network. The node’s betweenness centrality depends on the number of times it acts as a transit point along the shortest paths in the network. A higher betweenness centrality indicates that the node plays a more crucial role in connecting paths between different nodes in the network, and therefore, it is considered more important.

  • Eigenvector Centrality (EC) Eigenvector Centrality associates a node’s importance with the degree to which it is connected to other important nodes. The centrality of a node is determined by the importance of the nodes it is linked to; if a node is connected to others with high Eigenvector Centrality, it will also be considered more important.

  • PageRank PageRank evaluates the relative significance of nodes in a graph by considering their connectivity and the influence of nodes linked to them. A node’s PageRank value depends on both its number of connections and the importance of the nodes that are connected to it. Nodes connected to nodes with higher PageRank values are regarded as more important in this ranking method.

  • Hyperlink-Induced Topic Search (Hits) Hits determines the importance of nodes through two metrics: Authorities and Hubs. Authorities are assessed based on the quantity and quality of inbound links a node receives, measuring its role as a source of information. Hubs, on the other hand, are evaluated based on the quantity and quality of outbound links, gauging a node’s role as an intermediary in information dissemination. These two metrics interact and are jointly used to assess the relative importance of nodes in the graph network.
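
As an illustration of how these metrics can be computed in practice, the sketch below uses networkx (the library employed in our implementation, see Section 4.5). The function name, the aggregation into a dictionary, and the eccentricity-based proxy for the Center metric are our own assumptions.

```python
import networkx as nx

def node_importance_scores(G):
    """Compute per-node importance scores for a connected undirected graph G.

    Eccentricity (and hence the graph center) is only defined for connected
    graphs, so disconnected graphs would need to be handled per component.
    """
    degree = nx.degree_centrality(G)                    # Degree
    ecc = nx.eccentricity(G)                            # Eccentricity (smaller = more central)
    radius = min(ecc.values())                          # nodes with ecc == radius form the center
    center_dist = {v: ecc[v] - radius for v in G}       # rough proxy for the "Center" metric
    bc = nx.betweenness_centrality(G)                   # Betweenness Centrality
    ec = nx.eigenvector_centrality(G, max_iter=1000)    # Eigenvector Centrality
    pr = nx.pagerank(G)                                 # PageRank
    hubs, authorities = nx.hits(G)                      # HITS
    return {"degree": degree, "eccentricity": ecc, "center": center_dist,
            "betweenness": bc, "eigenvector": ec,
            "pagerank": pr, "hubs": hubs, "authorities": authorities}

# Example: pick the k nodes with the highest PageRank as retraining candidates.
G = nx.karate_club_graph()
scores = node_importance_scores(G)
top_k = sorted(G.nodes, key=lambda v: scores["pagerank"][v], reverse=True)[:5]
print(top_k)
```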

4 Study Design

4.1 Overview

Similar to traditional deep neural networks (DNNs), testing Graph Neural Networks (GNNs) also faces challenges due to the absence of automated testing oracles. This leads to the need for manual labeling of test inputs, a process that can be labor-intensive, especially for large and intricate graphs. Furthermore, in specialized domains like drug discovery, as exemplified by protein interface prediction (Jha et al. 2022), labeling heavily relies on domain-specific knowledge, further escalating costs. In response to the labeling cost issue, existing studies mainly focus on two motivations in the field of DNN testing selection: misclassification detection and accuracy estimation.

  • Misclassification Detection Misclassification detection aims to select test inputs that are more likely to be misclassified by the DNN model. These selected tests serve two primary purposes: 1) Testers can use them for debugging DNN-based software to enhance the quality of DNNs, and 2) Testers can employ them for DNN model retraining, effectively reducing the cost associated with retraining.

  • Accuracy Estimation Accuracy Estimation aims to select a small set of representative test inputs capable of providing an accurate estimate of the entire test set’s accuracy.

However, a notable gap exists in adapting DNN test selection methods for GNNs. This challenge emerges due to the distinct nature of GNN test data, where test inputs (nodes) are interconnected, unlike DNNs, where each test sample is treated independently. Consequently, it remains uncertain whether test selection approaches originally tailored for DNNs can be suitably applied to GNNs. To fill the gap, we conduct an empirical study to assess the effectiveness of test selection methods when employed within the context of GNNs, including confidence-based approaches, clustering-based approaches, and node-importance-based approaches.

Fig. 2 Overview of our empirical study

Figure 2 presents an overview of our empirical study. Our study focuses on three crucial aspects of GNN test selection: GNN misclassification detection, GNN accuracy estimation, and GNN performance enhancement. Specifically, RQ1 focuses on misclassification detection, RQ2 corresponds to accuracy estimation, and RQ3 and RQ4 target GNN performance enhancement. In the following, we provide a detailed description of each research question.

  • RQ1: Misclassification Detection. We evaluate the effectiveness of confidence-based test selection methods for identifying potentially misclassified GNN test inputs, building on their demonstrated efficiency in previous work (Feng et al. 2020).

  • RQ2: Accuracy Estimation. We extend the concept of model confidence for accuracy estimation, evaluating the effectiveness of various clustering methods that utilize the model’s confidence vector in estimating the accuracy of the GNN test set.

  • RQ3: Performance Enhancement (using confidence-based methods). We assess the effectiveness of various test selection approaches, encompassing both misclassification detection and accuracy estimation approaches, in selecting retraining inputs for enhancing GNN accuracy.

  • RQ4: Performance Enhancement (using node importance-based methods). We investigate the effectiveness of node importance-based test selection methods in selecting retraining inputs for improving GNN accuracy. This is motivated by the fact that nodes with high importance tend to capture critical information, while low-importance nodes can introduce noise during retraining. Leveraging node importance in GNNs for test selection is a novel and unexplored area of research.

To provide a more comprehensive assessment, we conducted experiments using a diverse set of 7 graph datasets with 8 GNN models to evaluate the performance of 22 test selection approaches. It is important to emphasize that our datasets include not only widely adopted node-level datasets but also edge-level and graph-level datasets in order to ensure a robust evaluation. By analyzing the performance of current test selection approaches for GNNs, we aim to investigate the limitations of existing test selection methods in the context of GNNs and provide insights for the future development of novel GNN-oriented test selection methods.

4.2 Research Questions

Our experimental evaluation answers the research questions below.

  • RQ1: How effective are different test selection metrics in detecting misclassified test inputs for GNNs? Test selection has emerged as a promising approach for reducing the labelling cost in the testing process. While several test selection techniques have been proposed in the context of DNN testing, their adaptation to GNNs poses distinctive challenges owing to the differences between the test data for DNNs and GNNs. In particular, DNN test inputs are typically independent of one another, whereas GNN test inputs, represented as nodes, exhibit complex interdependencies. Consequently, it remains uncertain whether the DNN test selection methods can perform well on GNNs. In this research question, we assess the effectiveness of multiple test selection approaches in identifying test inputs that are more likely to be misclassified within the context of GNNs.

  • RQ2: How do various accuracy estimation methods perform when applied to GNNs? Test selection approaches for accuracy estimation are designed to select a subset of test inputs that can effectively estimate the accuracy of the entire testing set. By labeling only the selected tests, the labeling costs can be reduced. In this research question, we empirically assess various test selection approaches in estimating the accuracy of GNNs.

  • RQ3: How do different test selection approaches perform in selecting informative inputs for retraining GNN models? In this research question, we investigate the effectiveness of diverse confidence-based test selection methods in selecting retraining inputs for GNN accuracy improvement. These methods encompass misclassification detection approaches (RQ1) and accuracy estimation approaches (RQ2).

  • RQ4: To what extent can node importance guide the selection of retraining inputs for GNNs? In this research question, we explore the effectiveness of node importance-based methods in selecting inputs to enhance GNN accuracy. This exploration is motivated by: 1) Nodes with high importance typically contain crucial information and have a significant impact on the entire graph, making them valuable for improving model performance; 2) Conversely, unimportant nodes can introduce noise or irrelevant data during retraining, potentially degrading model performance; 3) Node importance in GNNs remains unexplored for selecting crucial tests. Our research aims to address this gap.

4.3 GNN Models and Datasets

In our experiments, we utilized 7 graph datasets and 8 GNN models to assess the performance of 22 test selection approaches. Detailed information about each dataset and model is elaborated upon in the subsequent sections.

4.3.1 Graph Datasets

To provide a more comprehensive evaluation, our dataset encompasses not only widely adopted node-level datasets but also edge-level and graph-level datasets. Node-level tasks are centered on making predictions for individual nodes within a graph. The node-level datasets we utilized consist of Cora (Yang et al. 2016), CiteSeer (Yang et al. 2016), and PubMed (Yang et al. 2016). Edge-level datasets focus on predicting edge types between two given nodes. Our adopted edge-level datasets are DrugBank (Wishart et al. 2018) and BindingDB (Liu et al. 2007). In contrast, graph-level tasks are oriented towards predicting global properties or characteristics of an entire graph. Our selection of graph-level datasets includes Mutagenicity (Riesen and Bunke 2008), NCI1 (Shervashidze et al. 2011), GraphMNIST (Bianchi et al. 2021), and MSRC21 (Neumann et al. 2016).

1) Node Classification Datasets

  • Cora (Yang et al. 2016) The Cora dataset comprises 2,708 scientific publications (nodes) and 5,429 links (edges) representing citations between them. Nodes represent machine learning papers, and edges indicate citations between pairs of papers. Each paper is categorized into one of seven classes, including topics like reinforcement learning and neural networks.

  • CiteSeer (Yang et al. 2016) The CiteSeer dataset comprises 3,327 scientific publications (nodes) and 4,732 links (edges). Each paper belongs to one of six categories (e.g., artificial intelligence and machine learning).

  • PubMed (Yang et al. 2016) The PubMed dataset contains 19,717 diabetes-related scientific publications (nodes) connected by 44,338 links (edges). Publications are classified into three classes (e.g., Cancer and AIDS).

2) Graph Classification Datasets

  • Mutagenicity (Riesen and Bunke 2008) The Mutagenicity dataset presents a diverse collection of 4,337 small molecule graphs, each belonging to one of two distinct classes. It serves as a valuable resource for exploring the mutagenic properties of these molecules, offering insights into their potential health and environmental implications.

  • NCI1 (Shervashidze et al. 2011) NCI1 encompasses 4,110 small molecule graphs, comprising 407 unique molecules classified into two fundamental categories: toxicity and biological relevance. This dataset plays a crucial role in toxicity prediction and drug discovery efforts.

  • GraphMNIST (Bianchi et al. 2021) GraphMNIST stands as a significant resource in the field of computer vision, consisting of a vast database of handwritten digits. It comprises 412 instances across ten distinct classes, corresponding to integer values from 0 to 9.

  • MSRC21 (Neumann et al. 2016) The MSRC21 dataset is a comprehensive compilation of 563 real-world network graphs from the field of computer vision.

3) Edge Classification Datasets

  • DrugBank (Wishart et al. 2018) The DrugBank dataset is a multi-class classification dataset primarily focused on drug-drug interactions (DDIs). It involves predicting the interaction type between pairs of drugs given their SMILES strings. Compiled manually from FDA/Health Canada drug labels and original literature, the dataset encompasses 86 distinct interaction types, covering a total of 191,808 DDI pairs involving 1,706 unique drugs.

  • BindingDB (Liu et al. 2007) BindingDB is a public, web-accessible database dedicated to measuring binding affinities. It primarily focuses on the interactions between proteins considered to be drug targets and drug-like small molecules. In our experiment, we classified edges based on the magnitude of their binding affinities for the edge classification task.

4.3.2 GNN Models

  • GCN (Kipf and Welling 2016) GCN is a specialized type of convolutional neural network designed to operate directly on graph structures. It addresses the task of classifying nodes within graphs, such as documents in citation networks, where only a limited number of nodes have labels. The fundamental concept behind GCN involves leveraging the relationships between edges in a graph to consolidate node information and produce updated node representations. GCN has been applied in various research studies (He et al. 2020; Hong et al. 2020). A minimal code sketch of a GCN-based node classifier is provided after this list.

  • GAT (Veličković et al. 2017) The inception of GAT arose from the necessity to enhance traditional Graph Convolutional Networks (GCN). GCN considers all neighboring nodes as equally important. However, in practical scenarios, different neighboring nodes can hold different degrees of significance. As a result, GAT incorporates a self-attention mechanism that assigns individualized attention scores to each neighbor. Consequently, GAT excels in identifying and prioritizing the most crucial neighbors during the information aggregation process.

  • Graph Isomorphism Network (GIN) (Xu et al. 2018) GIN is designed for processing graph data and solving the graph isomorphism problem. Its working principle involves learning the structural information and connectivity patterns among nodes in a graph, enabling effective identification and comparison of isomorphism between different graphs. The core idea of GIN is to iteratively aggregate feature information from nodes within the graph, capturing and representing essential features of the entire graph.

  • Higher-order Graph Neural Networks (GraphNN) (Morris et al. 2019) GraphNN is an advanced class of graph-based machine learning models that extend traditional GNNs to capture intricate higher-order relationships within graph-structured data.

  • Message Passing Neural Networks (MPNNs) (Gilmer et al. 2017) MPNN is a general framework for supervised learning on graph-structured data. It abstracts the commonalities among several state-of-the-art graph-based neural models.

  • Attention-based Graph Neural Network (AGNN) (Thekumparampil et al. 2018) AGNN is a neural network architecture designed for graph data analysis. Its distinctive feature is the complete removal of traditional fully connected intermediate layers, replaced with attention mechanisms to better preserve the information within the graph structure.

  • Graclus GNNs (Mesquita et al. 2020) Graclus GNNs is an approach that integrates the Graclus graph clustering algorithm with GNNs. Graclus is utilized for partitioning a given graph into clusters or communities based on node similarity or relationships. In this model, the graph data undergoes pre-processing with Graclus.

  • GNNs with convolutional ARMA filters (ARMA) (Bianchi et al. 2021) ARMA refers to an optimized GNN architecture with a new graph convolutional layer inspired by the auto-regressive moving average (ARMA) filter. ARMA brings significant improvements for node classification, graph classification, etc.

  • GSAGE-E (Hamilton et al. 2017) Graph Sample and Aggregate (GraphSAGE) generates embeddings for nodes by accumulating and integrating characteristics from their adjacent nodes. GraphSAGE samples a predetermined quantity of neighbors for each node. GSAGE-E is a variant model of GraphSAGE aimed at edge classification tasks. In this model, the fused information of two nodes (i.e., concatenating the vectors of two nodes) is utilized to predict the category of the edge between them.

  • TAGCN-E (Du et al. 2017) The Topology Adaptive GCN (TAGCN) employs a collection of learnable filters, each of a fixed size, to execute convolutional operations on graph structures. These filters adapt to the unique topology of the graph during the convolution process. TAGCN-E is a variant model of TAGCN that focuses on edge classification. In TAGCN-E, the fused information of two nodes is utilized to predict the category of the edge between them.
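
To give a concrete sense of how such models are assembled, the snippet below sketches a minimal two-layer GCN node classifier in PyTorch Geometric (the framework used in our implementation, see Section 4.5). The class name, layer sizes, and dropout rate are illustrative assumptions rather than the exact configurations used in our experiments.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNNodeClassifier(torch.nn.Module):
    """A minimal two-layer GCN for node classification (illustrative only)."""

    def __init__(self, num_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        # Each GCNConv layer performs one round of neighborhood aggregation.
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=0.5, training=self.training)
        return self.conv2(h, edge_index)

# The per-test confidence vectors used by the selection metrics are simply the
# softmax of the model's output logits, e.g.:
# probs = torch.softmax(model(data.x, data.edge_index), dim=1)
```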

4.4 Measurements

4.4.1 Percentage of Fault Detected (PFD)

Following the prior research (Feng et al. 2020), we employ PFD to assess the effectiveness of various test selection methods in detecting misclassified test inputs. The computation of PFD is represented in Formula (6). From a mathematical standpoint, PFD measures the ratio of correctly detected misclassified test inputs to the total number of misclassified tests within the test set. A higher PFD value indicates that the evaluated test selection approach is more effective at identifying misclassified inputs.

$$\begin{aligned} PFD = \frac{\#T_{detect}}{\#T_{mis}} \end{aligned}$$
(6)

where \(\#T_{detect}\) represents the number of detected misclassified test inputs, while \(\#T_{mis}\) denotes the total number of misclassified test inputs in the test set. In our study, we assessed the PFD values of different test selection approaches under varying ratios of prioritized tests.
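
As a minimal illustration (function and variable names are ours), PFD can be computed as follows, given the ground-truth labels and model predictions of the selected subset together with the total number of misclassified tests in the full test set:

```python
import numpy as np

def pfd(selected_labels, selected_preds, total_misclassified):
    """Percentage of Fault Detected (Formula (6)): the share of all
    misclassified tests that are contained in the selected subset."""
    detected = int(np.sum(np.asarray(selected_labels) != np.asarray(selected_preds)))
    return detected / total_misclassified
```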

4.4.2 Root Mean Square Error

The root mean square error (RMSE) measures the average difference between the estimated accuracy and the actual accuracy of a test set. The calculation formula is shown in Formula (7). A lower RMSE value indicates that the selected test inputs can predict the accuracy of the entire test set more accurately, indicating that the utilized test selection method is more effective.

$$\begin{aligned} RMSE=\sqrt{\frac{1}{n} \sum _{i=1}^n\left( \hat{acc}_i-acc\right) ^2} \end{aligned}$$
(7)

where \(acc\) refers to the actual accuracy and \(\hat{acc}_i\) refers to the i-th estimated accuracy.
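
Correspondingly, a minimal sketch of the RMSE computation in Formula (7), with illustrative names, is:

```python
import numpy as np

def rmse(estimated_accs, actual_acc):
    """Root mean square error (Formula (7)) between the accuracy estimates
    obtained from the selected tests and the actual test-set accuracy."""
    est = np.asarray(estimated_accs, dtype=float)
    return float(np.sqrt(np.mean((est - actual_acc) ** 2)))
```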

4.5 Implementation and Configuration

This project is implemented using the PyTorch 1.11.0 and PyTorch Geometric 2.1.0 framework. We integrated the available implementations of the test selection approaches (Feng et al. 2020; Ma et al. 2021; Hu et al. 2021) into our experimental pipeline. To implement the clustering-based test selection methods, we utilized the package scikit-learn 1.0.2. To implement node importance metrics, we employed the package networkx 2.6.3. Our experiments were conducted on a high-performance computer cluster, with each cluster node equipped with a 2.6 GHz Intel Xeon Gold 6132 CPU and an NVIDIA Tesla V100 16G SXM2 GPU. For data processing tasks, we conducted corresponding experiments on a MacBook Pro laptop running Mac OS Big Sur 11.6, equipped with an Intel Core i9 CPU and 64 GB of RAM.

Table 1 Effectiveness of misclassification detection approaches with respect to random selection (baseline) in terms of PFD
Table 2 Comparative effectiveness of misclassification detection approaches relative to baseline (normalization analysis)

5 Results and Analysis

5.1 RQ1: Test Selection for GNN Misclassification Detection

Objectives: We investigate the effectiveness of 8 confidence-based test selection methods for GNNs in the context of node classification and graph classification tasks, respectively.

Experimental Design: In the first step, we collected 10 test selection methods from existing studies (Ma et al. 2021; Feng et al. 2020; Weiss and Tonella 2022) that can be adapted for GNN misclassification detection. These approaches have been proven effective in the context of DNNs. Moreover, we also evaluated a test prioritization method specifically designed for GNNs, called GraphPrior (Dang et al. 2023), and compared its effectiveness with these DNN test prioritization methods. To provide a more comprehensive evaluation, we include not only node classification datasets but also edge classification and graph classification datasets in our analysis. Following the methodology of previous research (Feng et al. 2020), we utilized the PFD metric to evaluate the effectiveness of various test selection methods in selecting misclassified test inputs. PFD directly measures the ratio of correctly identified misclassified test inputs to the total number of misclassified tests within the test set. Hence, it provides a straightforward reflection of the effectiveness of test selection methods. A higher PFD value indicates that the evaluated test selection approach is more effective at detecting misclassified inputs. Moreover, in order to more clearly demonstrate the difference in effectiveness between each test selection method and the baseline method (random selection), we normalized the experimental results (using Formula (8), following Ali et al. 2014) and report the normalized results.

$$\begin{aligned} x_{\text {normalized}} = \frac{x - x_{\text {min}}}{x_{\text {max}} - x_{\text {min}}} \end{aligned}$$
(8)

where \( x \) is the original value, \( x_{\text {min}} \) and \( x_{\text {max}} \) are the minimum and maximum values over all results, and \( x_{\text {normalized}} \) is the resulting normalized value.
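
Formula (8) is standard min-max normalization; a one-function sketch (with an illustrative name) is shown below. In our result tables, the value of random selection serves as the reference point and is reported as 0.

```python
import numpy as np

def min_max_normalize(values):
    """Min-max normalization (Formula (8)): maps values into [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Example: normalized = min_max_normalize(sum_of_pfds_per_method)
```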

Results: The results of RQ1 are presented in Tables 1, 2, 3, and Fig. 3. Table 1 presents the effectiveness of various test selection approaches on graph datasets across three different classification tasks: node classification, edge classification, and graph classification. We shaded the approach with the highest effectiveness for each case in gray. On the node classification datasets, we highlighted in bold the method that performs best among all approaches not specifically designed for GNNs.

From Table 1, we see that on the node classification datasets (i.e., Cora, CiteSeer, and PubMed), GraphPrior, which is specifically designed for GNNs, demonstrates the highest effectiveness across all cases. Furthermore, among all approaches not specifically designed for GNNs, Margin performs the best in the majority of cases (90% of all cases). Similarly, on the graph classification datasets, the best-performing test selection method is also Margin. On the edge classification datasets (i.e., DrugBank and BindingDB), the best-performing method is Least Confidence, which performs the best across all cases.

Table 3 Effectiveness comparison of misclassification detection approaches on node and graph classification tasks, respectively
Fig. 3 Percentage of Fault Detected (y-axis) with different test selection approaches given the ratio of tests executed (x-axis)

Table 2 presents the normalization results for the sum of PFDs for all test selection methods. We utilize random selection as the baseline for normalization; hence, the normalization results for random selection are consistently 0 across all subjects. Detailed normalization calculation steps are provided in the experimental design of RQ1. In this context, a value closer to 1 for a test selection approach indicates higher effectiveness. The experimental results confirm the above conclusions: on the node classification datasets, GraphPrior, which is specifically tailored for GNNs, is the most effective method in each case. Among the approaches not specifically designed for GNNs, Margin outperforms the others in most instances. On graph classification datasets, Margin is also the top-performing test selection method. For edge classification datasets, Least Confidence is the most effective approach.

Table 3 provides a more detailed breakdown of the effectiveness of various test selection methods across different classification tasks, including the node-level, edge-level, and graph-level classification tasks. The method that performs the best in each case is still highlighted in gray, and in node classification datasets, the best-performing method among all methods not specifically designed for GNNs is also highlighted in bold. In Table 3, we see that, in the node classification datasets, the best-performing method is GraphPrior, which is specifically designed for GNNs. Among all methods not specifically designed for GNNs, Margin performs the best. In the edge-level datasets, Least Confidence and DeepGini perform the best. In graph-level datasets, Margin performs the best. This further confirms the conclusions obtained above.

However, we find that in GNNs, uncertainty-based test selection methods (such as Margin Sampling) perform less effectively than they do in the context of traditional DNNs. Based on the findings from previous work (Feng et al. 2020), DeepGini can achieve a PFD of around 90% when selecting 30% of the data, meaning that DeepGini can detect about 90% of misclassified tests after selecting 30% of the tests from the test set. However, as suggested in Fig. 3, which visually illustrates the effectiveness of different test selection methods, DeepGini can only detect around 50% of misclassified tests when selecting 30% of tests in the context of GNN test selection. Even the best-performing test selection method, Margin Sampling, can only detect approximately 50% to 70% of misclassified tests, significantly lower than its performance on DNNs. Below, we analyze the reasons for the reduced performance of uncertainty-based methods.

There are four potential factors that hinder confidence-based approaches from achieving the same level of effectiveness on GNNs as on DNNs: 1) Interdependencies among test inputs: confidence-based prioritization approaches typically treat each test as independent and therefore do not account for the interdependencies among test inputs (nodes) within the GNN test set, which are crucial for GNN model inference; 2) Irregular data: graph data is typically irregular, with varying numbers of connections and neighbor nodes per node, and this irregularity adds complexity that confidence-based approaches may struggle to capture effectively; 3) Local and global dependencies: graph data typically exhibit both local and global dependencies, and node attributes and connections make it challenging for confidence-based methods to capture these multi-scale dependencies; 4) Size and complexity of graphs: graphs can differ greatly in size and complexity, which can affect confidence-based methods when applied across such datasets.

Table 4 Effectiveness of accuracy estimation approaches with respect to random selection (baseline) in terms of RMSE
Table 5 Average Effectiveness of accuracy estimation approaches with respect to random selection (baseline) in terms of RMSE
Table 6 Effectiveness comparison among accuracy estimation approaches on node, graph, and edge classification, respectively

5.2 RQ2: Test Selection for GNN Accuracy Estimation

Objectives: We evaluate the effectiveness of various clustering methods that utilize the model’s confidence vector in estimating the accuracy of the GNN test set.

Experimental Design: In the initial step, we select five widely recognized clustering algorithms. For each test instance in the test set, we obtain the model’s prediction probability vector, which reflects the model’s confidence in its predictions; we call this vector the confidence vector. We then apply each clustering algorithm to group the instances based on their confidence vectors and select N central points from each cluster. The chosen test instances form a subset of the original test set, which is then used to estimate the overall accuracy of the entire test set.
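
As a minimal sketch of this pipeline, we assume scikit-learn's KMeans as one of the clustering algorithms and softmax outputs (as NumPy arrays) as the confidence vectors; the cluster count and per-cluster budget below are illustrative, not the study's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_accuracy(conf_vectors, pred_labels, true_labels, n_clusters=10, n_per_cluster=1):
    """Cluster tests by confidence vector, keep the points closest to each cluster centre,
    and use the accuracy on that small labeled subset as an estimate for the whole test set."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(conf_vectors)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        dists = np.linalg.norm(conf_vectors[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:n_per_cluster]])
    selected = np.asarray(selected)
    # In practice, only these selected tests would need to be manually labeled.
    return float(np.mean(pred_labels[selected] == true_labels[selected]))
```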

Fig. 4 Root Mean Squared Errors (y-axis) of different test selection approaches given the number of tests selected (x-axis)

Results: The results for RQ2 are presented in Tables 4, 5, and 6 and Fig. 4. Specifically, Table 4 shows the effectiveness of each test selection approach relative to random selection; the values are computed as illustrated in Formula (9). Note that when RMSE is used to measure effectiveness, a smaller RMSE implies higher effectiveness. Therefore, in Formula (9), a positive diff for a test selection method TS indicates that the sum of RMSE values for TS is lower than that of random selection, i.e., TS is more effective than random selection; the larger the positive diff, the greater the advantage of TS over random selection.

$$\begin{aligned} diff =\sum _{r=10}^{100}\left( RMSE_{Random}^{r} - RMSE_{TS}^{r}\right) \end{aligned}$$
(9)

where r represents the number of tests selected. For example, if \(r = 80\), 80 tests are selected from the test set. \(RMSE_{TS}^{r}\) refers to the effectiveness (measured by RMSE) of the test selection approach TS when selecting r test inputs, and \(RMSE_{Random}^{r}\) refers to the effectiveness (measured by RMSE) of random selection when selecting r test inputs.
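
As a small, hedged illustration of Formula (9), assuming the RMSE values are stored per selection size r and that the selection sizes increase in steps of 10 (an assumption, since the step is not stated in the formula):

```python
def diff(rmse_random: dict, rmse_ts: dict, sizes=range(10, 101, 10)) -> float:
    """Formula (9): a positive result means TS achieves lower RMSE than random selection."""
    return sum(rmse_random[r] - rmse_ts[r] for r in sizes)

# Example: diff({10: 0.08, 20: 0.07}, {10: 0.05, 20: 0.06}, sizes=(10, 20)) == 0.04
```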

In Table 4, we see that, on node classification datasets (i.e., Cora, CiteSeer, and PubMed), the clustering-based test selection methods perform better than random selection in the majority of cases (85%). Similarly, on graph classification datasets (Mutagenicity and NCI1), the clustering methods consistently perform better than random selection. Table 6 further illustrates the effectiveness of different clustering-based test selection approaches, with the best-performing method highlighted in gray for each case. In Table 6, the “Number of Selected Test Inputs” indicates the number of tests selected from the test set. We see that, across both node classification and graph classification tasks, clustering-based test selection methods consistently perform the best.

However, the improvement achieved by the clustering-based test selection methods over random selection is marginal. For example, when selecting ten tests, the best clustering-based method outperforms random selection by only approximately 0.03 in RMSE, and when selecting 80 tests, by only around 0.01. Fig. 4 visually confirms these conclusions, with the blue line representing the baseline (i.e., random selection): while all clustering methods are effective in most cases, the improvements achieved are slight. Moreover, Table 5 presents the average differences between each test selection method and random selection across all cases; the values are all around 0.01. Hence, we conclude that clustering-based methods are effective in selecting representative test data for node classification and graph classification datasets but achieve only limited improvements.

Above, we analyzed the effectiveness of clustering-based test selection on node classification and graph classification tasks. Next, we focus on the effectiveness of clustering methods on the edge classification dataset (BindingDB). Table 4 shows that, on the edge classification datasets, the clustering-based test selection approaches perform better than random selection in only 40% of the cases. Table 6 further highlights the effectiveness of clustering methods across the different classification tasks: on edge-level tasks, random selection exhibits better performance in most cases, and from 30 selected tests up to 100, it consistently shows the best performance. This implies that, in the majority of cases, clustering-based test selection does not perform as well as random selection on edge classification datasets. In the following, we analyze the potential reasons:

In graph datasets, a node can be connected to multiple other nodes, so the number of edges can far exceed the number of nodes; the information carried by edges is therefore more complex and diverse, and the data distribution more uneven. Clustering algorithms aim to group data points with high similarity, and when the data distribution is uneven, they can struggle to assign data to the correct clusters. This leads to poor performance when clustering-based test selection methods are applied to edge classification tasks.

Table 7 Effectiveness of test selection approaches with respect to random selection (baseline) in selecting retraining inputs to improve GNN accuracy

5.3 RQ3: Confidence-Based Test Selection for GNN Performance Enhancement

Objectives: We evaluate the effectiveness of various test selection methods derived from the two aforementioned research questions in selecting informative retraining inputs to enhance GNN model performance. Specifically, these methods correspond to the test selection approaches for misclassification detection (RQ1) and accuracy estimation (RQ2).

Experimental Design: In previous research questions, we assessed multiple test selection methods tailored for misclassification detection and accuracy estimation. In this research question, we apply these methods to select tests for the retraining of the original GNN model, with the objective of improving its prediction accuracy.

The steps and methods we employed for retraining follow the existing study of DNN test selection (Ma et al. 2021). In the initial phase, given a GNN model M and a graph dataset, we partition the dataset into a training set, a candidate set, and a test set. The test set remains untouched throughout the process. First, we train an initial GNN model using the training set and record the initial accuracy of M on the test set. Subsequently, we apply various test selection methods to select different subsets of data from the candidate set. We then utilize the selected test data to retrain the original GNN model, recording the model’s accuracy after each retraining. By observing the improvement in model accuracy after retraining with data selected using different test selection methods, we can assess and compare the effectiveness of these data selection methods.
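
A hedged sketch of this retraining protocol follows. The training, evaluation, and selection callables are placeholders for the actual GNN training code and for whichever selection metric is under evaluation; none of these names come from the study's implementation.

```python
from typing import Callable, Sequence

def retraining_study(
    train_fn: Callable[[Sequence], object],            # trains a fresh GNN on the given data
    eval_fn: Callable[[object, Sequence], float],      # returns accuracy on the held-out test set
    select_fn: Callable[[object, Sequence, float], Sequence],  # picks a subset of the candidate set
    train_set: Sequence,
    candidate_set: Sequence,
    test_set: Sequence,
    budgets: Sequence[float] = (0.1, 0.2, 0.3),
) -> dict:
    """Train an initial model, then retrain with candidate data chosen by a selection metric,
    recording test accuracy after each retraining round."""
    model = train_fn(train_set)
    accuracies = {0.0: eval_fn(model, test_set)}       # initial accuracy of M on the test set
    for budget in budgets:
        subset = select_fn(model, candidate_set, budget)
        retrained = train_fn(list(train_set) + list(subset))
        accuracies[budget] = eval_fn(retrained, test_set)
    return accuracies
```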

Fig. 5 Test accuracy (y-axis) achieved by different data selection approaches given the percentage of retrain data selected (x-axis)

Additionally, in the retraining experiments (RQ3), the initial accuracies of the utilized GNN models are as follows: 40% to 60% on CiteSeer, 65% to 70% on PubMed, 35% to 65% on Cora, and 65% to 70% on GraphMNIST. The original accuracy ranges of the evaluated models follow the work of Hu et al. (2021).

Results: The experimental results for RQ3 are presented in Table 7 and Fig. 5. Table 7 presents the effectiveness of all test selection methods in terms of their relative improvement or decline compared to the baseline method (i.e., random selection). We calculate the improvement of each test selection method relative to random selection using Formula (10). In Table 7, if a test selection method outperforms random, its value is positive and highlighted in gray; conversely, if it performs worse, its value is negative and highlighted in white.

$$\begin{aligned} imp=\sum _{i=1}^{steps}\left( Acc_{TS}^{i}-Acc_{Random}^{i}\right) \end{aligned}$$
(10)

where steps represents the total number of retraining steps. \(Acc_{TS}^{i}\) refers to the accuracy of the model retrained with the data selected by the test selection metric TS at step i, and \(Acc_{Random}^{i}\) refers to the accuracy of the model retrained with the data selected by random selection at step i. imp represents the effectiveness improvement of the test selection metric TS over random selection.
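
Formula (10) can be computed analogously to Formula (9); a minimal sketch, assuming the per-step accuracies are collected in lists ordered by retraining step:

```python
def imp(acc_ts: list, acc_random: list) -> float:
    """Formula (10): summed accuracy gain of metric TS over random selection across retraining steps."""
    return sum(a - b for a, b in zip(acc_ts, acc_random))
```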

In Table 7, we see that some test selection methods, such as DeepGini and MCP, along with GraphPrior, perform better than the baseline (random selection) in the majority of cases. Specifically, in approximately 61% of the cases, the evaluated test selection methods perform better than random selection. The top three best-performing methods are GraphPrior, Margin Sampling, and MCP. GraphPrior achieves the best performance in 50% of the cases, MCP in 21.43% of the cases, and Margin Sampling leads in 14.29% of the cases.

However, despite these improvements, the extent to which test selection methods improve GNN model accuracy over the baseline is slight. Figure 5 offers a visual representation of the effectiveness of the various test selection methods; the blue line denotes the baseline (random selection). We see that the majority of uncertainty-based test selection methods, as well as GraphPrior, show only minor improvements over the baseline, with some methods even performing worse than random selection. In contrast, a previous study on DNN test selection (Hu et al. 2021) demonstrated that some uncertainty-based metrics, such as Margin and MCP, consistently exhibit strong performance; these metrics do not achieve consistently strong performance in GNN test selection.

Below, we provide some potential reasons why some test selection methods (e.g., Margin and MCP) are effective in DNNs but exhibit only small improvements over the baseline approach when applied to GNNs.

  • Inadequate representativeness The inputs selected by test selection methods aimed at misclassification detection are typically samples that are more likely to be misclassified. These inputs can occupy a specific region of the feature space, representing only a small part of the data distribution. They may not be sufficient to represent the complex structure and diversity of the entire graph, which limits the effectiveness of retraining.

  • Differences in data structure DNNs typically process data where each sample is independent of others, and retraining the model does not require considering the relationships between samples. In contrast, in graph data processed by GNNs, nodes (i.e., samples) are interconnected through edges. Therefore, the information of a node depends not only on its own features but also on its neighboring nodes and the overall structure of the graph. When test selection methods from DNNs are applied to GNNs, these methods cannot adequately capture and utilize the complex interdependence of graph data for retraining.

  • Differences in Learning Mechanisms GNNs update node representations by aggregating information from neighboring nodes, which differs from the working mechanism of DNNs. Therefore, a node can be misclassified not only because of its own features but also because of information from its neighboring nodes. Simply selecting these misclassified inputs for retraining ignores the crucial information from their neighbors.


5.4 RQ4: Node Importance-Based Test Selection for GNN Performance Enhancement

Objectives: We assess the effectiveness of node importance-based test selection methods in improving GNN accuracy during retraining. This investigation is driven by several factors: 1) Nodes with high importance typically encapsulate critical information and have a more significant impact on the overall graph (Park et al. 2019). Consequently, these nodes are more likely to capture essential information that is crucial for enhancing model performance; 2) Unimportant nodes may contain noise or irrelevant data that could introduce interference during retraining, potentially leading to a reduction in model performance; 3) Node importance is a distinctive data feature in GNNs that can be leveraged for the selection of critical tests. Currently, there is a gap in research regarding whether node importance can effectively guide the selection of retraining inputs. Therefore, it is imperative to conduct relevant studies in this area.

Experimental Design: In the initial step, we evaluated the initial accuracy of the target GNN model. Subsequently, we ranked all tests in the test set by importance, using each node importance metric. Based on each metric, we selected the top important tests, ranging from 10% to 80%, and then proceeded to retrain the original GNN model. We recorded the model’s accuracy after each round of retraining.
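
A minimal sketch of the ranking step is shown below, using degree centrality from NetworkX as one illustrative importance metric (the metrics evaluated in the study may differ); the retraining loop then follows the same protocol as in RQ3.

```python
import networkx as nx

def select_by_importance(graph: nx.Graph, candidate_nodes, ratio: float):
    """Rank candidate nodes by an importance metric (here degree centrality)
    and keep the top `ratio` fraction as retraining inputs."""
    importance = nx.degree_centrality(graph)           # alternatives: PageRank, betweenness, ...
    ranked = sorted(candidate_nodes, key=lambda n: importance[n], reverse=True)
    return ranked[: int(len(ranked) * ratio)]

# Example: the top 10% most important candidate nodes
# selected = select_by_importance(g, candidate_nodes, ratio=0.10)
```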

Results: The experimental results for RQ4 are presented in Table 8. Here, we have shaded in gray the approach with the highest effectiveness for each case. We see that, in the majority of cases, test selection methods based on node importance exhibit limitations when selecting inputs for retraining GNN models to improve accuracy. These methods tend to perform less effectively than random selection. Specifically, random selection outperforms node importance-based methods in 75% of the cases. Conversely, node importance-based test selection methods excel in only the remaining 25% of cases. Some potential factors that can lead to the low performance of node importance-based methods include:

  • Lack of Diversity Node importance methods can select a group of similar or closely related nodes, potentially resulting in a lack of diversity in the selected data. In contrast, randomly selecting nodes can introduce greater diversity, thereby enhancing the model’s ability to generalize.

  • Overfitting If the nodes selected by node importance-based methods are overly specific or concentrated in a particular area, the model can be prone to overfitting to these selected nodes. Randomly selected nodes, on the other hand, provide a more varied set of information, helping to mitigate overfitting.

  • Noise Tolerance Occasionally, incorporating some noisy or less significant nodes can potentially enhance the model’s robustness. Randomly selected nodes can introduce such beneficial noise.

Table 8 Effectiveness of node importance-based test selection approaches with respect to random selection (baseline) in selecting retraining inputs to improve GNN accuracy

6 Threats to Validity

Threats to Internal Validity. The internal threats to validity primarily stem from the implementation of the evaluated test selection approaches. To mitigate this threat, we implemented these approaches using the widely adopted PyTorch library and utilized the implementations of the compared approaches as provided by their respective authors. Another internal threat arises from the selection of clustering algorithms. The effectiveness of test selection can be influenced by the performance of the selected clustering algorithm. To mitigate this threat, we utilized established frameworks in our study. We opted for the widely adopted scikit-learn framework (Pedregosa et al. 2011) to implement the clustering algorithm. Scikit-learn is renowned for its robust performance and extensive user community.

Threats to External Validity. The primary external threats to the validity of our study are closely linked to two key aspects: the GNN models under evaluation and the test datasets used in our research. These factors can significantly impact the generalizability and applicability of our findings. To mitigate these potential threats, we made a conscious effort to include a large and diverse set of subjects (pairs of datasets and models) in our study. These subjects represent different combinations of GNN models and test datasets, ensuring that our analysis covers a wide spectrum of scenarios. Firstly, we recognized the critical role of dataset diversity and comprehensiveness in evaluating the efficacy of test selection approaches. We utilized seven prevalent graph datasets, encompassing not only node classification datasets but also graph classification datasets. This deliberate selection allows us to account for various problem domains, thereby enhancing the robustness and adaptability of our study to a multitude of GNN applications. Beyond dataset diversity, the choice of GNN models is pivotal in gaining insights into how test selection methods interact with different model architectures. To this end, we utilized a set of eight distinct GNN models, each possessing its unique characteristics and capabilities. These models span a spectrum of complexity and sophistication, ranging from simpler models to more advanced ones.

7 Related Work

We present the related work from three perspectives: DNN test selection, DNN testing, and empirical studies on active learning.

7.1 DNN Test Selection

To tackle the challenge of labeling costs, test selection (Aghababaeyan et al. 2023b) has emerged as a practical solution.

In terms of misclassification detection, Ma et al. (2021) conducted an evaluation of various test selection methods tailored for misclassification detection, including coverage-based, surprise adequacy-based, and confidence-based approaches. Experimental results demonstrated that confidence-based metrics exhibit a robust ability to identify misclassified inputs, surpassing both the surprise adequacy-based and coverage-based test selection approaches. Hu et al. (2021) conducted an empirical evaluation of 15 active learning metrics to determine their effectiveness in selecting inputs for retraining DNNs. Their research demonstrated that the choice of data selection metrics can significantly influence the quality of the resulting model when using active learning for training.

Kim et al. (2019) proposed the Surprise Adequacy Criteria (SADL) for DNN test selection. SADL operates by extracting intermediate outputs from both the test and training data of DNNs, treating them as features, and then evaluating the surprise adequacy based on the dissimilarity between these features. In this process, two measurements are utilized: Likelihood-based Surprise Adequacy (LSA) and Distance-based Surprise Adequacy (DSA). LSA employs kernel density estimation to compute the dissimilarity, while DSA directly utilizes Euclidean distance. Despite the effectiveness of SADL in the context of DNNs, SADL cannot be directly applied to GNNs. This is because implementing SADL requires measuring the distance between the targeted test inputs and training inputs. However, their method for measuring distance is specifically designed for image/text data, which cannot be directly applied to graph-structured data.

Wang et al. (2021) proposed PRIMA for DNN test prioritization, which identifies and prioritizes potentially misclassified test inputs based on intelligent mutation analysis. Despite its effectiveness in the context of DNNs, PRIMA is not suitable for GNNs. This is because PRIMA’s mutation operators are not adapted to graph-structured data and GNN models.

In terms of accuracy estimation, Li et al. (2019) introduced the CES (Cross Entropy-based Sampling) method to tackle this challenge. CES accomplishes test selection by minimizing the cross-entropy between the selected subset and the original test set, ensuring that the distribution of the selected test inputs closely matches that of the original test set. Chen et al. (2020) proposed PACE, which employs a range of techniques to perform test selection, including clustering, prototype selection, and adaptive random testing. The process begins by categorizing all test inputs into different groups based on their testing characteristics. Subsequently, PACE utilizes the MMD-critic algorithm (Kim et al. 2016) to identify prototype test inputs from each group. For test inputs that do not fit into any specific group, PACE employs adaptive random testing to select representative tests.

Our empirical study focuses on evaluating test selection approaches across four areas: 1) Misclassification Detection, 2) Accuracy Estimation, 3) Performance Enhancement guided by confidence-based approaches, and 4) Performance Enhancement guided by node importance-based approaches.

Regarding misclassification detection and performance enhancement, our emphasis has been on evaluating confidence-based approaches due to the following reasons: 1) prior studies (Ma et al. 2021) have demonstrated that confidence-based methods outperform coverage-based and surprise-based approaches in terms of effectiveness; 2) Confidence-based test selection methods are widely recognized as the most efficient and straightforward to implement (Weiss and Tonella 2022), with runtime of less than 1 second in most cases.

7.2 Deep Neural Network Testing

In addition to test selection, the field of DNN testing (Jahangirova and Tonella 2020; Zolfagharian et al. 2023; Aghababaeyan et al. 2023a) encompasses various noteworthy research directions, with one notable focus being the assessment of DNN adequacy. Pei et al. (2017) introduced the concept of “neuron coverage” as a metric for gauging the comprehensiveness of a test set in terms of its coverage of a DNN model’s logic. They employed this metric to propose a white-box testing framework tailored for DNNs. In a subsequent study, Ma et al. (2018) introduced DeepGauge, a set of coverage criteria designed to evaluate the adequacy of tests applied to DNNs. DeepGauge also placed significant emphasis on neuron coverage as a valuable indicator of test input effectiveness. Additionally, they introduced novel metrics with varying levels of granularity to distinguish between adversarial attacks and legitimate test data. Kim et al. (2019) contributed to this area by introducing “surprise adequacy” as a measure for testing DL models. This approach evaluates the effectiveness of a test input by quantifying the surprise it generates concerning the training set. Specifically, the surprise of a test input is determined by measuring the difference in the activation values of neurons when exposed to this new test input.

7.3 Empirical Study on Active Learning

Active learning has been a subject of extensive research in recent years, with empirical studies spanning various domains. Yu et al. (2018) conducted empirical research that focused on active learning techniques for literature reviews. In their work, they cataloged and refined three state-of-the-art active learning methods derived from evidence-based medicine and legal electronic discovery. This effort led to the development of a novel active learning approach designed for the analysis of large document corpora, incorporating and fine-tuning the most effective active learning algorithms. Chen et al. (2006) delved into the effectiveness of active learning in the context of word sense disambiguation. They examined the behavior of active learning by considering two fundamental data selection metrics: entropy and margin. Sassano (2002) explored the practical application of active learning with Support Vector Machines in a challenging natural language processing task, providing insights into its performance in complex scenarios. Furthermore, Weiss and Tonella (2022) conducted a comprehensive investigation into various active learning techniques, revealing that confidence-based methods delivered surprisingly strong results when applied to DNNs.

8 Conclusion

In this paper, we conducted a comprehensive empirical study to explore the limitations of test selection approaches in the context of GNNs. In total, we evaluated 22 test selection approaches on 7 graph datasets and 8 GNN models. The results reveal that test selection approaches do not exhibit the same level of effectiveness when applied to GNNs as to DNNs. More specifically, we draw the following conclusions: 1) Confidence-based test selection methods, which perform well in DNNs, do not yield the same level of effectiveness in detecting potentially misclassified tests for GNNs; 2) In the majority of cases, clustering-based test selection methods that utilize the model’s confidence vector perform better than random selection, but their improvements over random selection are slight; 3) In terms of performance enhancement, both confidence-based and clustering-based test selection methods show only slight effectiveness; 4) Node importance-based test selection methods are unsuitable for selecting retraining data to enhance GNN accuracy.