1 Introduction

While deep neural networks have recently achieved impressive results in various tasks, such networks require large amounts of training data to unlock their full potential. At the same time, data scarcity is an intrinsic characteristic of many applications, effectively limiting the performance of purely data-driven neural models. In particular, this article addresses the problem of highly imbalanced node classification on large graph datasets. This problem is inherent in many critical real-world applications, including spam detection in reviews (Rayana & Akoglu, 2015) and fraud detection in financial networks (Weber et al., 2019).

Fig. 1: An excerpt of a Bitcoin transaction graph. The nodes represent transactions and their properties. The edges represent the payment flow of Bitcoins. The rule vectors indicate the weights of class-specific patterns extracted by the REFUEL approach

Imbalanced node classification is particularly challenging for several reasons. First, the aim is to accurately predict the minority class nodes, whose number is substantially smaller than the number of examples in the other classes. In these settings, data-driven neural node classification models do not provide satisfactory results due to the shortage of training data. Second, the minority classes can be characterized by specific real-world patterns (e.g., money laundering schemes in financial transaction graphs). Whereas some of these patterns may be known to experts, exhaustive prior knowledge of such patterns is generally unavailable, and manual labeling is costly.

Recently, learning from imbalanced data has been tackled in the context of graph node classification (e.g., Wang et al., 2021; Zhao et al., 2021). For example, GraphSmote (GS) (Zhao et al., 2021) generates synthetic minority class nodes to balance the class distribution. However, such synthetic nodes change the graph structure, adding noise and potentially altering the class-specific patterns. The RECT method (Wang et al., 2021) optimizes the graph representation in a Graph Convolutional Network (GCN) model. However, RECT does not consider edge weights, leading to over-smoothing on large, highly connected graphs. As observed in our experiments, these approaches do not generalize well across datasets.

Figure 1 illustrates an excerpt of a Bitcoin transaction graph inspired by the Elliptic dataset for illicit node detection (Weber et al., 2019). Each node represents a Bitcoin transaction T and has attributes representing transaction properties, such as transaction participants, volume, timestamp, and fees. The edges represent the payment flow of Bitcoins. For example, an edge from the transaction \(T_1\) to \(T_2\) means that \(T_2\) receives a Bitcoin previously processed in \(T_1\). In this dataset, 2% of the nodes are labeled illicit. The aim is to detect the nodes representing illicit transactions in the highly imbalanced graph (i.e., most transactions are licit).

In this article, we present the novel REFUEL approach that tackles the problem of imbalanced graph node classification by exploiting the complementary strengths of symbolic and neural learning paradigms in a novel way. To better detect the minority class nodes in the highly imbalanced graph, we propose to enrich the original graph nodes with rule vectors. The rules represent class-specific patterns explicitly. Examples of such rules for the minority class in the Elliptic dataset include transactions with a particularly high volume or unusual participation patterns:

  • \(r_1(T)\): T.Volume > 7 BTC,

  • \(r_2(T)\): T.Number of senders \(\ne\) T.Number of receivers.

Our proposed REFUEL approach extracts such rules automatically. Rule extraction can help capture rare patterns that remain undetected by neural networks in data-scarce settings, such as imbalanced node classification. REFUEL relies on a novel rule extraction method that combines rule generation with a random forest, rule selection, and a neural rule refinement mechanism to extract rules that explicitly capture the class semantics. REFUEL adopts the resulting rule vectors to augment the graph node representation, which is then used in a neural classification architecture.

In summary, our contributions are as follows:

  • We propose REFUEL—a novel method based on rule generation and neural mechanisms to extract and refine symbolic knowledge representing characteristic patterns of the classes in highly imbalanced graphs.

  • We introduce a novel approach that efficiently combines symbolic learning with neural networks. We leverage the complementary strengths of these paradigms by integrating automatically derived symbolic knowledge into a Graph Attention Network (GAT)-based neural node embedding network.

  • We experimentally demonstrate that REFUEL substantially outperforms state-of-the-art imbalanced node classification baselines on three real-world datasets regarding precision and F1 score. REFUEL improves precision by at least 4 percentage points on minority classes comprising 1.5–2% of the nodes and generalizes across the datasets better than the baselines.

We make our code available to facilitate the reproducibility of results and encourage further research.

2 Problem statement

In this section, we provide relevant definitions and the problem statement addressed in this article. An overview of the frequently used notation is provided in Table 1. We represent real-world entities, properties, and relations as a directed node-attributed graph.

Definition 1

(Node-attributed Graph) Let \(G = (V, E, A, {\textbf{X}})\) be the node-attributed directed graph, where V denotes the set of n nodes, and E represents the set of edges. An edge \(e \in E\) is an ordered pair \(e = (u,v)\), where \(u,v \in V\). A denotes the set of d node attributes. \({\textbf{X}}\) is the \(n\times d\) node-attribute matrix, where \(\mathbf {x_v}=[x_{v_1},\ldots ,x_{v_d}]\) is the attribute vector for the node v, and \(x_{v \alpha }\) denotes the value of the attribute \(\alpha \in A\) for the node \(v\in V\).

An example of a node-attributed graph is illustrated in Fig. 1. In this example, nodes represent Bitcoin transactions, node attributes represent transaction properties, and the edges represent payment flows.

In this article, we tackle the problem of imbalanced binary node classification in node-attributed graphs. In particular, we aim to learn a classification function to detect the nodes belonging to the minority class, denoted \(C_1\).

Definition 2

(Imbalanced Binary Node Classification) Given a graph \(G = (V, E, A, {\textbf {X}})\), we aim to learn a classification function that predicts node labels: \(c:V \rightarrow L\). Here, \(L = \{0, 1\}\) denotes the set of labels, such that:

$$\begin{aligned} c(v)= {\left\{ \begin{array}{ll} 1, &{} \text {if }v \in V\text { is an instance of the minority class }C_1\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

In the context of imbalanced node classification, we expect the number of nodes in the class \(C_1\) to be substantially smaller than the number of other nodes: \(|\{v \in V: c(v)=1 \}| \ll |\{v \in V: c(v)=0\}|\).

Table 1 The symbols and notations employed in this article

3 The REFUEL approach

Fig. 2: Overview of the proposed REFUEL approach. RG is explained in Sect. 3.1, RS in Sect. 3.2, and RR in Sect. 3.3. The node enrichment, embedding, and classification steps are explained in Sect. 3.4

The core idea of the proposed REFUEL approach is to mitigate the data imbalance problem by jointly leveraging symbolic and neural learning methods. REFUEL enhances the identification of the minority class by enriching nodes with typical patterns specific to each class. As the patterns are not known in advance, the first part of the REFUEL approach is rule extraction (RE). REFUEL then enriches the graph’s nodes with the extracted rule vectors and adopts a neural approach to embed and classify the nodes. Whereas the set of extracted rules is fixed once the rule generation (RG) and rule selection (RS) steps are completed, rule importance is learned by the rule refinement (RR) step in the neural network. Figure 2 provides an overview of the REFUEL approach.

3.1 Rule generation with random forest (RG)

The graph nodes that belong to a class can follow specific recurrent patterns. For example, a pattern can represent a Bitcoin transaction with a particularly high volume. Such patterns can be represented as rules. Given a node \(v \in V\) in a graph, a rule can be applied to determine whether the node follows the specific pattern.

Definition 3

(Rule) A rule \(r: V \rightarrow \{0, 1\}\) is a function that maps a graph node \(v \in V\) to a binary value. If \(r(v) = 1\), the rule is satisfied; otherwise, \(r(v) = 0\). \(R^t\) denotes the set of rules, where t is the total number of rules.

For example, we can apply the rule \(r_1\) from Sect. 1 to the transaction \(T_1\) in Fig. 1. As the volume of \(T_1\) exceeds 7 BTC, the rule is satisfied, i.e., \(r_1(T_1) = 1\). Given an initial set of rules, \(R^t\), we augment each graph node with a rule vector. This rule vector encodes the information we gain from applying the rules to the node. As the rule relevance can vary, we assign a rule-specific weight \(w_{r_i}\) in the rule vector if the node satisfies the rule.

Definition 4

(Rule Vector) A rule vector for a node v is a vector \(\mathbf {b_v} = [b_{v1},\ldots , b_{vt}]\), such that

$$\begin{aligned} b_{vi}={\left\{ \begin{array}{ll} w_{r_i}, &{} \text {if }r_i(v) = 1\\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

where \(w_{r_i}\) is the weight of the rule \(r_i \in R^t\).

In Fig. 1, we can observe that the rule vector for the transaction \(T_1\) is \(\mathbf {b_{T_1}}= [0.8, 0.4]\), as \(r_1(T_1) = 1\) and \(r_2(T_1) = 1\).

To obtain the set of rules, we train a random forest (Pal, 2005) to distinguish between the classes of the nodes in the training data. Random forest is an ensemble-based classification method that combines multiple decision trees. The random forest is trained on the attribute vectors \(\mathbf {x_v}\) of the nodes in the training set.

We adopt the decision nodes created by the random forest algorithm as an initial set of rules \(R^t\) in REFUEL. A univariate decision node \(r_z \in R^t\) generated by the random forest for a numerical attribute \(\alpha\) implements a function \(r_z(v) = \mathbbm {1}(x_{v\alpha } > \theta _{z})\), where \(\mathbbm {1}(\cdot )\) is an indicator function, \(x_{v\alpha }\) is the value of the attribute \(\alpha\) of the node v, and \(\theta _{z}\) is the threshold of the decision node \(r_z\).

We have chosen an ensemble-based random forest algorithm because of its robustness. Generating multiple decision trees based on randomized training data selection helps increase the attribute coverage and recall of relevant rules.

Decision nodes at higher tree levels lead to better generalization. We restrict the tree depth with pre-pruning to avoid highly specialized, overfitted rules. The number of rules t depends on the number of trees in the random forest and the considered tree depth. The number of trees is experimentally set to 100 and the tree depth to five. The evaluation regarding the impact of the tree depth parameter is presented in Sect. 5.4.
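As an illustration, the following minimal sketch extracts such an initial rule set from a scikit-learn random forest; the variable names (e.g., `X_train`, `y_train`) are our own and not part of the original implementation:

```python
from sklearn.ensemble import RandomForestClassifier

# Train the forest on the node-attribute vectors x_v of the training nodes
# (X_train, y_train are illustrative placeholders).
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_train, y_train)

# Initial rule set R^t: one (attribute, threshold) pair per internal
# decision node across all trees.
rules = []
for estimator in forest.estimators_:
    tree = estimator.tree_
    for node in range(tree.node_count):
        if tree.children_left[node] != tree.children_right[node]:  # split node
            rules.append((tree.feature[node], tree.threshold[node]))
```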

3.2 Rule selection (RS)

We include the individual decision nodes of the trees in the rule set \(R^t\). We do not include complete root-to-leaf paths: single decision nodes alleviate sparsity and allow the model to learn rule relevance. In our experimental evaluation, we observed that single decision nodes outperform paths.

The initial set of rules \(R^t\) generated by the random forest algorithm can be large and contain multiple rules for the same attribute \(\alpha\) with different thresholds. A large rule set increases the computational complexity, as each rule has to be applied to each graph node. Therefore, we apply a rule selection strategy to obtain fewer, more general rules. To reduce the number of rules in \(R^t\), we aggregate the rules that refer to the same attribute by computing the median threshold for each attribute. The median threshold is robust to outliers and provides a representative threshold for the attribute.

In our experimental evaluation, we observed the superiority of the median over the mean. With this reduced set of rules \(R^{t'}\), \(t \gg t'\), we form a rule vector \(\mathbf {b_v}\) according to Definition 4 for each graph node v. Each dimension of a rule vector corresponds to a specific rule. We initialize the rule vectors with binary values, where one is assigned if a rule is satisfied and zero otherwise.
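Continuing the sketch above, the median-based aggregation and the construction of the binary rule vectors could look as follows (a sketch, not the exact implementation):

```python
from collections import defaultdict
import numpy as np

# Aggregate all thresholds per attribute and keep the median (rule set R^{t'}).
thresholds_by_attr = defaultdict(list)
for attr, theta in rules:  # `rules` from the RG sketch above
    thresholds_by_attr[attr].append(theta)
selected = {a: float(np.median(ths)) for a, ths in thresholds_by_attr.items()}

# Binary rule vector b_v for a node with attribute vector x (Definition 4,
# with an initial weight of one for satisfied rules).
def rule_vector(x):
    return np.array([1.0 if x[attr] > theta else 0.0
                     for attr, theta in sorted(selected.items())])
```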

3.3 Rule refinement (RR)

A rule vector \(\mathbf {b_v}\) contains the results of applying the rules \(R^{t'}\) selected in the previous step to a graph node v, where each dimension corresponds to a rule. The rules in \(R^{t'}\) can have varying relevance for imbalanced classification. We transform rule vectors into a refined dense representation to learn such relevance.

In REFUEL, we refine the rule vectors inside REFUEL’s neural network by passing them through a fully connected (FC) layer. The input to the FC layer is the rule vector \(\mathbf {b_v}\). The layer produces a new set of values \(\mathbf {rb_v}\) with dimension \(h = 64\) for each node. h corresponds to the number of output nodes of the FC layer and is chosen experimentally, as described in Sect. 5.4. The output is a dense numerical representation of the rule vector, referred to as the embedded rule vector, computed as:

$$\begin{aligned} rb_{vi'} = \sum _i w_{ii'} \cdot b_{vi} + bias_{i'}, \end{aligned}$$

where i indexes the elements of the input rule vector \(\mathbf {b_v}\), \(i'\) indexes the elements of the embedded rule vector, \(bias_{i'}\) is the bias term, and \(w_{ii'}\) is the learned weight matrix. The FC layer enables the model to focus on the most relevant parts of the rule vector and improves its ability to handle varying rule relevance.
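A minimal PyTorch sketch of this refinement layer, assuming the rule vectors are stacked into a matrix (the class name and defaults are illustrative):

```python
import torch.nn as nn

# A single FC layer maps the t'-dimensional rule vector b_v to the dense
# h=64-dimensional embedded rule vector rb_v; it is trained jointly with
# the downstream classification network.
class RuleRefinement(nn.Module):
    def __init__(self, num_rules: int, h: int = 64):
        super().__init__()
        self.fc = nn.Linear(num_rules, h)  # learns the weights w_{ii'} and bias

    def forward(self, b):   # b: (num_nodes, num_rules) rule vectors
        return self.fc(b)   # rb: (num_nodes, h) embedded rule vectors
```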

3.4 Node enrichment, embedding & classification

The graph node attributes can capture essential node properties and provide relevant information for the classification task. We enrich the node representation with the embedded rule vectors to enhance the node classification performance in imbalanced settings. Furthermore, we embed the node neighborhood to enhance the node representation within the graph context.

3.4.1 Node enrichment with embedded rule vectors

To utilize the knowledge captured in the embedded rule vector \(\mathbf {rb_v}\), we integrate this vector into the graph G. In particular, we concatenate \(\mathbf {rb_v}\) with the attribute vector \(\mathbf {x_v}\) of the respective node v to obtain an enriched node representation \(\mathbf {x{_v}'} = \mathbf {rb_v} \oplus \mathbf {x_v}\), where \(\oplus\) is the concatenation operator. We refer to the enriched graph as \(G' = (V, E, A, \mathbf {X'})\), with \(\mathbf {X'}\) being the enriched node-attribute matrix.
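Assuming PyTorch tensors `rb` and `x` that hold the embedded rule vectors and the attribute vectors for all nodes (illustrative names), the enrichment is a single concatenation:

```python
import torch

# rb: (num_nodes, h) embedded rule vectors; x: (num_nodes, d) node attributes.
# x'_v = rb_v ⊕ x_v, resulting shape: (num_nodes, h + d).
x_enriched = torch.cat([rb, x], dim=1)
```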

3.4.2 Node embedding

We embed the local graph structure in the node neighborhood to further enhance the information provided to the node classifier. Node embeddings aim to represent node attributes and neighborhood graph structures as a fixed-length latent vector representation.

Definition 5

(Node Embedding) Given the enriched graph \(G' = (V, E, A, {\textbf {X'}})\), let \(f: V \rightarrow {\mathbb {R}}^s\) be an embedding function that maps graph nodes to vectors in an s-dimensional latent vector space. For a node \(v\in V\), \(\mathbf {emb_v}=f(v)\) denotes the node embedding.

Recently, several node embedding techniques, such as node2vec (Grover & Leskovec, 2016) and the Graph Attention Network (GAT) (Velickovic et al., 2018), have been proposed. A state-of-the-art method to embed graph nodes is the GAT model proposed by Velickovic et al. (2018). Because of its self-attention mechanism, the method can weight the importance of a node’s neighbors during the learning process. We implemented the GAT with two multi-head GAT layers, as proposed in the original paper. The outputs of the heads in the first layer are concatenated; in the second layer, the outputs are averaged. To stabilize the learning process, we experimentally determined the number of heads in each layer to be four. The first layer has \(d+h\) input nodes, where h is the dimension of the embedded rule vector, and the output of each head was experimentally set to 32 values. The output of the second layer is the embedding \(\mathbf {emb_v}\) for the node v.
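A sketch of this two-layer GAT using DGL's `GATConv`; the embedding dimension of the second layer is our assumption, as it is not stated explicitly above:

```python
import torch.nn as nn
from dgl.nn import GATConv

# Two GAT layers with four heads each: head outputs are concatenated after
# the first layer and averaged after the second (Velickovic et al., 2018).
class NodeEmbedder(nn.Module):
    def __init__(self, in_dim, hidden=32, heads=4, emb_dim=32):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, num_heads=heads)
        self.gat2 = GATConv(hidden * heads, emb_dim, num_heads=heads)

    def forward(self, g, x):            # g: DGL graph, x: enriched features x'_v
        h = self.gat1(g, x).flatten(1)  # concatenate the four head outputs
        return self.gat2(g, h).mean(1)  # average the four head outputs -> emb_v
```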

3.4.3 Node classification

The node embedding \(\mathbf {emb_v}\) serves as input to a random forest classifier trained to classify the graph nodes. We utilize the default of \(n_{trees}=100\) trees.
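Assuming the trained embeddings are exported as NumPy arrays, this final step is a standard scikit-learn random forest (variable names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# emb_train/emb_test: GAT node embeddings; y_train: node labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(emb_train, y_train)
y_pred = clf.predict(emb_test)
```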

3.5 REFUEL example

This section presents an example of how REFUEL works, following the transaction \(T_2\) in Fig. 1 through REFUEL’s steps. In the RG step, we train the decision trees of a random forest classifier on the node attributes and extract all rules \(R^t\) from the decision nodes inside these trees. As we train 100 trees, the feature “volume” can have several rules with different thresholds, e.g., “> 10 BTC”, “> 7 BTC”, and “> 5 BTC” in the set of trees. We select the median threshold for each feature in the RS step; in our example, this is “> 7 BTC”. We apply this rule to the corresponding feature of each node. Therefore, in our example, the value in the rule vector \(\mathbf {b_v}\) for this rule will be zero because \(T_2\) has a volume of 5 BTC, which does not exceed 7 BTC. As we select the median threshold for each feature, we have at most d rules, which leads to a rule vector of size \(t'\), where \(t' \le d\). The rule vector in our example is \(\mathbf {b_{T_2}} = [0,0]\). In the RR step, we apply an FC layer to this vector to obtain each node’s embedded rule vector \(\mathbf {rb_v}\). With the embedded rule vector, we construct the enriched node representation \(\mathbf {x{_v}'} = \mathbf {rb_v} \oplus \mathbf {x_v}\). We train a GAT on this enriched node representation, which we optimize together with the FC layer for the rule vector on the classification task. In the last step of the REFUEL approach (Sect. 3.4.3), a random forest classifies the trained node embeddings \(\mathbf {emb_v}\) of size s.

3.6 REFUEL runtime complexity

In this section, we discuss the runtime complexity of the proposed REFUEL approach. First, REFUEL adopts a random forest to generate rules with a runtime complexity of

$$\begin{aligned} {\mathcal {O}}(RG) = {\mathcal {O}}(n_{trees} * \sqrt{|\mathbf {x_v}|} * dp * |V|), \end{aligned}$$

where \(n_{trees}\) is the number of trees, \(\sqrt{|\mathbf {x_v}|}\) is the number of attributes considered at each split of the random forest, dp is the maximal depth of the trees, and |V| is the number of nodes. In total, the forest contains at most \(n_{trees} * (2^{(dp + 1)}-1)\) nodes and thus at most as many candidate rules.
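For illustration, with the parameters used in our experiments (\(n_{trees}=100\), \(dp=5\)), this bound evaluates to \(100 * (2^{6}-1) = 6300\) rules, which is consistent with the 5,057 rules extracted on average for the Elliptic dataset (cf. Sect. 5.3).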

The next step is the rule selection. We calculate the median of the thresholds for each attribute with a complexity of \({\mathcal {O}}(|\mathbf {x_v}|)\), where the number of selected rules is at most the number of attributes \(|\mathbf {x_v}|\). We apply all selected rules to the attributes of all nodes to generate the rule vectors \(\mathbf {b_v}\) with a complexity of \({\mathcal {O}}(|V|*|\mathbf {b_v}|)\). Thus, the complexity of the rule selection step is

$$\begin{aligned} {\mathcal {O}}(RS) = {\mathcal {O}}(|\mathbf {x_v}|) + {\mathcal {O}}(|V|*|\mathbf {b_v}|). \end{aligned}$$

The rule vector \(\mathbf {b_v}\) is used in the rule refinement. Training the FC layer has a complexity of

$$\begin{aligned} {\mathcal {O}}(RR) = {\mathcal {O}}(|\mathbf {x_v}|*|\mathbf {rb_v}|), \end{aligned}$$

where \(|\mathbf {rb_v}|\) is the size of the FC output layer.

The concatenation of the rule vector and the node attributes is then used in the node embedding step. We use two GAT layers with a time complexity of

$$\begin{aligned} {\mathcal {O}}(GAT) = {\mathcal {O}}((|\mathbf {rb_v}|+|\mathbf {x_v}|)*d'*|V| + |E|*d'), \end{aligned}$$

where \(d'\) is the output dimension of the GAT layers and |E| is the number of edges.

4 Evaluation setup

This section describes the datasets, evaluation metrics, and baselines. We implemented the GAT using the Deep Graph Library (Wang et al., 2019). The GAT model in the node embedding step of REFUEL is trained for 1000 epochs with the Adam optimizer (Kingma & Ba, 2015) and a learning rate of 0.01. To implement the random forest classifiers, logistic regression, and multi-layer perceptron utilized as baselines, we used scikit-learn (Pedregosa et al., 2011). The experiments were conducted on a 64-bit machine with six Nvidia GPUs (NVIDIA A40, 7251 MHz, 48 GB).

4.1 Datasets

For evaluation, we adopt three real-world datasets.

The Elliptic Dataset, introduced by Weber et al. (2019), is based on the Bitcoin blockchain and includes over 200k nodes and 234k edges. On average, the nodes have an in-degree of 2.06 with a standard deviation of 5.07. Each node contains 166 numeric, anonymized transaction attributes, including transaction fee and output volume. An edge indicates that Bitcoins from one transaction are used in a subsequent transaction. 2% of the nodes are labeled as illicit, representing the minority class, while 21% are labeled as licit. For cross-validation, we split the dataset randomly into ten parts.

The Cora Dataset is a citation graph with nodes categorized into 70 content-based categories (Bojchevski & Günnemann, 2018). The dataset contains 19,793 nodes representing publications, each described by a bag-of-words vector of size 8,710, connected by 126,842 edges. On average, the nodes have an in-degree of 7.41 with a standard deviation of 8.79. We aim to analyze how our proposed method performs for different minority class sizes. In real fraud datasets, the minority class representing the fraud can include different fraud types, each with specific and distinct properties. By randomly picking classes from the Cora dataset and labeling them as minority classes, we obtain a structure similar to real-world fraud datasets. At the same time, this procedure allows us to construct minority classes of different sizes flexibly. We create subsets of the dataset with minority class sizes of 1.5–2%, >2–5%, >5–7.5%, >7.5–10%, >10–12%, >12–14%, and >14–16%. To create a minority class, we randomly add node categories until we reach the desired class size, as sketched below. We randomly select ten combinations for each size and perform cross-validation on the resulting splits.
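A sketch of this construction procedure (our own illustrative code; the exact sampling in our experiments may differ in details):

```python
import random
from collections import Counter

def build_minority_labels(categories, target_frac, seed=0):
    """Merge randomly picked Cora categories into one minority class until
    it reaches the desired share of all nodes; returns binary labels."""
    rng = random.Random(seed)
    counts = Counter(categories)
    order = list(counts)
    rng.shuffle(order)
    minority, covered = set(), 0
    for cat in order:
        minority.add(cat)
        covered += counts[cat]
        if covered / len(categories) >= target_frac:
            break
    return [1 if c in minority else 0 for c in categories]
```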

The FraudYelp Dataset was collected by Rayana and Akoglu (2015). The dataset contains legitimate and spam reviews. There are 45,954 reviews with 32 attributes. Edges connect reviews written by the same user; in total, the dataset contains 98,630 edges. On average, the nodes have an in-degree of 2.15 with a standard deviation of 4.08. 85% of the reviews are legitimate, and 15% are spam. For cross-validation, we split the dataset randomly into ten parts.

4.2 Evaluation metrics

To evaluate the performance of a classifier on an imbalanced node classification problem, we assess the precision (P), recall (R), and F1 score (F1) on the minority and majority classes, as well as the macro averages of these metrics over both classes. In most real-world applications of imbalanced node classification, such as fraud detection, the cost of a false positive in the minority class is higher than in the majority class, as each prediction that indicates a fraudulent transaction needs to be checked manually by an expert. Therefore, we consider the precision on the minority class and the macro averages to be the most relevant metrics in the context of imbalanced node classification. In addition, we report the F1 score, which reflects the trade-off between precision and recall. We apply 10-fold cross-validation and report the average of each metric over the folds.
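These per-class scores and macro averages can be computed, e.g., with scikit-learn (a sketch with illustrative variable names):

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision/recall/F1 (class 1 = minority, class 0 = majority) ...
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 0])
# ... and the macro averages over both classes.
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
```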

4.3 Baselines and REFUEL variations

We evaluate the node classification performance of REFUEL against the traditional machine learning methods applied to imbalanced graph node classification tasks and state-of-the-art methods. Furthermore, we assess the impact of a variation of REFUEL with a state-of-the-art rule-extraction method in the ablation study.

Random forest (RF) is an ensemble-based classification method (Pal, 2005). As parameters, we use 50 trees and consider a maximum of 50 features at each split.

Logistic Regression (LR) is a supervised classification method (Dreiseitl & Ohno-Machado, 2002). We apply L2 regularization with up to 100 iterations.

Multi-Layer Perceptron (MLP) is an artificial neural network (Hinton, 1989). We utilize 50 neurons in the hidden layer and train for 50 epochs with the Adam optimizer (Kingma & Ba, 2015).

Graph Convolutional Networks (GCN) are commonly adopted in node classification tasks as they can consider the graph structure (Alarab et al., 2020; Weber et al., 2019). The GCN adopted in this article consists of two convolutional layers and is optimized using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.001. For RF, LR, MLP, and GCN, we utilize the parameters that Weber et al. (2019) optimized for the Elliptic dataset.

Graph Attention Networks (GAT) extend GCNs by employing an attention layer to assign weights to neighboring nodes during information aggregation (Velickovic et al., 2018). The GAT adopted in this article consists of two multi-head GAT layers and is optimized using the Adam optimizer with a learning rate of 0.01.

RElaxed GCN NeTwork (RECT) is a state-of-the-art graph neural network recently proposed by Wang et al. (2021) to address imbalanced node classification. The loss function of RECT consists of two parts: a prediction loss in the semantic space and a graph structure preserving loss. We utilize the settings the authors proposed in the original paper: 100 training epochs using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0001.

GraphSmote (GS), proposed by Zhao et al. (2021), is a state-of-the-art method tackling imbalanced node classification by synthetic minority over-sampling in the latent space. GS generates synthetic minority nodes through interpolation in an embedding space produced by a GNN-based attribute extractor. We apply the implementation provided by GS’s authors.

SIRUS is a rule extraction method introduced by Bénard et al. (2021) to find a stable set of rules. SIRUS restricts the splits in the random forest to q-empirical quantiles with q=10 and selects the rules that occur more frequently than an experimentally chosen threshold of 0.001. The selected rules are filtered by excluding linear combinations of more frequent rules. To evaluate the SIRUS rule extraction, we replace the rule extraction of the proposed REFUEL approach with the SIRUS method in the ablation study. Note that SIRUS does not perform node classification itself and thus cannot serve as a stand-alone baseline.

5 Evaluation

The evaluation aims to assess the node classification performance of REFUEL, particularly its ability to precisely detect the nodes of the minority class. Furthermore, we analyze the effectiveness and runtime of REFUEL’s modules in an ablation study, conduct a rule and attribute analysis, and perform parameter tuning.

5.1 Node classification performance

In this section, we assess the classification performance of the proposed REFUEL approach against the baselines on the Elliptic, Cora, and FraudYelp datasets. We demonstrate that REFUEL outperforms the baselines on all datasets regarding precision and F1 score on the minority class.

Table 2 summarizes the evaluation results on the minority and majority class of the Elliptic dataset; Table 5 shows the macro averages. As we can observe, the proposed REFUEL approach outperforms all node classification baselines regarding all considered metrics on the minority class and the macro average, while providing competitive results for the majority class. REFUEL outperforms GAT, the best-performing baseline on this dataset, by 4 p.p. in precision on the minority class and by 2 p.p. in macro-average precision. A paired t-test confirms the statistical significance of this result at the 95% confidence level. Among the baselines, GAT performs best regarding precision and F1 score on the minority class as well as macro-average precision and F1 score. RECT achieves the best recall on the minority class. RF performs best for the majority class regarding precision and F1 score. Since the Elliptic dataset has a graph structure where the local neighborhood characterizes a node’s properties, and GAT incorporates weighted neighborhood information, GAT predicts the minority class nodes better than the machine learning methods that only consider node attributes. Since RECT incorporates class-semantic knowledge, it performs better than a plain GCN for the minority class, but GAT is better still because of its neighborhood weighting. The GS method performs worse on the Elliptic dataset than the other methods, probably because the dataset is less connected than the Cora dataset, for which GS was originally proposed, which can result in meaningless synthetic nodes. Overall, the proposed REFUEL approach builds upon the strengths of GAT and RF. Enriching the nodes with the automatically extracted rules enables REFUEL to outperform the baselines by a large margin on the minority class.

Table 2 Classification performance of baselines and REFUEL approach on the Elliptic dataset

The results on the Cora dataset with a minority class size of 1.5–2%, comparable to the minority class of the Elliptic dataset, are presented in Tables 3 and 5. The proposed REFUEL approach outperforms GS, the best-performing baseline on this dataset. REFUEL achieves a 14 p.p. improvement in precision on the minority class and 7 p.p. in macro-average precision, as well as an improvement in the F1 score. Among the baselines, GS, developed for this highly connected dataset, outperforms RECT, which does not consider edge weights and thus over-smooths on the large, highly connected Cora graph. All baselines achieve a higher recall than REFUEL but significantly lower minority class precision, leading to lower F1 scores.

Table 3 Classification performance of baselines and REFUEL approach on the 1.5–2.0% Cora dataset

For the Cora dataset, we additionally perform experiments with different minority class sizes to analyze the impact of the class size. We compare REFUEL to GS, the best-performing baseline on this dataset. Figure 3 plots the minority class precision of both methods. We observe that REFUEL outperforms GS for all minority class sizes between 2 and 16%. The difference tends to decrease slightly for larger classes.

Fig. 3: REFUEL and GS precision on the Cora dataset for different minority class sizes

To further evaluate REFUEL on larger minority classes, we apply REFUEL to the FraudYelp dataset. Tables 4 and 5 present the classification results. We can observe that REFUEL outperforms all baselines by at least 10 p.p. regarding precision on the minority class. Only RF achieves a higher recall than REFUEL; however, RF exhibits a lower F1 score, indicating a stronger imbalance between precision and recall. Among the baselines, RF achieves the highest precision, recall, and F1 score. The FraudYelp dataset possesses a sparse graph structure with expressive node attributes, which reduces the classification performance of GS and RECT and gives RF a comparative advantage.

Table 4 Classification performance of baselines and REFUEL approach on the FraudYelp dataset
Table 5 Macro average for the node classification performance

Overall, these results confirm that REFUEL outperforms the baselines across all datasets and class sizes regarding minority class precision and F1 score. REFUEL outperforms the baselines by a larger margin as the size of the minority class decreases. Thus, REFUEL is particularly suited for highly imbalanced classification problems.

5.2 Ablation study

In the ablation study, we compare the results of the REFUEL approach with ablation configurations that exclude the rule extraction (NoRE), the rule selection (NoRS), the rule refinement (NoRR), or both rule selection and rule refinement (NoRS & NoRR). Furthermore, we evaluate a variation of REFUEL with a SIRUS-based configuration, replacing the rule extraction method of REFUEL with the state-of-the-art SIRUS method. We use the Elliptic dataset for the ablation study. Table 6 presents the evaluation results. In this table, time denotes the number of seconds for training and testing on one data fold. The size of \(rb_v\) denotes the size of the rule vector in the node enrichment step.

Table 6 Node classification performance on the Elliptic dataset in the ablation study

As we can observe, REFUEL, including all proposed components, achieves the highest precision of 0.9 for the minority class. Furthermore, REFUEL achieves the best runtime results among all configurations that adopt rule extraction.

The rules extracted with SIRUS reduce the precision on the minority class by 3 p.p. compared to REFUEL. The SIRUS approach excludes many attributes appearing in less frequent rules from the final rule set. The remaining four rules fail to capture more complex patterns of the minority class. As the SIRUS rule generation is more complex than REFUEL’s, the model’s runtime with SIRUS is 2.25 times higher than REFUEL’s despite the small size of the rule vector.

The NoRE setting does not apply any rule extraction, leading to the lowest precision. This setting is associated with the lowest runtime, as expected. In the NoRS setting, the recall for the minority class increases further at the cost of lower precision and increased runtime.

The NoRS & NoRR setting increases the recall and leads to the best F1 score at the cost of lower precision on the minority class. In this setting, the rule vector is 79.3 times larger, leading to a 13.5 times higher runtime compared to REFUEL. We can observe that the RS and RR steps complement each other. Following the rule generation step (RG), more important attributes occur frequently in the rule set. The rule selection step (RS) substantially reduces the initial rule set, leading to a one-to-one correspondence between rules and attributes. However, due to the equal weighting of the rules, this reduction in the number of rules results in a lower performance than the NoRS & NoRR setting, where we directly use all generated rules. The rule refinement step (RR) aims to enhance the representation of rule significance. When all rules generated in the RG step are involved in the rule refinement, the weights generated in the RR step disproportionately favor the rules associated with a few highly important attributes. This can lead to low weights for the rules that characterize the minority classes. In REFUEL, we resolve this problem by consolidating the rules that refer to the same attribute in the RS step. Overall, we achieve the best results when jointly applying the proposed rule extraction and rule refinement steps.

5.3 Rule and attribute analysis

In this section, we further analyze and compare the rules extracted by the REFUEL approach and the SIRUS rule generation method, as well as the attributes occurring in these rules. For the Elliptic dataset, the rule generation step of REFUEL extracts, on average, 5,057 rules over 160 data attributes. On average, REFUEL selects the following attributes as split attributes more than 100 times in the random forest (note that due to the dataset anonymization, attribute names are represented as integers): 55 (147x), 53 (146x), 43 (125x), 49 (120x), 41 (119x), and 47 (112x). On average, REFUEL selects an attribute 30 times as a split attribute. Selecting the median threshold reduces the rules to 160 attributes on average in the rule selection step. In contrast, SIRUS requires a minimum frequency of the rules, so SIRUS only includes four rules with ten attributes. On average, SIRUS includes attribute 5 four times in each fold and attribute 47 three times.

Fig. 4: Box plots of the value distributions for the attributes most frequently selected by REFUEL to generate the rules for the Elliptic dataset (Y-axis on symlog scale)

Figure 4 presents the distributions of the values of the attributes most frequently selected for rules as box plots. As we can observe, for the attributes 43, 49, 41, and 47 (Fig. 4b), the minority class values lie predominantly near the median of the distribution. For the two most frequently selected attributes, 55 and 53, the minority class nodes contain more outliers than for the other attributes. As the attributes of the Elliptic dataset are anonymized, we cannot provide an intuitive attribute description here. Nevertheless, we expect that rules created on datasets with known attributes, together with the weighting of such rules, can provide intuitive insights into the distinctive patterns of the minority class.

5.4 Parameter tuning

Two essential REFUEL parameters are the tree depth in the rule generation and the number of output nodes of the FC layer in the RR step. We perform a grid search over these parameters to obtain the best minority class precision while keeping the number of rules small, as more rules increase the runtime of REFUEL.

Figure 5a illustrates the precision on the minority class as a function of the tree depth on the Elliptic dataset. The precision increases with increasing tree depth and peaks at a depth of five, while deeper trees only increase the number of rules. Therefore, we select a tree depth of five in our experiments.

Figure 5b displays the influence of the number of FC layer nodes h on the minority class precision. We observe a precision peak at 64 nodes and use this number in our experiments. With more nodes, the layer loses its capability to weight the rules appropriately; with fewer nodes, the method misses important rules.

Fig. 5: Parameter tuning for REFUEL’s rule extraction (RE)

6 Related work

In this section, we discuss related work in the areas of imbalanced graph node classification, the combination of rule-based and neural methods, and rule extraction.

6.1 Imbalanced graph node classification

Learning from imbalanced data is an inherently challenging problem in many real-world application domains and affects various machine learning algorithms (Krawczyk, 2016). Recently, learning from imbalanced data has been tackled in the context of graph node classification (e.g., Qu et al. (2021), Zhao et al. (2021), Wang et al. (2021)).

GraphSmote (Zhao et al., 2021) and ImGAGN (Qu et al., 2021) are data-level methods that generate synthetic minority nodes to balance the class distribution. ImGAGN is a GAN-based method that learns the nodes’ attribute distribution and the network’s topological structure. GraphSmote generates synthetic minority nodes through interpolation in an embedding space produced by a GNN-based attribute extractor. We adopt GraphSmote as a baseline in our evaluation. In contrast, REFUEL does not generate synthetic nodes but extracts rules representing semantic class patterns and adopts these rules to enrich the existing nodes.

The RECT method proposed by Wang et al. (2021) is an algorithm-level method. The RECT model is a graph neural network that uses GCN layers to explore the graph structure and adds an embedding optimized for class-semantic information. RECT does not consider edge weights, which leads to over-smoothing on large, highly connected graphs such as the graph in the Cora dataset. We experimentally demonstrate that REFUEL outperforms RECT on real-world imbalanced node classification.

6.2 Combining rule-based and neural methods

Despite the recent success of deep neural networks (DNNs) in many application areas, DNNs have notable limitations. In particular, DNNs require large amounts of training data and computational capacity, cannot directly utilize symbolic domain knowledge, and lack explainability. In contrast, symbolic approaches such as decision trees (Sharma et al., 2016) learn interpretable models but may not fully capture complex patterns (Ribeiro et al., 2018). Therefore, combining symbolic approaches and neural networks has recently attracted research attention. One prominent application in this context is explainability research. For example, Ribeiro et al. (2018) adopt if-then rules in post hoc local explanations to describe the decision boundaries of a DNN. Sendi et al. (2019) use rules extracted from a DNN to enhance the prediction performance in an ensemble method. In contrast, we aim to introduce domain knowledge into neural classification models to assist imbalanced node classification. REFUEL extracts such knowledge, represented as rules, and refines the rules with a neural mechanism trained on the downstream classification task.

6.3 Rule extraction from decision trees

Random forest is an ensemble-based classification method that combines multiple decision trees. A decision tree can be written as a set of if-then rules, e.g., using the C4.5 Rules algorithm (Quinlan, 1993). Pruning methods can be applied to reduce the set of rules. Recent examples of rule extraction from decision trees include rule pruning to obtain transparent rules for medical applications (e.g., Boruah et al., 2022). Decision trees have also been adopted to learn simpler models imitating a black-box model (e.g., Craven & Shavlik, 1995). Bologna (2021) applies a pruning strategy to the decision tree rules such that the result with the reduced rule set stays as close as possible to the original decision tree. Bénard et al. (2021) proposed SIRUS, a state-of-the-art method that extracts all rules from a random forest and filters them based on frequency and similarity. In contrast, REFUEL extracts rules to increase the coverage of the relevant attributes and refines the rule representation with a novel neural mechanism trained directly on the downstream node classification task.

7 Conclusion

In this article, we proposed REFUEL—a novel approach for highly imbalanced node classification problems in graphs. REFUEL combines the power of symbolic and neural methods in a novel neural rule-extraction architecture. REFUEL conducts rule extraction and refinement by utilizing the random forest algorithm and an FC layer and augments the graph nodes with the extracted rule vectors. Then, REFUEL adopts a graph attention network for the node embedding and classification. Our evaluation confirms the effectiveness of REFUEL for the imbalanced node classification task on three real-world datasets. REFUEL outperforms the baselines by at least 4 percentage points in precision on minority classes comprising 1.5–2% of the nodes.