Graph-Based LSTM for Anti-money Laundering: Experimenting Temporal Graph Convolutional Network with Bitcoin Data

Elliptic data, one of the largest Bitcoin transaction graphs, has yielded promising results in many studies using classical supervised learning and graph convolutional network (GCN) models for anti-money laundering. Despite these results, only a few studies have considered the temporal information of this dataset, and their results were not very satisfactory. Moreover, very little existing literature applies active learning to this type of blockchain dataset. In this paper, we develop a classification model that combines long short-term memory (LSTM) with GCN, referred to as temporal-GCN, which classifies the illicit transactions of Elliptic data using the transactions' features only. Subsequently, we present an active learning framework applied to the large-scale Bitcoin transaction graph dataset, unlike previous studies on this dataset. Uncertainties for active learning are obtained using Monte-Carlo dropout (MC-dropout) and Monte-Carlo based adversarial attack (MC-AA), which are Bayesian approximations. Active learning frameworks with these methods are compared using various acquisition functions that have appeared in the literature. To the best of our knowledge, this is the first time the MC-AA method has been examined in the context of active learning. Our main finding is that the temporal-GCN model outperforms previous studies with the same experimental settings on the same dataset. Moreover, we evaluate the performance of the provided acquisition functions using MC-AA and MC-dropout and compare the results against the baseline random sampling model.


Introduction
Blockchain intelligence and forensics company CipherTrace has reported a global amount of US$4.5 billion of Bitcoin crime related to illicit services in 2019 [1]. Money launderers exploit the pseudonymity of the Bitcoin ledger to transform illegally obtained money from serious crimes into legitimate funds via the Bitcoin network. On the other hand, the Bitcoin blockchain has attracted intelligence companies and financial regulators, who transact on the blockchain in order to be aware of its risks, such as technical developments in and societal adoption of the cryptocurrency [2]. The rise of illegal services and the public availability of Bitcoin data have created a need for intelligent methods that exploit the transparency of the blockchain records [3]. Such methods can boost anti-money laundering (AML) in Bitcoin and help safeguard cryptocurrency ecosystems. In the past few years, Elliptic, a cryptocurrency intelligence company focusing on safeguarding cryptocurrency systems, has released a graph network of Bitcoin transactions, known as Elliptic data. This data has been a great support to the research and AML communities in developing machine learning methods. Elliptic data comprises a graph of Bitcoin transactions with handcrafted local features (associated with the transactions themselves) and aggregated features (associated with neighbouring transactions), together with partially labelled nodes. The labelled nodes denote licit transactions (e.g. miners) and illicit transactions (e.g. theft, scams). Previous researchers have applied classical supervised learning methods [3,4], graph convolutional network (GCN) [3,5], EvolveGCN for dynamic graphs [6], the signature vectors in blockchain transactions (SigTran) model [7] and uncertainty estimation with a multi-layer perceptron (MLP) model [8] to this dataset.
Despite the promising results achieved by these studies, the highest accuracy achieved using only the set of local features is about 97.4%, with an F1-score of 77.3%. Furthermore, only a few studies have considered the temporal information of this dataset. On the other hand, the Bitcoin blockchain is a large-scale dataset for which the labelling process is very hard and time-consuming. The active learning (AL) approach tackles this problem by querying the labels of the most informative data points, so that high performance is attained with fewer labelled examples. Using Elliptic data, only the work by Lorenz et al. [9] has presented an active learning solution, which has shown the capability of matching the performance of a fully supervised model by using 5% of the labelled data. However, that framework has been presented with classical supervised learning methods, which do not consider the graph topology or temporal sequence of Elliptic data. In this paper, we aim:
• To present a model that considers, in a novel way, the graph structure and temporal sequence of Elliptic data to predict illicit transactions that belong to illegal services in the Bitcoin blockchain network.
• To perform active learning on Bitcoin blockchain data in a way that mimics a real-life situation, since the Bitcoin blockchain is a massively growing technology and its data is time-consuming to label.
The presented classification model comprises long short-term memory (LSTM) and GCN models, and the overall model attains an accuracy of 97.7% and an F1-score of 80%, which outperform previous studies with the same experimental settings. On the other hand, the presented active learning framework requires an acquisition function that relies on the model's uncertainty to query the most informative data. In this paper, the model's uncertainty estimates are obtained using two comparable methods based on Bayesian approximations, namely Monte-Carlo dropout (MC-dropout) [10] and Monte-Carlo adversarial attack (MC-AA) [11]. We examine these two uncertainty methods due to their simplicity and efficiency; this is the first time the MC-AA method has been applied in the context of active learning. We use a variety of acquisition functions to test the performance of the active learning framework using Elliptic data. For each acquisition function, we evaluate the active learning performance obtained with MC-AA and with MC-dropout uncertainty estimates. We compare the performance of the presented active learning framework against random sampling acquisition as a baseline model. This paper is structured as follows: Section 2 describes the related work. Section 3 describes the uncertainty estimation methods used by the active learning framework. Section 4 provides the acquisition functions to be examined in the experiments. Section 5 provides the methods used to perform the classification task. Experiments are detailed in Sect. 6, followed by the results and discussions in Sect. 7. An ablation study of the proposed model is given in Sect. 8. Section 9 concludes the paper.

Overview of Related Work
With the appearance of illicit services in public blockchain systems and the rapidly increasing amount of blockchain data, intelligent methods have become necessary for AML regulation. Many studies have adopted the machine learning approach to detect illicit activities in public blockchains. Harlev et al. [2] have tested the performance of classical supervised learning methods to predict the type of unidentified entities in Bitcoin. Farrugia et al. [12] have applied an XGBoost classifier to detect fraudulent accounts using an Ethereum dataset. Weber et al. [3] have introduced Elliptic data, a large-scale graph-structured dataset of a Bitcoin transaction graph with partially labelled nodes, to predict licit and illicit Bitcoin transactions, and have shown that the random forest model outperforms the graph convolutional network (GCN) in classifying the licit and illicit transactions derived from the Bitcoin blockchain. Subsequently, the classification results using an ensemble learning model in [4] have revealed a significant success over other benchmark methods in classifying illicit transactions of Elliptic data. Also, Pareja et al. [6] have introduced EvolveGCN, which combines GCN with a recurrent neural network such as a Gated Recurrent Unit (GRU) or LSTM. This study has shown that EvolveGCN outperforms the GCN model used by Weber et al. [3] on the same dataset. Another work [5] has considered the neighbouring information of the Bitcoin transaction graph of Elliptic data using GCN accompanied by linear hidden layers. Without utilising any temporal information from this dataset, the latter reference has achieved an accuracy of 97.4%, outperforming the GCN-based models presented in [3,6].
Active learning, a subfield of machine learning, is a way to make the learning algorithm choose the data to be trained on [13]. Active learning mitigates the bottleneck of the manual labelling process, such that the learning model queries the labels of the most informative data. Since it is so expensive to obtain labels, active learning has witnessed a resurgence with the appearance of big data where large-scale datasets exist [14]. Lorenz et al. [9] have presented an active learning framework in an attempt to reduce the labelling process of the large-scale Elliptic data of Bitcoin. The presented active learning solution has shown its capability in matching the performance of a fully supervised model with only 5% of the labels. The authors have focused on querying strategies based on uncertainty sampling [13,15] and expected model change [13,16]. For instance, the used uncertainty sampling strategy is based on the predicted probabilities provided by the random forest in [9]. Yet, no study presents an active learning framework that utilises the recent advances in Bayesian methods on Bitcoin data. On the other hand, Gal et al. [17] have presented active learning frameworks on image data where the authors have combined the recent advances in Bayesian methods into the active learning framework. This study has performed MC-dropout to produce the model's uncertainty which is utilised by a given acquisition function to choose the most informative queries for labelling. Concisely, the authors in [18] have applied the entropy [19], mutual information [20], variation ratios [21], and mean standard deviation (Mean STD) [22,23] acquisition functions which are compared against the random acquisition.
In this study, we conduct experiments using a classification model that exploits the graph structure and the temporal sequence of Elliptic data derived from the Bitcoin blockchain. Motivated by the studies in [9,17], we build the active learning frameworks using the pool-based scenario [13], in which the classifier iteratively samples the most informative instances for labelling from an initially unlabelled pool. At each iteration, the classifier samples a batch of unlabelled data points according to their uncertainty estimates, which are obtained from Bayesian models and scored by the acquisition function.

Model Uncertainty: MC-Dropout Versus MC-AA
The two major types of uncertainty in a machine learning model are epistemic and aleatoric uncertainty [24]. Epistemic uncertainty, also known as model uncertainty [10], arises from the uncertainty in the parameters of the trained model. This uncertainty is reducible by training the model on enough data. Aleatoric uncertainty is tied to the noisy instances that lie on the decision boundary or in the overlapping region of the class distributions, and it is therefore irreducible. MC-dropout has gained popularity as a prominent method for producing the two types of uncertainty [10]. Although MC-dropout is easy to perform and efficient, it has failed, to some extent, to capture data points lying in the overlapping region of different classes where noisy instances reside [11]. The latter reference has provided an uncertainty method, called MC-AA, that is capable of assigning high uncertainty estimates to noisy instances. MC-AA mainly targets the instances that fall in the neighbourhood of a decision boundary. Although MC-dropout and MC-AA are both simple and promising methods, MC-AA has provided more reliable uncertainty estimates in [11]. In the light of these studies, we utilise both uncertainty methods as part of the active learning process.

MC-Dropout: Monte-Carlo Dropout
Dropout was initially introduced as a simple regularisation technique that reduces the overfitting of the model [25]. The work in [10] has introduced MC-dropout as a probabilistic approach based on Bayesian approximation to produce uncertainty estimates. MC-dropout uses dropout after every weight layer in a neural network. Uncertainty estimates are produced by keeping dropout active during the testing phase and performing multiple stochastic forward passes, from which an uncertainty measurement (e.g., mutual information) is computed.
Let $\hat{y}$ be the output for an input $x$ mapped by a neural network, trained on the set $D_{train}$, with $L$ layers and learnable weights $w$. Consider $y$ as the observed output associated with $x$. Then, we can express the predictive distribution as:

$$p(y \mid x, D_{train}) = \int p(y \mid x, w)\, p(w \mid D_{train})\, dw$$

where $p(y \mid x, w)$ is the model's likelihood and $p(w \mid D_{train})$ is the posterior over the weights.
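Since the posterior distribution is analytically intractable [10], the posterior is replaced by $q(w)$, an approximating variational distribution. $q(w)$ is obtained by minimising the Kullback-Leibler (KL) divergence so that it approximately matches $p(w \mid D_{train})$:

$$KL\big(q(w) \,\|\, p(w \mid D_{train})\big)$$

Hence, variational inference leads to an approximated predictive distribution:

$$q(y \mid x) = \int p(y \mid x, w)\, q(w)\, dw$$

The work in [10] has chosen $q(w)$ to be the distribution over the matrices whose columns are randomly set to zero for posterior approximation. Then, $q(w)$ can be defined as:

$$W_i = M_i \cdot \mathrm{diag}\big([z_{i,j}]_{j=1}^{K_i}\big), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_i), \quad i = 1, \dots, L, \; j = 1, \dots, K_{i-1}$$

where $z_{i,j}$ are Bernoulli random variables with probabilities $p_i$ and $M_i$ are matrices of variational parameters. Monte-Carlo samples with weights $w_t \sim q(w)$, $t = 1, \dots, T$, are obtained from $T$ stochastic forward passes with active dropout during the testing phase of the input data. Then, the predictive mean can be expressed as:

$$p_{MC}(y \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, w_t)$$

To obtain uncertainty, mutual information (MI) measures the information gain of the outputs derived from the Monte-Carlo samples over the predictions. Data points that reside near the decision boundary are more likely to acquire high mutual information, referring to [8]. We can express mutual information as follows, referring to [10]:

$$\hat{I}(\hat{y} \mid x, D_{train}) = \hat{H}(\hat{y} \mid x, D_{train}) + \sum_{c} \frac{1}{T} \sum_{t=1}^{T} p(\hat{y} = c \mid x, w_t) \log p(\hat{y} = c \mid x, w_t) \tag{5}$$

where $c$ is the class label, and

$$\hat{H}(\hat{y} \mid x, D_{train}) = -\sum_{c} p_{MC}(\hat{y} = c \mid x) \log p_{MC}(\hat{y} = c \mid x) \tag{6}$$

The MC-dropout method can be viewed as an ensemble of multiple decision functions derived from the multiple stochastic forward passes. Precisely, it is an ensemble of multiple perturbed decision boundaries. Although this method captures data points between different class distributions, a noisy point that falls in the wrong class cannot be captured by MC-dropout, since the method influences only the points with weak confidence. This is tackled by the MC-AA method described in the next part.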
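As an illustration of Eqs. 5 and 6, the following minimal sketch computes the predictive mean, the predictive entropy and the mutual information from a list of T class-probability vectors. The function names and the toy probability values are hypothetical and stand in for the softmax outputs of T stochastic forward passes; no neural network is involved.

```python
import math

def predictive_mean(samples):
    """Average T stochastic outputs (each a list of C class probabilities)."""
    T = len(samples)
    C = len(samples[0])
    return [sum(s[c] for s in samples) / T for c in range(C)]

def entropy(p, tiny=1e-12):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(pc * math.log(pc + tiny) for pc in p)

def mutual_information(samples):
    """Entropy of the predictive mean minus the mean entropy of the passes."""
    T = len(samples)
    expected_entropy = sum(entropy(s) for s in samples) / T
    return entropy(predictive_mean(samples)) - expected_entropy

# Hypothetical outputs of T = 4 forward passes for two inputs (binary task).
agreeing = [[0.9, 0.1]] * 4                      # passes agree
disagreeing = [[0.99, 0.01], [0.01, 0.99]] * 2   # confident but conflicting

print(mutual_information(agreeing))     # close to zero
print(mutual_information(disagreeing))  # close to log(2) minus a small term
```

The second input has the same predictive mean as a point predicted at 0.5 by every pass, but only the disagreement between passes raises the mutual information, which is the property the text relies on.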

MC-AA: Monte-Carlo Based Adversarial Attack
Adversarial attacks were initially introduced as crafted perturbations of the input that produce incorrect predictions [26], through which attackers compromise the integrity of the model. These attacks are categorised as white-box or black-box attacks. In the former, the attacker has access to the model's parameters, whereas in the latter the model is used as a black box. White-box attacks are designed by adding perturbations to the inputs in the direction of the decision boundary formed by the model. These guided perturbations follow the gradients of the loss function with respect to the input, such that the input is pushed towards a different class distribution. One of the methods used to compute the perturbations is known as FGSM (Fast Gradient Sign Method), which was proposed in [27] for attacking deep neural networks. This method is based on maximising a loss function $J(x, y)$ in a neural network model with respect to a given input $x$ and its label $y$. The aim of this method is to make the classifier perform as poorly as possible on the perturbed inputs. The perturbation of the input by FGSM can be formulated as follows:

$$x_{\varepsilon} = x + \varepsilon \cdot \mathrm{sign}\big(\nabla_x J(x, y)\big)$$

where $x_{\varepsilon}$ is the adversarial example, $\varepsilon$ is a small number and $\nabla_x$ is the gradient with respect to the input $x$. This method perturbs the given input in the opposite direction of the initial class, towards the decision boundary. MC-AA is based on the idea of FGSM and computes multiple perturbed versions of an input in a small range [11]. This leads to multiple outputs from which uncertainty can be obtained. MC-AA can be viewed as ensemble learning over multiple decisions derived from perturbed versions of an input that move back and forth in the direction of the decision boundary. Thus, any point falling on the decision boundary will reflect a high uncertainty. In MC-AA, the noisy labels are triggered to move in a small range, so that they are more likely to escape from their wrong class.
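Thus, the noisy labels will be assigned some uncertainty. Moreover, this will increase the number of correctly classified data points that become uncertain, which does not affect the model performance. More formally, consider a discrete interval $I$ that is evenly spaced by $\beta$ and symmetric about zero. It can be expressed as follows:

$$I = \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_T\}, \qquad \varepsilon_{t+1} - \varepsilon_t = \beta, \qquad \varepsilon_1 = -\varepsilon_T$$

where $\varepsilon_T = \varepsilon_{max}$ is the maximum value in the interval $I$, a tunable hyper-parameter used to perturb an input by FGSM. $T$ is a pre-chosen interval size, and it is also the number of ensembles to be performed via MC-AA. Consider a neural network with weights $w$ and function approximation $f : x \rightarrow \hat{y}$. Let $y$ be the associated observation of $x$. Since the perturbations by MC-AA over $x$ are applied on a very small range, we can use a Taylor expansion up to order 1 to make the following approximation:

$$f(x_{\varepsilon}) \approx f(x) + \varepsilon\, \nabla_x f(x)^{\top}\, \mathrm{sign}\big(\nabla_x J(x, y)\big)$$

To compute the predictive mean $p_{MC-AA}(y \mid x)$, we average the predictions of a given input as follows:

$$p_{MC-AA}(y \mid x) \approx \frac{1}{T} \sum_{\varepsilon \in I} f(x_{\varepsilon}) \approx f(x) + \frac{1}{T} \Big( \sum_{\varepsilon \in I} \varepsilon \Big) \nabla_x f(x)^{\top}\, \mathrm{sign}\big(\nabla_x J(x, y)\big)$$

Because $I$ is symmetric about zero, the values of $\varepsilon$ sum to zero, and this equation boils down to:

$$p_{MC-AA}(y \mid x) \approx f(x)$$

Hence, we obtain an unbiased predictive mean by MC-AA, whereas the individual perturbations can be used to compute mutual information as the predictive uncertainty. Similarly to Eq. 5, we estimate the uncertainty of $x$ using mutual information as follows:

$$\hat{I}(\hat{y} \mid x, D_{train}) = \hat{H}(\hat{y} \mid x, D_{train}) + \sum_{c} \frac{1}{T} \sum_{\varepsilon \in I} p(\hat{y} = c \mid x_{\varepsilon}, w) \log p(\hat{y} = c \mid x_{\varepsilon}, w) \tag{11}$$

where $c$ is the class label, and

$$\hat{H}(\hat{y} \mid x, D_{train}) = -\sum_{c} p_{MC-AA}(\hat{y} = c \mid x) \log p_{MC-AA}(\hat{y} = c \mid x) \tag{12}$$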
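The MC-AA procedure can be sketched on a toy model. The example below is only an illustration under stated assumptions: it uses a one-feature logistic model instead of a neural network, it takes the predicted class as the label when computing the FGSM direction, and the helper names (`epsilon_grid`, `fgsm_direction`, `mc_aa_samples`) are hypothetical. The grid spacing is derived from the two stated constraints (even spacing and symmetry about zero with maximum ε_max).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def epsilon_grid(eps_max, T):
    """T evenly spaced values in [-eps_max, eps_max], symmetric about zero."""
    beta = 2.0 * eps_max / (T - 1)
    return [-eps_max + t * beta for t in range(T)]

def fgsm_direction(x, w, b):
    """sign(dJ/dx) for binary cross-entropy on a logistic model.
    Assumption of this sketch: the predicted class is used as the label."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    y = 1.0 if p >= 0.5 else 0.0
    grad = [(p - y) * wi for wi in w]
    return [(g > 0) - (g < 0) for g in grad]

def mc_aa_samples(x, w, b, eps_max=0.1, T=10):
    """Class-probability vectors for the T perturbed copies of x."""
    direction = fgsm_direction(x, w, b)
    samples = []
    for eps in epsilon_grid(eps_max, T):
        x_eps = [xi + eps * di for xi, di in zip(x, direction)]
        p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x_eps)) + b)
        samples.append([1.0 - p1, p1])
    return samples

def mutual_information(samples, tiny=1e-12):
    """Entropy of the mean prediction minus the mean entropy of the samples."""
    T, C = len(samples), len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(C)]
    def H(p):
        return -sum(pc * math.log(pc + tiny) for pc in p)
    return H(mean) - sum(H(s) for s in samples) / T

w, b = [20.0], 0.0                      # hypothetical model, boundary at x = 0
mi_near = mutual_information(mc_aa_samples([0.01], w, b))
mi_far = mutual_information(mc_aa_samples([0.5], w, b))
print(mi_near > mi_far)
```

The point next to the decision boundary crosses it for part of the grid and therefore receives a much larger mutual information than the point far from the boundary.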

Acquisition Functions for Active Learning
Pool-based active learning is a prominent scenario [13,28] that assumes a set of labelled data available for initial training D train and a set of unlabelled pool D pool in a Bayesian model M with model parameters w ∼ p(w|D train ) and output predictions p(y|w, D train ) for y ∈ {0, 1} in binary classification tasks. Then, the Bayesian model M that is already trained on D train queries the labels-from the unlabelled set D pool -of an informative batch with size b by an oracle in order to obtain an acceptable performance with less training data.
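Consider an acquisition function $a(x, M)$ that measures the score of a batch of unlabelled data points $\{x_1, \dots, x_b\} \subseteq D_{pool}$. Let $\{x_1^{*}, \dots, x_b^{*}\}$ be the informative batch selected by the acquisition function, which can be expressed as follows [29]:

$$\{x_1^{*}, \dots, x_b^{*}\} = \underset{\{x_1, \dots, x_b\} \subseteq D_{pool}}{\arg\max}\; a(\{x_1, \dots, x_b\}, M)$$

In what follows, we describe various acquisition functions, which are detailed in [24].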

BALD: Bayesian Active Learning by Disagreement
Bayesian Active Learning by Disagreement (BALD) [20] is an acquisition method that utilises uncertainty estimates in the form of the mutual information between the model predictions and the model parameters. Hence, the learning algorithm queries the data points with the highest mutual information. The highest mutual information values arise when the individual Monte-Carlo samples make confident predictions that disagree on the class.
In this paper, we acquire a batch of size $b$ at each sampling iteration. Using BALD, this can be expressed as:

$$a_{BALD}(\{x_1, \dots, x_b\}, M) = \sum_{i=1}^{b} \hat{I}(\hat{y}_i \mid x_i, D_{train})$$

where $\hat{I}$ is derived from Eq. 5 for MC-dropout and Eq. 11 for MC-AA. Furthermore, the optimal batch consists of the $b$ highest-scoring data points, which removes the bottleneck of acquiring a single data point at each acquisition step.
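Selecting the b highest-scoring points, as described above, amounts to a top-b ranking of the pool by acquisition score. A minimal sketch follows; the function name `select_batch` and the score values are hypothetical.

```python
def select_batch(scores, b):
    """Return the indices of the b highest-scoring pool points.
    Ties are broken by the smaller index."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:b]

# Hypothetical mutual-information scores for six pool points.
pool_scores = [0.02, 0.61, 0.15, 0.64, 0.00, 0.33]
batch = select_batch(pool_scores, b=3)
print(batch)  # [3, 1, 5]
```

The same routine applies to any acquisition function in this section, since each of them assigns one score per data point.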

Entropy
This acquisition method computes the entropy using the uncertainty estimates from Eqs. 6 and 12. During the active learning process, we choose the batch with the maximum predictive entropy [19], which can be written as:

$$\hat{H}(\hat{y} \mid x, D_{train}) = -\sum_{c} p(\hat{y} = c \mid x, D_{train}) \log p(\hat{y} = c \mid x, D_{train})$$

The maximum entropy reflects a lack of confidence in the obtained predictions, whose probabilities are typically near 0.5 in the binary case.

Variation Ratios
Similarly, we choose the batch with the maximum variation ratios [21], where the variation ratio is expressed as:

$$V(x) = 1 - \max_{c}\; p(\hat{y} = c \mid x, D_{train})$$

The maximum variation ratios correspond to a lack of confidence in the samples' predictions.
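A short numerical sketch of the variation ratio is given below. It assumes that the predictive probability is the average of the Monte-Carlo samples; the function name and the probability values are hypothetical.

```python
def variation_ratio(samples):
    """1 - max_c of the predictive mean over T sampled probability vectors."""
    T = len(samples)
    C = len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(C)]
    return 1.0 - max(mean)

# Hypothetical Monte-Carlo outputs for a confident and an uncertain input.
confident = [[0.97, 0.03], [0.95, 0.05], [0.96, 0.04]]
uncertain = [[0.60, 0.40], [0.45, 0.55], [0.50, 0.50]]

print(variation_ratio(confident))   # small: the winning class dominates
print(variation_ratio(uncertain))   # near 0.5, the maximum for two classes
```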

Mean Standard Deviation
Likewise, we sample a batch that maximises the mean standard deviation (Mean STD) [22,23]. The predictive standard deviation can be computed as:

$$\sigma_c = \sqrt{\mathbb{E}\big[p(\hat{y} = c \mid x, w)^2\big] - \mathbb{E}\big[p(\hat{y} = c \mid x, w)\big]^2}$$

where $\mathbb{E}$ corresponds to the expected mean. The $\sigma_c$ measurement computes the standard deviation between the predictions obtained by the Monte-Carlo samples on every data point. Consequently, the mean standard deviation is averaged over all $C$ classes:

$$\sigma(x) = \frac{1}{C} \sum_{c} \sigma_c$$
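The Mean STD score can be sketched directly from its definition, with the expectations replaced by averages over the T Monte-Carlo samples. The function name and the toy inputs are hypothetical.

```python
import math

def mean_std(samples):
    """Average over classes of the standard deviation across T samples."""
    T = len(samples)
    C = len(samples[0])
    total = 0.0
    for c in range(C):
        first_moment = sum(s[c] for s in samples) / T
        second_moment = sum(s[c] ** 2 for s in samples) / T
        # Clamp at zero to guard against tiny negative rounding errors.
        variance = max(second_moment - first_moment ** 2, 0.0)
        total += math.sqrt(variance)
    return total / C

fluctuating = [[0.2, 0.8], [0.8, 0.2]]   # predictions swing between classes
stable = [[0.7, 0.3]] * 3                # identical predictions

print(mean_std(fluctuating))  # approximately 0.3
print(mean_std(stable))       # approximately 0
```

An input whose predictions fluctuate between passes scores high even though its predictive mean is the same as that of a stable 0.5 prediction, which is why this score reacts to fluctuation rather than to low confidence alone.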

Random Sampling: Baseline Model
This acquisition function uniformly draws data points from the unlabelled pool at random.

Methods
In this section, we provide a detailed description of Elliptic data. Then we demonstrate temporal-GCN which is the proposed classification model to classify the illicit transactions in this dataset.

Dataset Description
We use the Bitcoin dataset released by Elliptic, a company renowned for detecting illicit services in cryptocurrencies [3]. This dataset is formed of 49 directed acyclic graphs, each extracted over a specific period of time represented as time-step t, referring to Fig. 1. Each fully connected graph network incorporates nodes as transactions and edges as the flow of payments. In total, this dataset is formed of 203,769 partially labelled transactions, of which 21% are labelled as licit (e.g., wallet providers, miners) and 2% are labelled as illicit (e.g. scams, malware, Ponzi schemes, …). Each transaction node has 166 features, of which the first 94 are local features and the remaining ones are global (aggregated) features. Local features are derived from the transaction information on each node (e.g. time-step, number of input/output addresses, number of unique input/output addresses, …). Meanwhile, global features are extracted from the graph network structure between each node and its neighbourhood, using the information one hop backward/forward from each transaction.
In this study, we use the local features which count to 93 features (i.e. excluding time-step) without any graph-related features.
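The feature selection described above can be sketched as a simple slice. This sketch assumes that each node is given as a flat list of 166 values with the time-step stored first, followed by the other 93 local features and then the global features; that ordering and the helper name `split_features` are assumptions made for illustration.

```python
N_FEATURES = 166   # total features per transaction, as stated in the text
N_LOCAL = 94       # local features, including the time-step

def split_features(row):
    """Return (time_step, local_features_without_time_step) for one node.
    Assumes row[0] is the time-step (an assumption of this sketch)."""
    if len(row) != N_FEATURES:
        raise ValueError("expected %d features, got %d" % (N_FEATURES, len(row)))
    time_step = int(row[0])
    local = list(row[1:N_LOCAL])
    return time_step, local

# Hypothetical node: time-step 7 followed by 165 placeholder feature values.
row = [7] + [0.1] * 165
time_step, local = split_features(row)
print(time_step, len(local))  # 7 93
```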

Temporal Modelling
We refer to the presented model by temporal-GCN. This model is a combination of LSTM and GCN models which are detailed in what follows.

Long Short-Term Memory (LSTM)
LSTM was proposed in [30] as a special category of recurrent neural networks (RNNs) in order to prevent the vanishing gradient problem. LSTM has proven its efficacy in many general-purpose sequence modelling applications [31][32][33]. Let $G = (V, E)$ be a graph network of Bitcoin transactions with adjacency matrix $A \in \mathbb{R}^{n \times n}$ and degree matrix $D \in \mathbb{R}^{n \times n}$, where $V$ and $E$ are the sets of nodes (Bitcoin transactions) and edges (payment flows), respectively, with $|V| = n$ being the total number of nodes. Consider $x_t \in \mathbb{R}^{d_x}$ as the node feature vector with $d_x$-dimensional features, and the layer output $h_t \in [-1, 1]^{d_h}$ and cell state $c_t \in \mathbb{R}^{d_h}$ with $d_h$-dimensional embedding features.
Then, the fully connected LSTM layer, referring to [34], can be expressed in its standard form as:

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{14}$$

where $\odot$ is the Hadamard product, $\sigma(\cdot)$ is the sigmoid function and $\tanh(\cdot)$ is the hyperbolic tangent function. The remaining notations refer to LSTM layer parameters as follows: $i_t, f_t, o_t \in [0, 1]^{d_h}$ are the input, forget, and output gates, respectively. The weights $W$ and biases $b$ are the parameters of the LSTM model.

Topology Adaptive Graph Convolutional Network: TAGCN
In this paper, we use a graph learning algorithm called TAGCN, introduced in [35], which stems from the GCN model. Generally, GCNs are neural networks that are fed with graph-structured data, wherein the node features undergo a convolutional computation with a learnable kernel to induce new node embeddings. The kernel can be viewed as a filter of the graph signal (node), wherein the work in [36] suggested the localisation of kernel parameters using Chebyshev polynomials to approximate the graph spectra. Also, the study in [37] has introduced an efficient algorithm for node classification using first-order localised kernel approximations of the graph convolutions.
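The layer-wise propagation rule of the GCN in [37] can be expressed as:

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big)$$

where $\hat{A}$ is the normalisation of $A$ defined by:

$$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}, \qquad \tilde{A} = A + I$$

$\tilde{A}$ is the adjacency matrix of the graph $G$ with added self-loops and $\tilde{D}$ is its degree matrix. $\sigma$ denotes a typical activation function such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)}$ is the input node embedding matrix at the $l$-th layer. $W^{(l)}$ is a trainable weight matrix used to update the output embeddings $H^{(l+1)}$. On the other hand, the work in [35] has introduced TAGCN, which is based on GCN but uses fixed-size learnable filters that are adaptive to the topology of the graph to perform convolutions in the vertex domain. Consequently, TAGCN can be expressed as follows:

$$H^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^{k} H^{(l)} \Theta_k\Big) \tag{15}$$

where $\Theta_k$ is a learnable weight matrix at $k$ hops from the node of interest.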
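The TAGCN propagation rule can be illustrated on a tiny graph with plain Python lists. This is a sketch of the formula only, not of the implementation used in the experiments: the helper names are hypothetical, the weights are set by hand rather than learned, and the normalisation follows the self-loop form of $\hat{A}$ given in the text.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def normalise_adjacency(A):
    """D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def tagcn_layer(A_hat, H, thetas):
    """ReLU(sum_k A_hat^k H Theta_k); thetas[k] is the weight matrix of hop k."""
    n, d_out = len(H), len(thetas[0][0])
    out = [[0.0] * d_out for _ in range(n)]
    propagated = H                          # A_hat^0 H
    for theta in thetas:
        term = matmul(propagated, theta)
        out = [[o + t for o, t in zip(ro, rt)] for ro, rt in zip(out, term)]
        propagated = matmul(A_hat, propagated)   # move one hop further
    return [[max(0.0, v) for v in row] for row in out]

# Path graph 0 - 1 - 2 with two input features per node (hypothetical values).
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
H = [[1.0, -2.0], [0.5, 3.0], [-1.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

A_hat = normalise_adjacency(A)
# With only the 0-hop filter active, the layer reduces to ReLU(H).
print(tagcn_layer(A_hat, H, [identity, zeros, zeros, zeros]))
# With K = 3 and all filters active, 3-hop neighbourhoods contribute.
print(tagcn_layer(A_hat, H, [identity] * 4))
```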

Overall Model: Temporal-GCN
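Since TAGCN in [35] requires no approximations in comparison to GCN by [37], we exploit the performance of TAGCN on Bitcoin data. Motivated by the work in [38], which has suggested feeding LSTM inputs with GCN node embeddings, temporal-GCN first applies an LSTM that learns the temporal sequence, and its output is then forwarded through a non-linearity to two TAGCN layers that exploit the graph structure of the Bitcoin transaction graph. The temporal-GCN model can be expressed as:

$$H^{(1)} = \sigma\big(\mathrm{LSTM}(X)\big), \quad H^{(2)} = \sigma\big(\mathrm{TAGCN}(H^{(1)}, A)\big), \quad H^{(3)} = \mathrm{TAGCN}(H^{(2)}, A), \quad \hat{y} = \mathrm{Softmax}\big(H^{(3)}\big) \tag{16}$$

where $X$ is the node feature matrix, $A$ is the adjacency matrix and $\sigma$ is a non-linear activation function. LSTM(.) and TAGCN(.,.) are layers mapping a given input to an output from Eqs. 14 and 15, respectively. The Softmax function is defined as

$$\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}$$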

Experimental Setup
Using the Elliptic data, we split the data following the temporal split as in [3], so that the first 35 graphs (i.e., t = 1 → t = 35) form the train set and the remaining graphs are kept for testing. Since this dataset comprises partially labelled nodes, we only use the labelled nodes, which add up to 29,894/16,670 transactions in the train/test sets, respectively. To train temporal-GCN, we use the PyTorch Geometric (PyG) package [39], which is built on top of PyTorch (version 1.11.0) with CUDA (version 11.3) enabled, in the Python programming language. At each time-step t, we feed the relevant graph network with its node feature matrix (i.e., local features excluding the time-step) to the temporal-GCN model that is summarised in Eq. 16. The LSTM layer uses only the node features, without any graph-structural information, to provide the output matrix H (1) . This matrix is then forwarded to two TAGCN layers (in H (2) and H (3) ) that consider the graph-structured data of the top-K influential nodes in the graph, where K is kept at its default value of 3. Finally, a Softmax function provides the final class predictions as licit/illicit transactions. We use the NLLLoss function and the Adam optimiser to compute the loss and update the model's parameters. Using the same hyper-parameters as [5], the widths of the hidden layers are set to 50, the number of epochs is set to 50 and the learning rate is fixed at 0.001. Furthermore, we empirically set the dropout ratio to 0.7 to avoid overfitting. The classification results of the temporal-GCN model are provided in Table 1.
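The temporal split with labelled nodes only can be sketched as follows. The representation of a node as a `(time_step, label)` pair with `None` for an unknown label, and the function name `temporal_split`, are assumptions made for this illustration.

```python
def temporal_split(nodes, last_train_step=35):
    """Split labelled nodes into train/test indices by time-step.

    nodes: sequence of (time_step, label) pairs; label is None when unknown.
    Nodes with time_step <= last_train_step go to the train set.
    """
    train, test = [], []
    for idx, (time_step, label) in enumerate(nodes):
        if label is None:          # unlabelled nodes are not used
            continue
        if time_step <= last_train_step:
            train.append(idx)
        else:
            test.append(idx)
    return train, test

# Hypothetical nodes: (time_step, label), label 1 = illicit, 0 = licit.
nodes = [(1, 0), (35, 1), (36, 0), (49, None), (40, 1), (2, None)]
train_idx, test_idx = temporal_split(nodes)
print(train_idx, test_idx)  # [0, 1] [2, 4]
```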

Active Learning
Active learning can significantly alleviate the bottleneck of labelling, especially with this type of data. The main goal of active learning is to use less training data while achieving acceptable performance. Here, we initially treat the train set as a pool of unlabelled data D pool and we consider D train as an empty set to be extended after each querying step. First, the process starts by randomly querying the first batch for manual labelling, whose size is arbitrarily set to 2000 instances. Afterwards, we move the selected queries from D pool to D train and train the temporal-GCN model, which is evaluated using the test set at each iteration. Subsequently, the same process is repeated using the uncertainty sampling strategy. In practice, this could stop once an adequate accuracy is reached; in our experiments, however, we continue until all instances in D pool have been queried. The uncertainty sampling is performed using one of the acquisition functions described earlier. These acquisition functions take as input the uncertainty estimates produced by the uncertainty estimation methods. To imitate manual labelling, we attach the known labels to the queried instances. This experiment is performed using MC-dropout and MC-AA. We compare the performance of the active learning frameworks that use various acquisition functions on the two distinct uncertainty estimation methods. Regarding the hyper-parameters for producing uncertainty estimates, we arbitrarily set T = 50 stochastic forward passes on the unlabelled pool for MC-dropout. With MC-AA, we arbitrarily choose ε T = 0.1 and T = 10.
In addition, we perform random sampling as a baseline which uniformly queries data points at random from the pool. The process of performing active learning with the temporal-GCN model is schematised in Fig. 2. The required time to perform the active learning process in an end-to-end fashion using parallel processing, referring to Fig. 2, is provided in Table 2 using various acquisition functions under the given uncertainty methods.
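The querying process described in this section can be summarised by a generic pool-based loop. The sketch below is not the experimental code: the callables `oracle`, `fit` and `score` are hypothetical placeholders for manual labelling, model training and the acquisition function, and the toy usage replaces temporal-GCN with a one-dimensional threshold classifier.

```python
import random

def active_learning_loop(pool_ids, oracle, fit, score, batch_size, rng):
    """Pool-based loop: random first batch, then top-scoring batches.

    oracle(i)         -> label of pool item i (stands in for manual labelling)
    fit(labelled)     -> model trained on the dict {id: label}
    score(model, ids) -> dict {id: acquisition score}, higher = more informative
    """
    pool = list(pool_ids)
    labelled = {}
    history = []          # number of labelled points after each iteration
    model = None
    while pool:
        if model is None:
            batch = rng.sample(pool, min(batch_size, len(pool)))
        else:
            scores = score(model, pool)
            batch = sorted(pool, key=lambda i: -scores[i])[:batch_size]
        for i in batch:
            labelled[i] = oracle(i)
        chosen = set(batch)
        pool = [i for i in pool if i not in chosen]
        model = fit(labelled)
        history.append(len(labelled))
    return labelled, history

# Toy usage: 20 points on a line, true boundary at 0.5 (all hypothetical).
xs = [i / 19 for i in range(20)]

def oracle(i):
    return int(xs[i] > 0.5)

def fit(labelled):
    pos = [xs[i] for i, y in labelled.items() if y == 1]
    neg = [xs[i] for i, y in labelled.items() if y == 0]
    if not pos or not neg:
        return 0.5
    return (min(pos) + max(neg)) / 2      # threshold between the classes

def score(threshold, ids):
    # Points closest to the current threshold are treated as most uncertain.
    return {i: -abs(xs[i] - threshold) for i in ids}

labelled, history = active_learning_loop(
    range(20), oracle, fit, score, batch_size=5, rng=random.Random(0))
print(history)  # [5, 10, 15, 20]
```

In the setting of the paper, `score` would return one of the acquisition values computed from MC-dropout or MC-AA samples, and replacing it with random scores gives the random sampling baseline.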

Results and Discussion
We discuss the results of the temporal-GCN model in the light of the previous studies using the same dataset. Subsequently, we provide and discuss the results provided by various active learning frameworks. Then we apply a non-parametric statistical method to discuss the significant difference between MC-AA and MC-dropout in performing active learning in comparison to the random sampling strategy.
Table 2 note: The time is provided for each experiment that uses the corresponding acquisition function, which relies on a given uncertainty estimation method.

Performance of Temporal-GCN
Temporal-GCN has outperformed all previous studies on this dataset that use local features under the same train/test split settings. The presented model leverages the temporal sequence and the graph structure of the Bitcoin transaction graph, so that the classification model can detect illicit transactions with accuracy and F1-score equal to 97.77% and 80.60%, respectively. In previous studies, EvolveGCN has attained an accuracy of 96.8%. The latter model has exploited the dynamicity of the graph by performing LSTM on the weights of the GCN layer, which outperformed GCN and skip-GCN without any temporal information. Whereas

Evaluation of Active Learning Frameworks
Referring to Fig. 4, we plot the results of the active learning frameworks under various acquisition functions (BALD, Entropy, Mean STD, Variation Ratio), each of which utilises the MC-dropout and MC-AA uncertainty estimation methods. Moreover, we plot the performance of the baseline model using a random sampling strategy. In the first subplot, BALD performs significantly well under both MC-AA and MC-dropout uncertainty estimates, with active learning clearly outperforming the random sampling model. With the remaining acquisition functions, MC-dropout significantly outperforms MC-AA and the random sampling model. MC-AA combined with the entropy and variation ratio acquisition functions has not performed better than random sampling. Furthermore, the active learning framework using the BALD acquisition function in Fig. 4 is capable of matching the performance of a fully supervised model after using 20% of the queried data. This amount of queried data belongs to the second iteration. In our experiments, MC-AA has proven to be a viable uncertainty sampling strategy in an active learning approach with the BALD and Mean STD acquisition functions. This is reasonable since the latter two methods estimate the uncertainty based on severe fluctuations of the model's predictions on a given input, and MC-AA suits this type of uncertainty.
Referring to Table 2, the BALD acquisition has recorded the shortest time among the acquisition functions using MC-AA, where this framework has been processed in 28.07 minutes using parallel processing, whereas the longest time with MC-AA is recorded by the entropy and variation ratio. For MC-dropout, the shortest time is recorded by the Mean STD acquisition, which is 27.09 min, whereas the framework using variation ratio has recorded the longest time, which is 28.3 min. We also note that the frameworks using MC-AA require more time than the ones using the MC-dropout method. This is due to the adversarial attacks computed by MC-AA, which require additional time.
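To make the comparison of acquisition functions concrete, the following is a minimal sketch of how the four scores can be computed from T stochastic forward passes for one input. It assumes the commonly used definitions from the Bayesian active learning literature, which may differ in detail from the equations given earlier in the paper. The function names, the `pool` data and the query size are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def acquisition_scores(mc_probs):
    """Score one input from T stochastic passes (illustrative names).

    mc_probs: list of T class-probability vectors, e.g. softmax outputs
    of T MC-dropout passes or T MC-AA perturbed evaluations.
    """
    T, C = len(mc_probs), len(mc_probs[0])
    mean_p = [sum(p[c] for p in mc_probs) / T for c in range(C)]
    pred_entropy = entropy(mean_p)
    # BALD: entropy of the mean minus the mean of per-pass entropies.
    bald = pred_entropy - sum(entropy(p) for p in mc_probs) / T
    # Mean STD: per-class std across passes, averaged over classes.
    stds = []
    for c in range(C):
        second_moment = sum(p[c] ** 2 for p in mc_probs) / T
        stds.append(math.sqrt(max(second_moment - mean_p[c] ** 2, 0.0)))
    mean_std = sum(stds) / C
    # Variation ratio: share of passes disagreeing with the modal class.
    votes = Counter(max(range(C), key=lambda c: p[c]) for p in mc_probs)
    var_ratio = 1.0 - votes.most_common(1)[0][1] / T
    return {"entropy": pred_entropy, "bald": bald,
            "mean_std": mean_std, "variation_ratio": var_ratio}

def select_queries(pool_mc_probs, name, k):
    """Indices of the k pool points with the highest score `name`."""
    scores = [acquisition_scores(m)[name] for m in pool_mc_probs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical pool of three inputs, each with T = 3 passes over 2 classes.
pool = [
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],  # confident and stable
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # uncertain but stable
    [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]],  # predictions fluctuate
]
print(select_queries(pool, "bald", 1), select_queries(pool, "entropy", 1))
```

In this toy pool, BALD and Mean STD rank the fluctuating input highest, whereas entropy prefers the input whose predictions are uncertain but stable. This agrees with the observation above that BALD and Mean STD respond to fluctuations in the model's predictions.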

Wilcoxon Hypothesis Test
To show the statistical significance of the various acquisition functions that appeared in Fig. 4, we perform the non-parametric Wilcoxon signed-rank test [40]. It tests the null hypothesis that there is no difference between two paired samples, based on the differences between their scores. Given two paired samples $P = \{p_1, \dots, p_m\}$ and $Q = \{q_1, \dots, q_m\}$, the test is based on the absolute value of the difference between the samples. This can be expressed as:
$$|p_i - q_i|, \quad i = 1, \dots, m,$$
where m is the number of samples of each set. In summary, this test accounts for the statistical difference between the sets P and Q. The Wilcoxon test ranks these absolute differences and compares the resulting test statistic T with its distribution under the null hypothesis. To perform this test, we use the wilcoxon function in SciPy [41]. Referring to Table 3, the p values from the BALD acquisition function, which are lower than the significance level α, are statistically significant against the null hypothesis; that is, there is statistical evidence of a difference between each of MC-AA and MC-dropout and the random sampling acquisition. Moreover, MC-dropout against random sampling has shown statistical significance against the null hypothesis using the variation ratio and Mean STD acquisitions, where the p values are 0.071 and 0.0112, respectively. There is no evidence against the null hypothesis for the entropy acquisition, where the p values of 0.2552 and 0.309 are greater than α.
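As an illustration of the mechanics of the signed-rank test, the sketch below implements it without external dependencies. It is not the library routine used in the paper: it assumes a two-sided test with the normal approximation and no tie or continuity correction, so its p values can differ slightly from those of a statistical library. The paired scores are made up for the example and are not results from Table 3.

```python
import math

def wilcoxon_signed_rank(p, q):
    """Two-sided Wilcoxon signed-rank test for paired samples p and q.

    Returns (T, p_value) with T = min(W+, W-). The p value uses the
    normal approximation (an assumption of this sketch).
    """
    d = [a - b for a, b in zip(p, q) if a != b]  # drop zero differences
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d_i| in ascending order; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    t = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return t, p_value

# Hypothetical paired scores of two sampling strategies over 8 iterations
# (integers, e.g. accuracy in tenths of a percent, to avoid float ties).
strategy_a = [912, 934, 941, 950, 958, 962, 969, 973]
strategy_b = [905, 921, 925, 930, 935, 950, 951, 980]
t_stat, p_val = wilcoxon_signed_rank(strategy_a, strategy_b)
print(t_stat, p_val)  # compare p_val with the chosen significance level
```

A small T means that the differences fall almost entirely on one side of zero, which is what yields a small p value and evidence against the null hypothesis.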

Ablation Study
In this section, we present the ablation studies performed on the proposed temporal-GCN model. Referring to Table 4, we have studied the importance of using the LSTM and TAGCN layers. Model-0 corresponds to the model used in the experiments. Subsequently, we have replaced each of the given layers in turn with a linear layer (Model-1, Model-2, Model-3). We have also studied the case in which we remove one of the first two layers from the original model (Model-6, Model-7). Furthermore, we have shown the performance of the model using LSTM with linear layers only (Model-4). In Model-2, the replacement of the second layer with a linear layer has attained the highest accuracy in comparison to Model-0. The removal of LSTM has led to a drop in the model's performance in all cases, especially in Model-7, which reveals the lowest accuracy. On the other hand, using LSTM without the graph learning algorithms in Model-4 has recorded the second-lowest accuracy in this study.
Table 4 Each model number denotes an architecture obtained by changing layers into linear layers or removing one of the layers. The term 'None' in a cell corresponds to the model having that layer removed.
We have also tweaked the K hyper-parameter that appears in TAGCN, referring to Eq. 15. The original model uses, by default, K = 3, which means that neighbourhood information is aggregated up to 3 hops. We have then checked the performance for K ∈ {1, 2}, as provided in Table 5. The highest accuracy has been recorded for K = 3 and the second highest for K = 1. Surprisingly, the drop in accuracy is not consistent across the different K values. This might be due to the informative features derived from neighbouring nodes up to 1 hop and 3 hops.
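The role of K can be illustrated with a small sketch of a TAGCN-style layer, assuming the standard formulation in which the output is the sum over k = 0, …, K of Â^k X W_k, with Â the symmetrically normalised adjacency matrix. This is an assumption about the form of Eq. 15 and omits the bias and non-linearity. The helper names, the path graph and the unit weights are hypothetical; in a trained model the weights W_k are learned.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalise_adjacency(adj):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} (no self-loops)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def tagcn_layer(adj, x, weights):
    """Sum over k of A_hat^k X W_k, with K = len(weights) - 1."""
    a_hat = normalise_adjacency(adj)
    out = [[0.0] * len(weights[0][0]) for _ in range(len(x))]
    hop = x  # A_hat^0 X
    for w_k in weights:
        term = matmul(hop, w_k)
        out = [[o + t for o, t in zip(ro, rt)] for ro, rt in zip(out, term)]
        hop = matmul(a_hat, hop)  # move on to the next power of A_hat
    return out

# Path graph 0 - 1 - 2 - 3 with a single feature that is set only on node 0.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
x = [[1.0], [0.0], [0.0], [0.0]]

def unit_weights(K):
    """K + 1 weight matrices of shape 1x1, all ones (illustration only)."""
    return [[[1.0]] for _ in range(K + 1)]

for K in (1, 2, 3):
    print(K, tagcn_layer(path, x, unit_weights(K))[3][0])
```

In this example, node 3 receives information from node 0 only when K = 3, since the two nodes are three hops apart. Smaller values of K restrict each node to a correspondingly smaller neighbourhood.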

Conclusion
For anti-money laundering in Bitcoin, we have presented temporal-GCN, a combination of LSTM and GCN models, to detect illicit transactions in the Bitcoin transaction graph known as Elliptic data. We have also provided active learning using two promising methods for computing uncertainties, MC-dropout and MC-AA. For the active learning frameworks, we have studied various acquisition functions to query the labels from the pool of unlabelled data points. The main finding is that the proposed model significantly outperforms previous studies, with an accuracy of 97.77% under the same experimental settings. LSTM takes into consideration the temporal sequence of Bitcoin transaction graphs, whereas TAGCN considers the graph-structured data of the top-K influential nodes in the graph. Regarding active learning, we are able to achieve an acceptable performance by only considering 20% of the labelled data with the BALD acquisition function. Moreover, a non-parametric statistical method, the Wilcoxon test, is performed to test whether there is a difference between the uncertainty estimation methods used in the active learning frameworks under the same acquisition function. Furthermore, an ablation study is provided to highlight the effectiveness of the proposed temporal-GCN.
In future work, we plan to explore other active learning frameworks that utilise different acquisition functions. Furthermore, we seek to extend the temporal-GCN model to other graph-structured datasets for anti-money laundering in blockchain.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.