Introduction

Blockchain, first introduced by Satoshi Nakamoto in his seminal paper,Footnote 1 is a distributed ledger technology that can maintain a growing number of data records without a centralized, trusted third party. Its public, decentralized, and tamper-proof approach, first applied to solve the double-spending problem, has been extended to a wide range of other use cases, from supply chain tracking to secure voting. Ethereum [1], launched in 2015, is the first platform to implement smart contracts, and is nowadays one of the most popular blockchain-based platforms for transacting and exchanging value, with a throughput of 14.8 transactions per second and almost 700,000 daily active addresses [2].

With the recent boom in cryptoassets, the topic of anomaly detection in Ethereum [1] has received considerable attention. For example, in Ethereum, the unexpected appearance of particular subgraphs may indicate newly emerging malware or pump-and-dump trading [3]. Anomaly detection in blockchain transaction networks is an emerging research area in the cryptocurrency community [4].

The existing methods for anomaly detection on Ethereum transaction networks (networks where nodes denote Ethereum transaction addresses and edges denote transactions between addresses) can be roughly divided into two categories [4]:

  • Traditional machine learning with manually extracted features: these methods first extract transaction features by hand (e.g., transaction value) and then apply traditional machine learning classifiers. However, such feature engineering requires expert knowledge, and the classification results rely mainly on the manually extracted features, which ignore the structural information of the transaction network and thus yield suboptimal results. Moreover, these methods cannot automatically extract transaction network features.

  • Node representation learning: these methods adopt random walk-based node representation learning [5, 6] or graph neural network (GNN) [4] techniques to automatically learn deep features of the transaction network, producing an embedding for each node that is then used in a classification task to separate anomalous nodes from normal ones. However, most of them ignore the timestamp information or the transaction flow direction.

These methods achieve strong results overall. However, several problems remain unsolved, leaving room for further improvement:

  1. Multi-token transactions An address (account) may be involved in multiple token transaction networks. For example, money laundering tends to move funds across multiple cryptocurrency ledgers [7].

  2. Temporal transaction information Existing methods either ignore the edge timestamp information or consider only the last transaction record, without fully modeling the temporal information in the transaction network.

  3. Growing transaction data The transaction network gains many nodes every day; e.g., the largest number of new nodes in one day exceeds 10,000 in the BNB (Binance Coin) token [8]Footnote 2 transaction network. When one year of transactions and multi-token transactions are considered, the network becomes so large that it is difficult to analyze.

  4. Edge features The transaction network we crawl via an Ethereum client with Ethereum-ETL is a weighted directed graph: each edge carries transaction value and transaction flow direction information. The traditional graph neural network technique [9] is unsuitable for weighted directed networks, because its adjacency matrix is assumed symmetric, and it is not easy to define a symmetric, real-valued Laplacian for a directed graph.

To address the above challenges, we propose a Multi-layer Temporal Transaction Anomaly Detection (MT\(^2\)AD) model for the Ethereum network based on graph neural networks, and transform anomaly detection into a graph classification task. The model contains three modules:

  1. Extraction of token transaction network initial features. This includes token subgraph sampling and edge feature modeling, to address challenges 3 and 4.

  2. Multi-layer token transaction network snapshot construction. This includes extracting network snapshots from each token subgraph and combining multiple token snapshots at the same timestamp, to address challenges 1 and 2.

  3. Anomaly detection with graph representation learning. This transforms anomaly detection into a graph classification task.

We obtain transaction information for multiple tokens via an Ethereum client with Ethereum-ETL,Footnote 3 and model each token's transactions as a temporal weighted directed graph: nodes represent transaction addresses, edges represent transactions between two addresses, the edge direction denotes the transaction flow from one address to another, the edge weight denotes the transaction value, and each edge carries a timestamp.

First, the extraction of token transaction network initial features includes subgraph sampling and edge feature modeling. The subgraph sampling component [10] samples nodes from each token transaction network. Because most daily token networks have fewer than 100 nodes [11], subgraph sampling reduces the size of the transaction network to be processed while having minimal impact on results. We extract a fixed-size node set according to the p most active edges in each transaction network to obtain a subgraph. Since the subgraph is a weighted directed graph, each edge carries transaction value and direction information, both important features of a transaction network. To reduce information loss when learning graph representations with a graph neural network encoder, we transform the transaction value and direction information on each edge into node attributes.

Second, the multi-layer token transaction network snapshot construction module extracts temporal information and models cross-cryptocurrency trading patterns. We first transform each token transaction subgraph into a series of network snapshots according to edge timestamps and a time window \(\Delta t\). Then, token network snapshots sharing the same timestamp are combined into a new combined graph, so that cross-cryptocurrency trading patterns can be captured by the graph neural network.

Third, in the anomaly detection with graph representation learning module, a graph neural network with an attention mechanism [12] encodes each node of each combined graph as an embedding. This captures not only a node's structural context and the varying influence of its neighbors, but also the information shared among different token transaction networks, since all layers use the same graph encoder. Mean pooling is then applied to the node embeddings of each combined graph to obtain the entire graph-level representation (i.e., a vector) for graph classification (normal or abnormal).

Extensive experiments are carried out on three real-world transaction network datasets to demonstrate the effectiveness of our proposed method.

In summary, the key contributions of this paper are:

  • We introduce a Multi-layer Temporal Transaction Anomaly Detection (MT\(^2\)AD) model for the Ethereum network based on graph neural networks. Compared with existing anomaly detection methods for blockchain transaction networks, MT\(^2\)AD considers cross-cryptocurrency trading patterns.

  • Anomaly detection in multi-token transaction networks is naturally transformed into a graph classification task: each combined graph is classified (normal or abnormal) using its graph-level embedding.

  • Extensive experiments on three real-world transaction datasets, including two Ethereum multi-token transaction networks and a Ripple multi-blockchain network, demonstrate the effectiveness of our proposed model in comparison with competing approaches.

The remainder of this paper is organized as follows. "Related works" summarizes research related to ours. "Problem definition" introduces the problem statement, and "Model" describes our method. "Experiments" presents our experiments, and "Conclusion" concludes the paper.

Related works

Node representation learning

Node representation learning [13,14,15,16,17,18], also called node embedding, graph embedding, or graph representation learning, is designed to learn a low-dimensional dense vector (embedding) for every node in a graph. It maps each node into a low-dimensional vector space that captures node similarity, network structure, and/or other attributes. Inspired by the great success of deep neural networks in computer vision, Kipf et al. [9] extended the convolution operation from image data to graph-structured data and proposed Graph Convolutional Networks (GCNs). GCNs learn graph structure and node attribute information simultaneously by iteratively aggregating neighbors' information into the current node, i.e., the message passing mechanism. Building on GCNs, GAT [12] (Graph Attention Networks) uses an attention mechanism to learn the importance of different neighbors, and GraphSAGE [19] performs inductive representation learning on large graphs by sampling fixed-size local node neighborhoods and aggregating their features. Since then, many graph neural network algorithms have been proposed; representative methods include Graphormer [20], Auto-GNAS [21], CS-TGN [22] and DeepRank-GNN [23].

Anomaly detection in blockchain transaction network

Trans2vec [6] detects phishing scams on the Ethereum transaction network using network embedding. It first constructs the Ethereum transaction network as a weighted directed network; then, building on node2vec [13], it proposes a random walk-based network embedding method that takes timestamps and edge transaction values into account when sampling node sequences; finally, it uses a one-class support vector machine to classify accounts as normal or phishing. Liang Chen [5] proposed a phishing scam detection method for the Ethereum transaction network based on a graph autoencoder. It first uses random walks to sample a fixed-size subgraph for a randomly selected node, and then uses a GCN to extract the features of each subgraph; finally, the GCN output and the attribute matrix are concatenated for phishing account classification with LightGBM.Footnote 4 Yijun [24] proposed an attributed ego-graph embedding framework for phishing detection on the Ethereum transaction network: it extracts an ego-graph for each labeled account, learns a representation of each ego-graph with a Skipgram model [25], and applies a decision tree to the obtained embeddings for detection. TTAGN [4] is a temporal transaction aggregation graph network representation learning method for Ethereum phishing scam detection. It models the temporal relations among historical transaction records between nodes to construct edge representations of the Ethereum transaction network; these edge representations are then fused with topological interaction relations through a graph neural network for phishing address detection. TSGN [26] uses transaction subgraph networks to identify Ethereum phishing accounts, framing identification as a graph classification task. TSGN first constructs a set of transaction subgraphs for each phishing and normal address, each subgraph carrying a label (phishing or normal); it then learns a graph embedding for each transaction subgraph and uses the embeddings to train a classifier that predicts each subgraph's label. However, these graph representation learning-based anomaly detection methods all consider only a single transaction network, while a growing number of cryptocurrency criminals use cross-cryptocurrency trades to hide their identity [7]. Ofori-Boateng [11] demonstrated that considering multi-token transaction networks can improve anomaly detection accuracy: in [11], a stacked persistence diagram (SPD) method was proposed to perform topological anomaly detection on dynamic multi-layer blockchain networks. However, SPD does not take edge features into consideration, such as transaction amounts between accounts or node degrees, which are important for classifying accounts as normal or anomalous; moreover, it cannot automatically learn deep features of transaction networks with deep learning techniques. Readers can refer to the survey [27] on graph anomaly detection with deep learning.

Graph classification using graph neural network

Given a set of graphs \(\{{g_1},{g_2}, \ldots ,{g_n}\}\) of which only some have labels, graph classification predicts the labels of unseen graphs. Graph classification methods can be divided into graph kernel algorithms [28, 29] and neural network techniques [30]. Graph kernel methods [28, 29] define a distance to measure the similarity between two graphs and classify graphs using this metric; examples include the random walk kernel, shortest path kernel, Weisfeiler-Lehman subtree kernel, and deep graph kernel. However, graph kernel algorithms need handcrafted features (random walks, shortest paths, etc.) and hence generalize poorly. In this paper, we focus on neural network techniques for graph classification. The classical algorithm is Graph2vec [30], the first neural embedding framework to learn data-driven distributed representations of arbitrarily sized graphs in an unsupervised manner. Graph2vec brings the Skipgram model [25] of word2vec to graphs, treating a graph as a document and the rooted subgraphs around its nodes as words. Building on Graph2vec [30], GL2vec [31] uses the line graphs (edge-to-vertex dual graphs) of input graphs to handle edge labels and capture more structural information. Other neural network-based graph classification algorithms apply multiple feature transformations to the input graphs via graph convolutions and then apply a pooling operation to reduce the graph's scale; this process can be repeated several times, and finally the entire graph-level representation (i.e., a vector) is obtained for classification. Representative algorithms are DiffPool [32] and SAGPool [33]. GAU-Nets [34] is a graph neural network model with strong processing capabilities for graph structures; it was proposed for image classification and has been tested on MS-COCO and other datasets, where it outperformed other traditional graph neural network models. Graph U-Nets [35] is a graph neural network model that applies an encoder-decoder architecture to graph data for node classification and graph classification tasks. DGCNNII [18] is an end-to-end model, Deep Graph Convolutional Neural Network II, that can perform graph classification with up to 32 layers and extract multiscale node features. Readers can refer to the survey [36] on graph classification algorithms.

Problem definition

Definition 1

(Dynamic graph) A dynamic (or temporal) graph can be represented as \(G_d = \{ {G_{t_1}}, {G_{t_2}}, \ldots , {G_{t_T}} \}\), where \(G_{t_t} = (V_{t_t}, E_{t_t})\) is the network snapshot at timestamp t, \(E_{t_t}\) is the edge set with \(\vert E_{t_t} \vert \ge 1\), and \(V_{t_t}\) is the node set consisting of all vertices incident to the edges in \(E_{t_t}\). The maximum timestamp is T. Note that not all nodes of \(G_d\) are known at timestamp t, as new nodes may emerge at any timestamp \(t'>t\).

Definition 2

(Multi-layer graph) A multi-layer graph with L layers can be represented as \(G_m = \{ {G^{l_1}}, {G^{l_2}}, \ldots , {G^{l_L}} \}\), where \({G^l} = ({V^l},{E^l})\) is the l-th layer, and \(\vert V^l\vert \) (or \(n^l\)) and \(\vert E^l \vert \) denote its numbers of nodes and edges, respectively.

Definition 3

(Attributed multi-layer graph) When the nodes of a multi-layer graph carry attribute information, \(G_m\) is called an attributed multi-layer graph. The node attribute matrix of layer \({G^{l}}\) is denoted by \(X^{l} \in {\mathrm{{\mathcal{R}}}^{n^{l} \times F}}\), where F is the node attribute dimension and \({n^{l}}\) is the number of nodes in \({G^{l}}\).

Fig. 1

The framework of MT\(^2\)AD. It includes three modules: (a) the extraction of token transaction network initial features module samples a fixed number of nodes according to the p most active edges in each token transaction network to obtain a subgraph; at the same time, the edge features, i.e., the transaction value and transaction flow direction information, are transformed into node attributes. (b) The multi-layer token transaction network snapshot construction module first transforms each token subgraph into a series of snapshot networks according to the edge timestamps and the time window \(\Delta t\), and then combines the token network snapshots with the same timestamp t into a new combined graph \( {G_{{t_t}}^{{m}}} = \{{ {G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}} } \} \). (c) The anomaly detection with graph representation learning module encodes each \({G_{{t_t}}^{{m}}}, t \in \{ 1,2, \ldots ,T\} \) as a vector for further graph classification

Definition 4

(Dynamic attributed multi-layer graph) A dynamic attributed multi-layer graph with T timestamps and L token transaction networks can be denoted as \(G_{dm} = \{ \{{ {G_{{t_1}}^{{l_1}}}, {G_{{t_1}}^{{l_2}}}, \ldots , {G_{{t_1}}^{{l_L}}} } \}, \ldots , \{{ {G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}} } \}, \ldots , \{{ G_{{t_T}}^{{l_1}}, G_{{t_T}}^{{l_2}}, \ldots , G_{{t_T}}^{{l_L}} } \}\}\), where \( {G_{{t_t}}^{{m}}} = \{{ {G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}} } \} \) is the multi-layer graph at timestamp t. All nodes in \(G_{dm}\) carry node attributes of the same dimension F.

Definition 5

(Graph classification with representation learning) The purpose of graph classification is to predict the class of a graph. In this paper, we consider the anomaly detection in dynamic attributed multi-layer token transaction network as a graph classification task.

We learn a mapping function f that encodes the n nodes of an attributed multi-layer graph \({G_{{t_t}}^{{m}}}\) at timestamp t into a d-dimensional dense vector space, represented as a matrix \(H \in {\mathcal{R}^{n \times d}}\ (0 < d \le n)\), \(H = \{h_1, h_2, \ldots , h_n\} \). Then, a pooling operation (mean pooling, max pooling, etc.), known as the readout function [36], is applied to H to generate a single vector; for mean pooling, \(\mathcal{R}(H) = \sigma (\frac{1}{n}\sum \nolimits _{i = 1}^n {{h_i}} )\), where \(\sigma \) is the sigmoid function. \(\mathcal{R}(H)\) is the graph-level representation of \({G_{{t_t}}^{{m}}}\) and contains the information of the entire graph. There are T multi-layer token graph snapshots, so we obtain T graph-level representation vectors. A classifier is trained on these T graph-level embeddings to classify each corresponding combined graph as normal or abnormal.

Model

Overall framework

The proposed MT\(^2\)AD, a Multi-layer Temporal Transaction Anomaly Detection method for Ethereum networks based on GNNs, includes three modules: extraction of token transaction network initial features, multi-layer token transaction network snapshot construction, and anomaly detection with graph representation learning. The overall framework is shown in Fig. 1.

Extraction of token transaction network initial feature

We obtain the ERC-20 token transaction records via an Ethereum client with Ethereum-ETL. According to the specification, ERC-20 is the Ethereum standard for fungible tokens [8], i.e., the interface a smart contract can implement to exchange such tokens. In this paper, the Binance Coin (BNB), Tether (USDT)Footnote 5 and Chainlink (LNK)Footnote 6 ERC-20 Ethereum-based token transaction records are extracted. We construct the transaction records of each token as a transaction network, where nodes represent addresses and edges represent transactions between pairs of addresses. Note that each token transaction network is a temporal weighted directed network: the weight represents the transaction amount from the source address to the destination address, and the timestamp (an integer here) on each edge represents the time the transaction was executed. Given the large size of the original token transaction network, the initial feature extraction module first extracts a subgraph (subgraph sampling) and then models the edge features as node attributes for subsequent graph convolution operations [9].

Token subgraph sampling

The original transaction network is so large that it cannot be processed on a common GPU with deep learning techniques. Taking the BNB token transaction network as an example, the number of new nodes per day reaches around 10,000, as shown in Fig. 2. If one year of BNB transactions is taken into consideration, the network becomes very large. Therefore, to keep computation time reasonable, the first step is to sample the original token transaction network to obtain a subgraph that retains its prominent features. A common and simple sampling method is the random walk, which has been used in related work [26]; however, because of its inherent randomness, we instead adopt the maximum weight subgraph sampling method [10] to obtain a small transaction subgraph. We restrict the size of the transaction subgraph by extracting the p most active edges. This sampling has minimal impact on results: even for the most traded tokens (Tronix and Bat), the top 150 nodes in daily networks account for 75% and 80% of all edges [11]. Subgraph sampling thus not only reduces computation time and GPU requirements, but also retains the important features as much as possible [11].
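The following Python sketch illustrates this sampling step with networkx; the `value` edge attribute and the ranking of "most active" edges by transaction value are our assumptions, since the exact sampling procedure is defined in [10].

```python
import networkx as nx

def sample_top_p_subgraph(g: nx.DiGraph, p: int) -> nx.DiGraph:
    """Keep the p most active (highest-value) edges and the nodes they touch.

    Assumes each edge carries a 'value' attribute holding the transaction amount.
    """
    # Rank edges by transaction value, descending, and keep the top p.
    top_edges = sorted(
        g.edges(data=True),
        key=lambda e: e[2].get("value", 0.0),
        reverse=True,
    )[:p]
    # Build the subgraph induced by those edges (nodes are added implicitly).
    sub = nx.DiGraph()
    sub.add_edges_from((u, v, d) for u, v, d in top_edges)
    return sub
```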

Fig. 2

The number of new nodes every day for the BNB token transaction network

Modeling edge’s feature

Each edge in a transaction network carries transaction value and direction information, which are basic and important features. However, the traditional graph convolutional network technique [9] is unsuitable for weighted directed graphs. Transforming the transaction values and direction information into node attributes can, to some extent, reduce the information loss when learning graph representations with graph convolution operations. In this article, we use the node attribute matrix X to model the edges' transaction values and direction information, described as follows:

  • Out-degree For a given node, the out-degree is the number of outgoing edges [37]. In a transaction network, it represents the number of transactions the current node sends to other nodes.

  • In-degree For a given node, the in-degree is the number of incoming edges [37]. In a transaction network, it represents the number of transactions other nodes send to the current node.

  • Degree For a given node, the degree is the number of one-hop neighbors [37]. In a transaction network, it represents the total number of edges (transactions) incident to the current node. The degree reflects the activity or importance of a node in a graph.

  • Normalized transaction amount The transaction amount represents the edge weight between a pair of addresses. To simplify calculation and enable better comparison, we normalize the weight (transaction amount) to [0,1] by dividing by the maximum over all node transactions.

Fig. 3

Anomaly detection with graph representation learning

Therefore, the edge features are transformed into a node attribute matrix \(X \in {\mathrm{{\mathcal{R}}}^{n \times 4}}\), where n is the number of nodes and each node has four attribute dimensions. A minimal sketch of this transformation is given below.
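This sketch assumes each edge carries a `value` attribute and reads "normalized transaction amount" as each node's total incident amount divided by the global maximum, which is one plausible interpretation of the description above:

```python
import numpy as np
import networkx as nx

def edge_features_to_node_attributes(g: nx.DiGraph) -> np.ndarray:
    """Build the n x 4 node attribute matrix X described above:
    out-degree, in-degree, degree, and normalized transaction amount."""
    nodes = list(g.nodes())
    # Total transaction amount incident to each node (incoming + outgoing).
    amount = {
        n: sum(d.get("value", 0.0) for _, _, d in g.in_edges(n, data=True))
         + sum(d.get("value", 0.0) for _, _, d in g.out_edges(n, data=True))
        for n in nodes
    }
    max_amount = max(amount.values()) or 1.0  # guard against division by zero
    X = np.array(
        [[g.out_degree(n),            # transactions sent by the node
          g.in_degree(n),             # transactions received by the node
          g.degree(n),                # total incident transactions
          amount[n] / max_amount]     # amount normalized to [0, 1]
         for n in nodes],
        dtype=np.float32,
    )
    return X
```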

Multi-layer token transaction network snapshot construction

In “Modeling edge’s feature”, the transaction values and transaction flows between linked nodes were transformed into node attributes. In this section, we construct token network snapshots according to the edge timestamps of a given token transaction network. This involves two steps. The first step extracts token network snapshots from each token transaction network according to edge timestamps. Second, at a given timestamp t, the token network snapshots \({G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}}\) are combined into a new combined graph (i.e., \( {G_{{t_t}}^{{m}}} = \{{ {G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}} } \} \)) for downstream graph representation learning, as shown in Fig. 1b.

Token network snapshot extraction

Given the l-th dynamic token transaction network \(G_d \), we transform it into a series of network snapshots according to edge timestamps. The obtained snapshots can be represented as \( {G_{{d}}^{{l_l}}} = \{{ {G_{{t_1}}^{{l_l}}}, {G_{{t_2}}^{{l_l}}}, \ldots , {G_{{t_T}}^{{l_l}}} } \} \), where consecutive snapshots satisfy \(t^{\prime } = t + \Delta t\) and \(\Delta t\) is the time window; that is, \(t_2 = t_1 + \Delta t\). \(\Delta t\) can be set to different values according to the requirements of different analysis tasks; in this paper, \(\Delta t\) is set to 1 day (24 h). For the different token transaction networks (i.e., layers \(l_1,l_2, \ldots ,l_L\)), we obtain the corresponding snapshot series \( {G_{{d}}^{{l_1}}} = \{{ {G_{{t_1}}^{{l_1}}}, {G_{{t_2}}^{{l_1}}}, \ldots , {G_{{t_T}}^{{l_1}}} } \}, {G_{{d}}^{{l_2}}} = \{{ {G_{{t_1}}^{{l_2}}}, {G_{{t_2}}^{{l_2}}}, \ldots , {G_{{t_T}}^{{l_2}}} } \}, \ldots , {G_{{d}}^{{l_L}}} = \{{ {G_{{t_1}}^{{l_L}}}, {G_{{t_2}}^{{l_L}}}, \ldots , {G_{{t_T}}^{{l_L}}} } \} \).
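A minimal sketch of this snapshot extraction, assuming integer Unix timestamps on edges and the 1-day window used in the paper; the attribute names are illustrative:

```python
from collections import defaultdict
import networkx as nx

def extract_snapshots(g: nx.MultiDiGraph, t_start: int, delta_t: int = 86400):
    """Split a temporal token network into snapshots of width delta_t.

    Assumes each edge carries an integer 'timestamp' attribute;
    delta_t = 86400 s corresponds to the 1-day window used here.
    """
    buckets = defaultdict(nx.MultiDiGraph)
    for u, v, d in g.edges(data=True):
        # Map the edge timestamp to its snapshot index t_1, t_2, ...
        idx = (d["timestamp"] - t_start) // delta_t
        buckets[idx].add_edge(u, v, **d)
    # Return the snapshots ordered by time.
    return [buckets[i] for i in sorted(buckets)]
```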

Multiple token snapshot combination

Analyzing cross-cryptocurrency transactions [7] is critical for identifying illicit activities on blockchains. Existing research [7] has shown that actors' behavior patterns are largely determined by the roles they play as money moves across the ledgers of different cryptocurrencies; therefore, there are common behavior patterns across different token transaction networks. However, "Token network snapshot extraction" only yields a series of network snapshots for each dynamic token transaction network separately. To take advantage of the information in different token transaction networks and capture their common features with graph representation learning, we combine the token transaction networks at a given timestamp \({t_t}\) into one graph \( {G_{{t_t}}^{{m}}} = \{{ {G_{{t_t}}^{{l_1}}}, {G_{{t_t}}^{{l_2}}}, \ldots , {G_{{t_t}}^{{l_L}}} } \} \), i.e., the combined transaction graph built from the snapshots of the \({l_1},{l_2}, \ldots , {l_L}\) token transaction networks at timestamp \({t_t}\). In this way, we obtain T combined transaction graphs \({G_{{t_1}}^{{m}}}, {G_{{t_2}}^{{m}}}, \ldots , {G_{{t_T}}^{{m}}}\) for graph classification.
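The combination is specified only at the level of Fig. 1b, so the sketch below shows one plausible realization: a node-aligned union in which the same address appearing in several token layers becomes a single node and each edge is tagged with its source layer:

```python
import networkx as nx

def combine_snapshots(layer_snapshots: list[nx.MultiDiGraph]) -> nx.MultiDiGraph:
    """Merge the L token-layer snapshots at one timestamp into a combined graph.

    Nodes (addresses) shared across token layers are identified with each
    other; each edge is tagged with the token layer it came from.
    """
    combined = nx.MultiDiGraph()
    for layer_id, snap in enumerate(layer_snapshots):
        combined.add_nodes_from(snap.nodes(data=True))
        for u, v, d in snap.edges(data=True):
            combined.add_edge(u, v, layer=layer_id, **d)
    return combined
```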

Anomaly detection with graph representation learning

In this section, we introduce our graph encoder, which learns node embeddings of each combined transaction graph (i.e., \({G_{{t_t}}^{{m}}}, {t_t} \in \{ {t_1},{t_2}, \ldots ,{t_T}\}\)); a graph pooling operation then produces the entire graph-level embedding, and a graph classifier predicts the label of each combined transaction graph. The labels are generated according to the blockchain daily events from Wikipedia,Footnote 7 the same labels used in [11]. Each combined transaction graph is labeled as either anomalous or normal. We thus transform anomaly detection into a graph classification task: we learn the graph-level embedding of each combined transaction graph and use these embeddings to train a graph classifier that predicts each graph's label. The multiple token transaction network anomaly detection with graph representation learning is shown in Fig. 3; it includes a graph convolution layer, a pooling layer, and a graph classification layer.

Graph information encode

Graph convolution layer To perform anomaly detection on the combined graph with representation learning, we first obtain the embedding of each node in \({G_{{t_t}}^{{m}}}\). A pooling operation is then applied to all node embeddings in \({G_{{t_t}}^{{m}}}\) to obtain the overall graph-level embedding, a vector retaining the overall graph feature information used for graph classification. In token transaction networks, a node usually transacts with multiple nodes at the same time, and interacting node pairs tend to have similar characteristics; for example, if a node participates in black-market transactions, nodes transacting directly with it may also be illicit. Moreover, the nodes in the original transaction network carry only structural features, and the node attributes transformed from the edge features in "Modeling edge's feature" are sparse. If only a node's own structural features and attributes were considered in the graph encoder, the resulting node embedding would be unsatisfactory. To solve this problem, we exploit neighborhood information through aggregation operations that integrate information from a node's neighbors into the node itself. However, different neighbors have different effects on the current node, so we use the multi-head attention mechanism [12] to capture these differences.

For each node in a given transaction network, we first learn the effect weights between the node and its neighbors; a weighted aggregation is then performed according to the obtained weights. For a given node \({n_i}\) with one-hop neighborhood \({N({n_i})}\) and \({n_j} \in {N({n_i})}\), the effect weights are learned from the features of \({n_i}\) and its neighbors as follows:

$$\begin{aligned} {e_{ij}} = \sigma (a_w^T([W \cdot {h_i}]\Vert [W \cdot {h_j}])) \end{aligned}$$
(1)

where \({h_i}\) and \({h_j}\) are the node features of \({n_i}\) and \({n_j}\), respectively, W is a learnable parameter matrix, \(a_w\) is the learnable attention parameter vector, \(\cdot ^T\) denotes transposition, \(\Vert \) is the concatenation operation, and \(\sigma \) is the activation function.

Then, the softmax function is used to normalize the effect weights to [0, 1], as in Eq. 2.

$$\begin{aligned} {a_{ij}} = \frac{{\exp ({e_{ij}})}}{{\sum \nolimits _{{n_k} \in N({n_i})} {\exp ({e_{ik}})} }} \end{aligned}$$
(2)

Finally, the node embedding of \({n_i}\) can be generated through the aggregation operation with multi-head attention [12] in Eq. 3. The schematic diagram of multi-head attention is shown in Fig. 4.

$$\begin{aligned} h_i^{\prime } = \mathop {\Vert }\nolimits _{k = 1}^K \sigma \left( {\sum \limits _{{n_j} \in N({n_i})} {a_{ij}^k{W^k}{h_j}} } \right) \end{aligned}$$
(3)

where \({\Vert }\) represents the concatenation operation, \({a_{ij}^{k}}\) are the normalized attention coefficients from the k-th attention head (\({{a^{k}}}\)), \({W^k}\) is the corresponding linear transformation matrix, and \({h_j}\) is the embedding of neighbor \({n_j}\) of node \({n_i}\).
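A minimal PyTorch sketch of Eqs. (1)-(3), using a dense adjacency matrix for readability; the activation choices and initialization are our assumptions, and a production implementation would use sparse operations (e.g., a library GAT layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionLayer(nn.Module):
    """Dense sketch of the attention aggregation in Eqs. (1)-(3)."""

    def __init__(self, in_dim: int, out_dim: int, heads: int = 2):
        super().__init__()
        self.heads = heads
        self.W = nn.Parameter(torch.empty(heads, in_dim, out_dim))   # W^k
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))       # a_w^k
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n, in_dim) node features; adj: (n, n) boolean adjacency,
        # self-loops included so every node also attends to itself.
        outs = []
        for k in range(self.heads):
            h = x @ self.W[k]                      # W^k h for all nodes
            d = h.size(1)
            # e_ij = sigma(a^T [W h_i || W h_j]), split into the two halves.
            src = h @ self.a[k, :d]                # (n,) contribution of h_i
            dst = h @ self.a[k, d:]                # (n,) contribution of h_j
            e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0))  # (n, n)
            e = e.masked_fill(~adj, float("-inf"))  # restrict to N(n_i)
            alpha = torch.softmax(e, dim=1)        # Eq. (2)
            outs.append(torch.relu(alpha @ h))     # Eq. (3), one head
        return torch.cat(outs, dim=1)              # concatenate the K heads
```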

Fig. 4

Aggregating node information with multi-head attention mechanism

Pooling layer Through the aggregation operation with the attention mechanism, we obtain the node embeddings (\({H_{{t_1}}^{{m}}},{H_{{t_2}}^{{m}}}, \ldots ,{H_{{t_T}}^{{m}}}\), where \({H_{{t_t}}^{{m}}} = \{h_1, h_2, \ldots , h_n\}_{{t_t}}^{{m}}\)) of the combined graph at each timestamp, i.e., \({G_{{t_1}}^{{m}}},{G_{{t_2}}^{{m}}}, \ldots ,{G_{{t_T}}^{{m}}}\). Then, we apply a readout function (the mean pooling operation [38], Eq. 4) over all nodes of a given graph \({G_{{t_t}}^{{m}}}\) to obtain the overall graph-level representation, reducing the whole graph to a single vector for graph classification.

$$\begin{aligned} \mathcal{R}(H)_{{t_t}}^{{m}} = \sigma \left( \frac{1}{n}\sum \limits _{i = 1}^n {{h_i}} \right) _{{t_t}}^{{m}} \end{aligned}$$
(4)

where \(\sigma \) is the sigmoid function and \(\mathcal{R}\) denotes the readout function.
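In code, Eq. (4) is a one-liner; this sketch assumes the node embeddings of one combined graph are stacked in a PyTorch tensor:

```python
import torch

def readout_mean(h: torch.Tensor) -> torch.Tensor:
    """Eq. (4): mean-pooling readout reducing a graph to one vector.

    h: (n, d) matrix of node embeddings for one combined graph G_t^m.
    """
    return torch.sigmoid(h.mean(dim=0))   # sigma(1/n * sum_i h_i)
```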

Graph classification loss function

In this paper, anomaly detection is performed via the graph classification task, and each \({G_{{t_t}}^{{m}}}\) corresponds to a label (normal or abnormal). We minimize the cross-entropy loss to train the proposed model:

$$\begin{aligned} \mathop {\min }\limits _\theta {L_{GC}} = \mathop {\min }\limits _\theta \sum \limits _{{g_i} \in S} {\sum \limits _{c = 1}^C {[ - {y_{i,c}}\log {{\hat{y}}_{i,c}}]} } \end{aligned}$$
(5)

where S represents the training samples, \({y_{i,c}}\) indicates whether graph \({g_i}\) belongs to class c, \({\hat{y}_{i,c}}\) is the predicted probability that \({g_i}\) belongs to class c, and C is the number of graph classes. \(\theta \) denotes the learnable parameters, summarized as \(\theta = \{ {W},{a_w}\} \).

The loss function is minimized by gradient descent, and the parameters \(\theta \) are learned via the back-propagation algorithm: until \({L_{GC}}\) converges, the partial derivative \(\frac{{\partial {L_{GC}}}}{{\partial \theta }}\) is computed and \(\theta \) is updated. The MT\(^2\)AD algorithm is summarized in Algorithm 1, and a toy training step is sketched after it.

Algorithm 1 The MT\(^2\)AD algorithm
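To make the training procedure concrete, here is a hypothetical end-to-end training step that reuses `MultiHeadAttentionLayer` and `readout_mean` from the sketches above; the hyper-parameter values are those reported in "Parameter settings", while the data layout is our assumption:

```python
import torch
import torch.nn as nn

encoder = MultiHeadAttentionLayer(in_dim=4, out_dim=8, heads=2)
classifier = nn.Linear(2 * 8, 2)   # K=2 heads concatenated -> 2 classes
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()),
    lr=5e-4, weight_decay=1e-5,    # values from "Parameter settings"
)
loss_fn = nn.CrossEntropyLoss()    # implements the loss of Eq. (5)

def train_step(graphs, labels):
    """graphs: list of (x, adj) pairs, one combined graph G_t^m per timestamp;
    labels: LongTensor of per-graph labels (0 = normal, 1 = abnormal)."""
    optimizer.zero_grad()
    logits = torch.stack(
        [classifier(readout_mean(encoder(x, adj))) for x, adj in graphs]
    )
    loss = loss_fn(logits, labels)
    loss.backward()                # back-propagation of dL/d(theta)
    optimizer.step()
    return loss.item()
```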

Time complexity

MT\(^2\)AD includes three modules: extraction of token transaction network initial features, multi-layer token transaction network snapshot construction, and anomaly detection with graph representation learning. In the first module, the token subgraph sampling and edge feature modeling can be computed in advance; in the second, the token network snapshot extraction and multiple token snapshot combination can also be computed in advance. Therefore, the time complexity of MT\(^2\)AD is dominated by the anomaly detection with graph representation learning. We use the graph convolutional network with multi-head attention to extract features of the T combined transaction graphs, and average pooling to obtain the graph-level representation vectors. The time complexity depends on the type of convolution and the number of layers. A graph convolution multiplies the node features by a weight matrix and then aggregates the features of neighboring nodes. The weight matrix multiplication has time complexity \(O(\vert V \vert F F')\), where F is the node feature dimension and \(F'\) the number of output features; the aggregation step has complexity \(O(\vert E \vert F')\), where \(\vert E \vert \) is the number of edges in the graph. Therefore, one convolution layer costs \(O(\vert V \vert FF' + \vert E \vert F')\). The attention mechanism computes a pairwise similarity score between each node and its neighbors and applies a softmax to normalize the scores, with time complexity \(O( F \vert V \vert ^2)\); since each head's computation is independent, the multi-head attention can be parallelized. Average pooling costs \(O(\vert V \vert )\), and \(\vert V \vert \ll {\vert V \vert ^2}\). Therefore, the overall time complexity of MT\(^2\)AD simplifies to \(O(\vert V \vert FF' + \vert E \vert F' + F \vert V \vert ^2)\).

Experiments

In this section, we present the experimental results of the proposed MT\(^2\)AD anomaly detection method. We aim to answer the following research questions (RQ):

  • RQ1 How does the proposed method benefit anomaly detection on multiple token transaction networks?

  • RQ2 How feasible is the model, i.e., how does the loss value change with the epoch (the number of complete passes over the training set), and how does a single-layer token network compare with a multi-layer token transaction network?

  • RQ3 How do different parameters (embedding dimension sizes and numbers of attention heads) affect the model?

Implementation details

Baselines

To illustrate the effectiveness of our methods, we compare our proposed MT\(^2\)AD with some competing methods:

  • MLP [39]: the Multi-layer Perceptron (MLP) is a fully connected feedforward artificial neural network (ANN) classifier. In this experiment, we use an MLP instead of a graph convolutional network to learn deep features of the transaction network; this means only the node attributes are considered, without the transaction subgraph structure.

  • GCN [9]: GCN is a graph convolutional network that performs aggregation operations between the current node and its neighbors. In this setup, the multi-head attention mechanism is not used, i.e., all neighbors have equal influence on the current node.

  • FEATHER [29]: FEATHER is a node-level and graph-level representation learning algorithm that considers neighborhood feature distributions. It computes a specific variant of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales; these characteristic functions effectively create node embeddings, and the extracted features are useful for machine learning tasks.

  • Graph2vec [30]: Graph2vec is the first neural embedding method to learn representations of entire graphs in an unsupervised manner. It is similar to doc2vec [40], requiring a corpus of graphs to learn representations, and follows the doc2vec Skipgram [25] training process.

  • GL2vec [31]: building on Graph2vec, GL2vec takes edge labels into consideration by using the line graphs (edge-to-vertex dual graphs) of the input graphs. The resulting graph embedding is the concatenation of the embeddings of the input graph and its corresponding line graph.

Datasets

We first obtained the BNB, LNK and USDT token transaction records. The labels of these blockchain transaction networks were verified in [11] and are generated according to the blockchain events from Wikipedia; therefore, the time spans of BNB, LNK and USDT match those of the labels. The end time for BNB, LNK and USDT is May 2018; the start times are July 2017, September 2017 and November 2017, respectively. We use D1 to denote this dataset. We also use the Ethereum token networks (bytom, cybermiles, decentraland, tierion, vechain and zrx) and Ripple currency networks (JPY, USD, EUR, CCK and CNY) from [11],Footnote 8 denoted D2 and D3, respectively, to evaluate the performance of the different methods. Because these two original datasets contain only transaction structure features, we likewise transform the edges' transaction values and transaction flow direction information into node attributes. The statistics of the three datasets are shown in Table 1.

Table 1 The statistics of evaluation datasets

Evaluation metrics

As in [5], we use the Precision, Recall, and F-value metrics to evaluate the anomaly detection performance of the different methods. The F-beta score can adjust the relative weight between Precision and Recall via the \(\beta \) coefficient (Eq. 6):

$$\begin{aligned} {\text {FbetaScore}} = (1 + {\beta ^2}) \cdot \frac{{{\text {Precision}} \cdot {\text {Recall}}}}{{{\beta ^2} \cdot {\text {Precision}} + {\text {Recall}}}} \end{aligned}$$
(6)

Therefore, in this paper, we use the Fbeta-Macro and Fbeta-Weighted metrics instead of the F1-value to evaluate anomaly detection performance. The \(\beta \) value is 5, 7 and 7 for D1, D2 and D3, respectively.
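For reference, these metrics correspond to scikit-learn's `fbeta_score` with macro and weighted averaging; the toy labels below are illustrative:

```python
from sklearn.metrics import fbeta_score

# Fbeta-Macro and Fbeta-Weighted as used above; beta = 5 for D1 (7 for D2/D3).
y_true = [0, 1, 0, 0, 1]   # toy labels: 1 = anomalous graph, 0 = normal
y_pred = [0, 1, 0, 1, 1]
fb_macro = fbeta_score(y_true, y_pred, beta=5, average="macro")
fb_weighted = fbeta_score(y_true, y_pred, beta=5, average="weighted")
print(f"Fbeta-Macro: {fb_macro:.3f}, Fbeta-Weighted: {fb_weighted:.3f}")
```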

Parameter settings

The graph embedding dimension d of all methods is set to 8, and the number of attention heads is set to 2. Our model uses one convolution layer to aggregate information from neighbors, so the convolution size is \(F \times 4\). The learning rate and weight decay are set to 0.0005 and 1e-5, respectively. For GCN, the parameters are taken from the official implementation. The FEATHER, Graph2vec and GL2vec implementations are those in karateclub.Footnote 9

Performance results (RQ1)

To answer RQ1, we compare the performance of all methods on the task of anomaly detection on multiple token transaction networks. The results are shown in Table 2. We make the following observations:

  • Compared with MLP, GCN greatly improves performance. The reason is that MLP only considers the manually extracted node attributes, which are not enough to describe multiple token transaction networks, whereas GCN takes the node attributes and the structure of the transaction network into account simultaneously, performing aggregation operations to integrate information from neighbors. This demonstrates the advantage of the graph convolutional network and the importance of considering network structure. In transaction networks, addresses are anonymous and nodes themselves have no attributes, so the differences between nodes lie mainly in the transaction structure and the edge attributes. In this paper, we transform the edge attributes (timestamp and transaction value) into node attributes so the graph convolutional network can extract deep features.

  • In most cases, FEATHER outperforms Graph2vec and GL2vec because it considers neighborhood features when creating node embeddings, an idea consistent with ours. Comparing FEATHER with our method, ours achieves better results, demonstrating the advantage of the graph neural network technique.

  • Comparing GCN with our proposed approach shows that our method achieves excellent results. The reason is that we use the multi-head attention mechanism to aggregate information from neighbors, which captures the differing influence of different neighbors.

  • Compared with Graph2vec and GL2vec, our method and GCN achieve better performance, demonstrating the advantage of graph neural networks over Skipgram when processing graph data: graph neural networks can capture node pair similarity from node attributes and structure simultaneously.

Table 2 The results of anomaly detection

Study of the proposed model (RQ2)

Loss change

To answer RQ2, we take the D1 dataset as an example to examine the change in loss value on the training and test sets; the result is shown in Fig. 5. The x-axis is the epoch number and the y-axis is the loss value. The training loss decreases steadily and gradually reaches a steady state, and the test loss also decreases during training. This demonstrates the effectiveness of our proposed method.

Fig. 5

Training and test loss value on D1 dataset

Table 3 Anomaly detection on a single token vs. multiple tokens

Multiple tokens

In this section, we take the D1 dataset as an example to compare anomaly detection on single-layer and multi-layer transaction networks. The results are shown in Table 3. We observe that the performance on a single token transaction network is unsatisfactory compared with the full multi-layer token transaction network (three tokens here). The reason is that a single token cannot model the comprehensive information of the transaction network, since there are cross-cryptocurrency trading patterns; MT\(^2\)AD can describe these patterns and hence improves performance. Moreover, we use the same graph encoder on all token transaction networks to learn node embeddings, which not only captures each token's own characteristics through convolutions over its structure and node attributes, but also captures the features common to all the token transaction networks through the shared encoder.

Ablation experiment

In this section, we run an ablation experiment on the node attribute matrix transformed from the edge features: we consider only the transaction structure information, i.e., the input to the graph encoder is the adjacency matrix together with an identity node attribute matrix replacing the transformed one. In this case, the edge features (out-degree, in-degree, degree and normalized transaction amount) are not taken into consideration. We take the D1 dataset as the example; the results are shown in Table 4. When the edge features are ignored, performance degrades, because the identity node attribute matrix contains no transaction flow or transaction amount information. This demonstrates that transforming the edge features into the node attribute matrix lets the graph embedding retain the edges' transaction information during the graph convolution operation.

Sensitivity analysis (RQ3)

To answer RQ3, we evaluate how two hyper-parameters, the embedding dimension size and the number of attention heads, affect the performance of our proposed method.

Effect of attention size

Figure 6 shows how our model's performance changes with the attention (head) size used when aggregating information from different neighbors. The trends differ across the three datasets: on D1, FbMacro remains nearly constant, while on D2 and D3 it takes different values at different attention sizes; the other three evaluation metrics behave similarly across the three datasets. Our method still achieves good performance when the attention size is small (2), demonstrating the importance of taking neighborhood information and its differing influence into consideration.

Table 4 Ablation experiment about the edge’s feature
Fig. 6

Attention sizes

Fig. 7

Embedding sizes

Effect of embedding size

The embedding size is another hyper-parameter that affects performance. We vary it from 4 to 64 to investigate its influence on graph classification for transaction network anomaly detection; the results are shown in Fig. 7. A large embedding size tends to overfit, so performance decreases, and it also increases computational cost. Conversely, too small an embedding size may fail to capture the transaction network information, and performance suffers: for example, on D1 and D2, the Precision at a small embedding size (\(d=4\)) is worse than at a larger one (\(d=8\)). Therefore, we set the embedding size to 8 to balance computational cost and performance.

Conclusion

In this paper, we proposed an anomaly detection model for multi-layer Ethereum transaction networks based on graph representation learning and demonstrated its effectiveness through three research questions. We first compared its anomaly detection performance against several baselines, demonstrating its competitiveness. We then analyzed the loss value over epochs and single-layer versus multi-layer token transaction networks to show the method's feasibility. Finally, we analyzed sensitivity to the embedding dimension size and the number of attention heads. The key of the proposed model is that it transforms anomaly detection into a graph classification task and models the edge features as node attributes, allowing the graph convolutional network with attention to integrate the transaction structure and node attributes naturally. Moreover, the method captures cross-cryptocurrency trading patterns on multiple token transaction networks through the multi-layer snapshot construction module and the shared graph encoder. These strategies improve the model's overall anomaly detection performance.

As a next step, we will explore unsupervised graph neural network methods to analyze the Bitcoin transaction network. This is especially important, as labelling Bitcoin addresses is a fundamental problem in the growing field of blockchain analytics.