1 Introduction

Blockchain technologies are having a disruptive effect in finance as well as in many other fields. Building on cryptographic techniques, blockchains can provide programmable transaction services directly to the masses in a trustless, fast, and low-cost manner by removing middleman organizations that operate in a classical centralized fashion. Blockchains can keep ownership records of crypto-assets such as cryptocurrencies and of tokens that represent company shares, stablecoins (tokenized forms of fiat currencies), media, and works of art. Blockchains make it easy to transfer and trade crypto-assets all over the world, which creates big opportunities for value creation but also introduces challenges, since crypto-assets can move freely among different jurisdictions, bypassing regulatory controls.

Figure 14.1 illustrates the global movements of crypto-assets on decentralized, autonomous, unregulated blockchains and their entry into regulated financial systems in different jurisdictions. These global crypto-asset movements can render anti-money laundering laws ineffective and make crypto-assets obtained from illegal activities, such as ransomware, scam-related initial coin offerings [1], and theft, difficult to catch. In particular, exchange companies provide trading services that allow crypto-assets to be bought and sold with fiat currencies, which can then be deposited into or withdrawn from bank accounts. Such mechanisms may allow fraudulently obtained funds to enter regulated environments. Hence, there is a need to trace such movements of tainted crypto-assets on blockchains so that the necessary actions can be taken to block them.

Fig. 14.1
figure 1

Global movements of crypto-assets on unregulated blockchains and their entry into regulated financial systems in different jurisdictions

When the Internet first appeared, there was a big need for web page search services, which led to the development of search engine companies. There is a similar need in the case of blockchain networks, namely the need to get information on addresses, tokens, and transactions on the blockchain. This need has led to block explorer services offered by companies like Etherscan, currently the most popular explorer service for Ethereum. Block explorer services provide basic information on individual addresses, tokens, and transactions such as amounts, balances, and transaction times. Blockchain transactions, however, form a directed graph, and it is possible to analyze this graph in order to trace fraudulent activities, obtain provenance information about tokens representing products, and in general use it for business intelligence. Table 14.1 shows a list of well-known companies that provide services for blockchain transaction analysis. The first six companies, (a)–(f), in the table focus on providing blockchain intelligence services for the detection and prevention of financial crime involving crypto-assets. These services can be used by financial institutions and government agencies. The seventh company, (g) Etherscan, is the well-known block explorer service company, which has recently added a service called ETHProtect that traces tainted funds down to their origin. The last three companies in the table, (h), (i), and (j), do not focus on crime detection but rather provide information that can be used for general-purpose business analytics.

Table 14.1 Blockchain transaction graph analysis companies

The importance of analyzing large-scale blockchain graphs is further evidenced by the fact that the global money laundering and terrorist financing watchdog, the Financial Action Task Force (FATF), has published guidance [2] for Virtual Asset Service Providers (VASPs). The guidance states that “VASPs be regulated for anti-money laundering and combating the financing of terrorism (AML/CFT) purposes, licensed or registered, and subject to effective systems for monitoring or supervision.” The recent Colonial Pipeline ransomware incident [3, 4], in which hackers invaded the company’s systems and demanded nearly 5 million USD, also led to a serious response from the U.S. authorities. This can also be taken as evidence of the importance of tracking fraudulent transactions on blockchains. Finally, we note that the valuations of newly emerging companies that offer blockchain graph analysis services are quite high. For example, Chainalysis raised 100 million USD in venture capital at a 1 billion USD valuation [5]. The Graph [6], which as of July 13, 2021 is valued at 750 million USD (as reported in [7]), provides an indexing protocol for querying networks like Ethereum and lets anyone build and publish subgraphs. Several academic papers have also addressed the analysis of blockchain transaction graphs [8,9,10,11,12,13,14]. All of these facts point to the importance of building a sustainable system for analyzing large-scale blockchain graphs, whose sizes are growing and will grow even more rapidly when new blockchain technologies with higher transaction throughput rates are deployed in the near future.

In the rest of the chapter, we first cover scalability issues in blockchain transaction graph analytics in Sect. 2. In Sect. 3, we cover the data structures of transaction graphs. In Sect. 4, we present the Big Data Value Association (BDVA) reference model [15] view of our blockchain graph analysis system, and in Sect. 5 we present parallel blacklisted address transaction tracing. In Sect. 6, we present tests of our graph analysis system. Finally, we close the chapter with a discussion and our conclusions in Sect. 7.

2 Scalability Issues in Blockchain Transaction Graph Analytics

Blockchain transaction graph analysis systems have to be scalable. That is, they should continue to perform analyses with slowly growing time costs as transaction numbers increase. As of June 16, 2021, the Bitcoin blockchain has 649.5 million transactions and Ethereum has 1.2 billion. The current proof-of-work (PoW)-based Bitcoin and Ethereum blockchains have very low transaction throughputs, as shown in Table 14.2. Newer proof-of-stake-based systems like Ethereum2, Avalanche, and Cardano will be able to achieve thousands of transactions per second (tps). Permissioned Hyperledger, which is designed for enterprises, can also achieve thousands of tps. Such high transaction throughputs mean that transaction graphs are expected to grow by many billions of edges in the forthcoming years. For example, a sustained rate of 4000 tps roughly equals 345 million transactions in a day and 2.4 billion transactions in a week.
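As a quick check of these figures (assuming the 4000 tps rate is sustained around the clock):

\[
4000~\text{tps} \times 86{,}400~\text{s/day} = 3.456 \times 10^{8} \approx 345~\text{million transactions per day},
\]
\[
3.456 \times 10^{8}~\text{transactions/day} \times 7~\text{days} \approx 2.4 \times 10^{9} \approx 2.4~\text{billion transactions per week}.
\]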

Table 14.2 Transaction throughputs

In order to handle a massive number of transactions in a graph whose size is growing by billions in a month, the graph analytics software developed must work in parallel and should employ distributed data structures that keep the transaction graph partitioned among multiple processors. If efficient parallel algorithms are employed, such a system can scale simply by increasing the number of processors used. This is the approach adopted in this work, i.e., we design a system that employs parallel algorithms and runs on distributed memory systems. In Sect. 3, we present an overview of transaction graphs, and then in Sect. 4 we present the architecture of our system using the BDVA reference model [15, p. 37].

3 Distributed Data Structures of Transaction Graphs

The ledger maintained on the blockchain can be (i) account based or (ii) unspent transaction output (UTXO) based. In account-based blockchains, the coin balance of an address is kept in a single field. In UTXO-based systems, there can be several records in the ledger owned by an address, each one keeping track of an amount of coins available for spending. When a payment is performed, a number of records whose total value covers the payment amount are taken as input to a transaction, and new unspent amount records are generated representing the new amounts available for spending. The payment recipients are the owners of these records. The transaction graphs of account- and UTXO-based systems are shown in Fig. 14.2. The Ethereum blockchain is account based, whereas Bitcoin is UTXO based. Account-based transactions can be represented as a directed graph where each vertex represents an address and each directed edge represents a transaction (i.e., an asset transfer from one address to another). A UTXO transaction graph can be represented as an AND/OR graph \(G(V_a, V_t, E)\) with \(V_a\) as the set of addresses, \(V_t\) as the set of transactions, and \(E\) as the set of edges, defined as \(E = \{\langle a,t\rangle : a \in V_a,\ t \in V_t\} \cup \{\langle t,a\rangle : t \in V_t,\ a \in V_a\}\). Here, each transaction \(t\) can be thought of as an AND vertex, and an address can be thought of as an OR vertex. In Fig. 14.2b, circle nodes represent addresses and square nodes represent transactions. The amount of the payment can be stored on each edge.
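The two models can be captured with very simple in-memory structures. The following Python sketch is purely illustrative (the class and field names are ours, not those of the system described in this chapter): an adjacency list for the account-based case, and a bipartite address/transaction structure for the UTXO AND/OR case.

```python
from collections import defaultdict

# Account-based model: vertices are addresses; a directed edge src -> dst
# carries the transferred amount.
class AccountGraph:
    def __init__(self):
        self.out_edges = defaultdict(list)    # address -> [(address, amount), ...]

    def add_transfer(self, src, dst, amount):
        self.out_edges[src].append((dst, amount))


# UTXO model: an AND/OR graph with address vertices (OR nodes) and transaction
# vertices (AND nodes); edges go address -> transaction (records being spent)
# and transaction -> address (new unspent records), each carrying an amount.
class UTXOGraph:
    def __init__(self):
        self.tx_inputs = defaultdict(list)     # transaction -> [(address, amount), ...]
        self.tx_outputs = defaultdict(list)    # transaction -> [(address, amount), ...]

    def add_transaction(self, tx, inputs, outputs):
        self.tx_inputs[tx].extend(inputs)      # records consumed by tx
        self.tx_outputs[tx].extend(outputs)    # records created by tx
```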

Fig. 14.2
figure 2

Distributed data structures of (a) account-based and (b) UTXO-based blockchain graphs

Since our graph system keeps the transaction graph in a partitioned fashion, the data structures are distributed over the processors of a High Performance Computing (HPC) cluster. Figure 14.2 also illustrates the partitioning of a simple example transaction graph by coloring in blue, green, and purple the addresses and transactions that are stored in a distributed fashion on each of three processors, P0, P1, and P2.
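How vertices are assigned to processors is an implementation choice: a graph partitioner can be used, or, as in the minimal sketch below (illustrative only, with a hypothetical owner function), a deterministic hash of the vertex identifier, so that each MPI process stores only its own slice of the edge list.

```python
import zlib

P = 3  # number of processors, P0, P1, P2, as in Fig. 14.2

def owner(vertex_id: str, num_procs: int = P) -> int:
    """Deterministically map an address or transaction id to a processor."""
    return zlib.crc32(vertex_id.encode()) % num_procs

def local_partition(all_edges, p):
    """Edges kept by processor p: those whose source vertex it owns."""
    return [(src, dst, amount) for (src, dst, amount) in all_edges
            if owner(src) == p]
```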

4 BDVA Reference Model of the Blockchain Graph Analysis System

The Big Data Value Association has developed a reference model [15, p. 37] to position Big Data technologies on the overall IT stack. Figure 14.3 shows where the components of our graph analysis system are located on the BDVA reference model.

Fig. 14.3
figure 3

Components of the blockchain graph analysis system located on the BDVA reference model

At the lowest level of the stack, data about crypto-asset movements (the cryptocurrencies Bitcoin and Ethereum and ERC20 token transfers) are obtained from blockchains. Note that in this chapter, we only report results for the Bitcoin and Ethereum datasets; we do not currently have real-life token transfer datasets for Hyperledger. We use a cloud-based HPC cluster that is managed by StarCluster [20] and programmed using MPI [21]. The transactions are parsed, and cryptocurrency and token transactions are extracted and put into files. Since our datasets are public and no identities are attached to the addresses, the data protection layer is empty. The data processing architecture layer constructs in-memory distributed data structures for the transaction graph; this is done in a parallel fashion. Graph partitioning software like Metis [22] can be used to produce load-balanced partitions with small partition boundaries so that we have low communication costs while running the various graph algorithms in parallel. The Data Analytics layer contains the parallel graph algorithms that are covered in this chapter and evaluated in the results, Sect. 6. In the future, this layer will also contain machine learning algorithms, which are currently not implemented. The graph algorithms part, however, enables us to compute various features (node degrees, pagerank, shortest paths from blacklisted addresses, etc.), which in the future can provide data to machine learning algorithms; a small illustrative sketch of such a feature computation is given after this paragraph. The topmost layer currently provides an interface, through a message queue, to Python programs which can implement visualization of subgraphs that contain tracing information to blacklisted addresses. In the next section, we present the parallel fraudulent transaction tracing algorithm that is used.
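As a concrete illustration of the kind of feature computed in the Data Analytics layer, the sketch below counts in-degrees on a distributed edge list using mpi4py. It is an illustrative example under our own assumptions (hash-based vertex ownership, edges stored on the owner of their source address), not the production code of the system.

```python
from collections import Counter
from mpi4py import MPI
import zlib

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def owner(addr):
    # Deterministic hash so every process agrees on the owner of an address.
    return zlib.crc32(addr.encode()) % nprocs

def in_degrees(local_edges):
    """local_edges: list of (src, dst, amount) tuples stored on this process."""
    partial = [Counter() for _ in range(nprocs)]
    for _src, dst, _amount in local_edges:
        partial[owner(dst)][dst] += 1        # partial count destined for dst's owner
    received = comm.alltoall(partial)        # route partial counts to the owners
    totals = Counter()
    for counts in received:
        totals.update(counts)                # merge counts for locally owned addresses
    return totals
```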

5 Parallel Fraudulent Transaction Activity Tracing

Given a blockchain address of a customer and a set of blacklisted addresses, we want to identify a subgraph that traces transaction activities between the blacklisted addresses and the queried customer address. Figure 14.4 shows two example trace subgraphs: (a) for the DragonEx Hacker [23] on the Ethereum blockchain and (b) for the July 15 Twitter Hack [24] on the Bitcoin blockchain. Such trace subgraphs are returned by the parallel tracing algorithm that is presented in this section.

Fig. 14.4
figure 4

(a) DragonEx Hacker (Ethereum) [23] and (b) July 15 Twitter Hack (Bitcoin) [24] trace subgraphs returned by the parallel tracing algorithm

Algorithm 1 Parallel blacklisted address transaction tracing: part 1

Algorithm 2 Parallel blacklisted address transaction tracing: part 2

The parallel blacklisted address transaction tracing algorithm is given in Algorithms 1 and 2. The algorithm finds the subgraph between two sets of nodes. We use it to trace the set of queried addresses Q back to the set of blacklisted addresses B. The algorithm first finds the nodes reachable from B, denoted R_B, by depth-first traversal (line 2). Then, it finds the nodes that can reach Q, denoted R_Q, by traversing the edges in the reverse direction (line 3). The intersection of R_B and R_Q gives us the set of nodes of the subgraph V′^p (line 4). We use distributed data structures, and the superscript p denotes the part of the data on processor p. Each local set of subgraph nodes is exchanged between the processors, and we get the global set of nodes V in all processors (line 5). Each process then finds the local edges whose source and target are in V and that lie within the given block range from T_s to T_e (which also represents a time range, since all of the transactions in a block have the same timestamp) (line 6). T_s is the block range start and denotes the start time of the trace; T_e is the block range end and denotes the end time of the trace. The algorithm returns the subgraph (line 7).
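A sequential Python sketch of this top-level routine is given below for readability; the function and variable names are illustrative, and the actual system performs both traversals in parallel over the distributed partitions, as described next.

```python
def reachable(edges, start, t_s, t_e, reverse=False):
    """Nodes reachable from 'start' using edges whose block is in [t_s, t_e].
    If reverse is True, edges are traversed backwards."""
    adj = {}
    for (u, v, blk) in edges:
        if not (t_s <= blk <= t_e):
            continue
        a, b = (v, u) if reverse else (u, v)
        adj.setdefault(a, []).append(b)
    seen, stack = set(start), list(start)
    while stack:
        cur = stack.pop()
        for nxt in adj.get(cur, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def trace_subgraph(edges, B, Q, t_s, t_e):
    """edges: list of (src, dst, block); B: blacklisted, Q: queried addresses."""
    R_B = reachable(edges, B, t_s, t_e)                 # nodes reachable from B
    R_Q = reachable(edges, Q, t_s, t_e, reverse=True)   # nodes that can reach Q
    V = R_B & R_Q                                        # nodes of the trace subgraph
    E = [(u, v, blk) for (u, v, blk) in edges
         if u in V and v in V and t_s <= blk <= t_e]     # induced edges in range
    return V, E
```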

The depth-first search function takes the current processor ID p, the total processor count P, the local graph G^p, the set of starting addresses F, the block range start T_s, the block range end T_e, and the traversal direction Rev. It returns the set of nodes R^p that are reachable from the set of starting nodes F. C^p is the set of nodes that need to be visited but are located on remote processors. M^p is the set of remote nodes that have already been sent to be visited. S^p is the stack of nodes to be visited. At the start, S^p contains the local nodes that are in F (line 17). The algorithm first traverses the locally reachable nodes (lines 20–38) and then sends the remote nodes to be visited, C^p, to the corresponding processors (lines 40–52). This loop continues until there is no node left in the stack of any processor. TCN is the total number of nodes in C^p over all processors; it is initialized to 1 so that the first loop iteration runs (line 18). The nodes in the stack S^p are traversed until the stack is empty. At each step, a node cur is taken from the stack and marked as visited (lines 21–22). The nodes at the other end of the outgoing edges of cur (or of the incoming edges, if Rev is true) are then traversed (lines 23–37). If an edge is within the given block range, its endpoint needs to be visited (line 29). If it is a local node that has not already been visited, it is added to the stack (line 30). If it is a remote node that has not already been sent to another processor to be visited, it is added to the set of remote nodes to be visited C^p (line 33). The total number of nodes in C^p over all processors is then calculated (line 39). If any processor still has nodes that need to be sent to remote processors and visited, the sets C^p are exchanged between processors; each C^p is added to M^p so that the same nodes are not sent again (line 49), C^p is cleared, and the received nodes are added to the stack S^p.
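A condensed mpi4py sketch of this exchange loop is shown below. It is illustrative rather than the exact implementation: the owner function is assumed to be the hash-based one sketched earlier, and the traversal direction Rev is handled by passing an adjacency structure that is already oriented forward or in reverse.

```python
from mpi4py import MPI
import zlib

comm = MPI.COMM_WORLD
p, P = comm.Get_rank(), comm.Get_size()

def owner(node):                                     # assumed owner function
    return zlib.crc32(str(node).encode()) % P

def parallel_dfs(adj, F, t_s, t_e):
    """adj: local adjacency, node -> [(neighbor, block), ...], already oriented
    forward or in reverse; F: global set of start nodes. Returns the locally
    owned nodes reachable from F within the block range [t_s, t_e]."""
    R = set()                                        # R^p: visited local nodes
    M = set()                                        # M^p: remote nodes already sent
    S = [n for n in F if owner(n) == p]              # S^p: local DFS stack
    TCN = 1                                          # total pending remote nodes
    while TCN > 0:
        C = [set() for _ in range(P)]                # C^p: remote nodes to visit
        while S:
            cur = S.pop()
            R.add(cur)                               # mark cur as visited
            for nxt, blk in adj.get(cur, []):
                if not (t_s <= blk <= t_e):
                    continue                         # edge outside block range
                if owner(nxt) == p:
                    if nxt not in R and nxt not in S:
                        S.append(nxt)                # local node, visit later
                elif nxt not in M:
                    C[owner(nxt)].add(nxt)           # remote node, send to its owner
                    M.add(nxt)
        TCN = comm.allreduce(sum(len(c) for c in C), op=MPI.SUM)
        if TCN > 0:
            received = comm.alltoall(C)              # exchange remote frontiers
            for frontier in received:
                S.extend(n for n in frontier if n not in R)
    return R
```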

6 Tests and Results

In order to test our blockchain graph analysis system, we have set up an Amazon cloud-based cluster using the StarCluster cluster management tools. Our cluster had 16 nodes (machine instances). Ethereum tests used c5.4xlarge machine instances, each of which has 16 virtual CPUs and 32 GiB of memory. Bitcoin tests used r5b.4xlarge machine instances, each of which has 16 virtual CPUs and 128 GiB of memory. As datasets, we have used roughly nine years of Bitcoin and five years of Ethereum blockchain data. The full details of the transaction data used are given in Table 14.3. Note that our Ethereum dataset also contains major ERC20 token transfer transactions. The Ethereum dataset is publicly available and can be downloaded from the Zenodo site [25]. We have also collected from the Internet various blacklisted blockchain addresses that have been involved in ransomware, scam-related initial coin offerings, and hacked funds. For example, the sources [26, 27] provide a list of Bitcoin addresses involved in ransomware and sextortion scams.

Table 14.3 Bitcoin and Ethereum blockchain data statistics used in tests

Table 14.4 shows the various computations carried out on the Bitcoin and Ethereum datasets. Detailed timings of the Ethereum tests were previously reported in an earlier publication [13] on a slightly smaller dataset. Computations on the Bitcoin transaction graph are reported for the first time in this book chapter. The tests were run with node counts of 4, 8, 12, and 16 and with 1, 2, 4, 8, 12, and 16 MPI processes per node. In addition, Table 14.4 shows the best timing obtained and the corresponding number of nodes and total number of MPI processes for the run.

Table 14.4 Description of tests and best timings obtained for Bitcoin (btc) and Ethereum (eth) blockchain data

It is important to note here that if the graph analysis system we developed were not based on a distributed memory model, we would have problems fitting the whole graph on a single node. In fact, even a distributed memory system can run into out-of-memory problems when using a small number of nodes (for example, the Bitcoin graph did not fit on 4 nodes). The good thing about a distributed memory system is that one can simply increase the number of nodes; since the graph is partitioned, the partitions get smaller and fit on the nodes of the cluster. This is how we have been able to run our analyses on large transaction graphs, and it is also how we will continue to be able to run them in the future when blockchain graph sizes grow drastically due to the increased blockchain throughputs given in Table 14.2. Distributed memory thus gives us scalability as far as graph storage is concerned. There is, however, another scalability issue that we should be concerned about, namely the scalability of our computational processes. When the number of processors is increased, communication among processors increases. Since communication is expensive, it may increase the execution time. The overall execution time should decrease or, if it grows, it should grow slowly. In general, once the transactions are loaded from disk and the graph is constructed (i.e., Test T1), the execution times of the computational tests (T2–T9) decrease, levelling off after the number of processors is increased beyond some point. This is as expected, because even though the work per processor decreases, the communication cost increases.

Since graph construction (Fig. 14.5) involves reading transactions from disk and then uses ring communication (see the algorithm in [13]) to implement an all-to-all exchange in order to construct the graph, we can expect its timing to level off and even increase as the number of nodes is increased. Figure 14.5 shows this happening. In the case of Ethereum, the timing levels off. For Bitcoin, since the number of addresses is much higher, we see the times decrease and then increase due to the increased communication cost.

Fig. 14.5
figure 5

Test T1: Graph construction times for Bitcoin (a) and Ethereum (b) on the HPC cluster. The labels T1-btc(i) and T1-eth(i) mean i virtual processors (MPI processes) per node of the cluster

Figure 14.6 shows the execution times for the pagerank computation. Note that after graph construction, the graph resides in memory and the pagerank computations are run on the distributed data structures. We see faster processing times as the number of processors increases.

Fig. 14.6
figure 6

Test T3: Page ranking times for Bitcoin (a) and Ethereum (b) on the HPC cluster. The labels T3-btc(i) and T3-eth(i) mean i virtual processors (MPI processes) per node of the cluster
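For reference, a condensed distributed pagerank sketch in mpi4py is shown below. It is a simplified illustration under our own assumptions (hash-based vertex ownership, a fixed number of iterations, no handling of dangling nodes), not the exact implementation whose timings are reported in Fig. 14.6.

```python
from mpi4py import MPI
import zlib

comm = MPI.COMM_WORLD
p, P = comm.Get_rank(), comm.Get_size()

def owner(node):                                 # assumed owner function
    return zlib.crc32(str(node).encode()) % P

def pagerank(local_nodes, out_edges, n_global, d=0.85, iters=20):
    """local_nodes: addresses owned by this process; out_edges: node -> [targets]
    for locally owned sources; n_global: total number of nodes in the graph."""
    rank_ = {v: 1.0 / n_global for v in local_nodes}
    for _ in range(iters):
        contrib = [dict() for _ in range(P)]
        for v in local_nodes:
            targets = out_edges.get(v, [])
            if not targets:
                continue                         # dangling node: mass not redistributed
            share = rank_[v] / len(targets)
            for t in targets:
                bucket = contrib[owner(t)]
                bucket[t] = bucket.get(t, 0.0) + share
        received = comm.alltoall(contrib)        # route contributions to owners
        new_rank = {v: (1.0 - d) / n_global for v in local_nodes}
        for bucket in received:
            for t, c in bucket.items():
                if t in new_rank:                # t is owned by this process
                    new_rank[t] += d * c
        rank_ = new_rank
    return rank_
```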

In Table 14.5, we show the analysis results obtained from the pagerank computation. The ten most important addresses are shown in this table. The most important addresses belong to exchange companies and popular gambling and betting sites. Note that the pagerank computation helps us find important addresses such as those of exchange companies or other popular services. In particular, exchange companies are required to do identity checks on customers. Customers who deposit to or withdraw from such sites are very likely to have had their identities verified. Hence, even though we do not know the actual identities of the people owning addresses, we can infer that their identities have been verified. The pagerank information can be used as a feature in machine learning algorithms for scoring purposes in the future. This is one of the motivations for computing pagerank in our graph analysis system.

Table 14.5 Top 10 ranked addresses on Bitcoin and Ethereum transaction graph

In Table 14.6, we also give statistics on the total number of distinct addresses at the tail/head of incoming/outgoing transactions to/from the k most important addresses for the two datasets. Given the 800 million and 79 million addresses in the Bitcoin and Ethereum datasets, respectively, what fraction of these addresses directly transacted with the k most important addresses? Since Bitcoin is UTXO based, address reuse is expected to be rare, and hence we have lower percentages. On the other hand, since Ethereum is account based, address reuse is more frequent, and the fractions are higher. Given that about half (48%) of the addresses directly transacted with the 1000 most important addresses, efforts can be concentrated on this small set of important addresses to see if their identity verification procedures are strong. If they are strong, then lower risk scores can be assigned to the addresses that transacted with them.

Table 14.6 Total number of distinct addresses at the tail/head of incoming/outgoing transactions to/from the k most important addresses for the datasets given in Table 14.3

7 Discussion and Conclusions

In this chapter, we have described the blockchain transaction graph analysis system we are currently developing, its architecture, and the results obtained from running various parallel graph algorithms on massive Bitcoin and Ethereum blockchain datasets. Some graph algorithms, such as pagerank and the forest of tracing trees, operate on the whole graph. This introduces the problem of dynamically growing blockchain datasets no longer fitting in the memory of a single node. Our parallel and distributed approach (i) solves this single-node memory bottleneck problem and (ii) speeds up the computations. The work presented and the results obtained in this chapter demonstrate that we achieve both (i) and (ii) for the implemented graph algorithms.

The computing infrastructure needed to run such analyses is readily available to everyone. A low-cost cluster can be started on a cloud provider like Amazon. The software libraries that we have used for development, i.e., MPI, are open source and free to use. Therefore, our system can be run even by small businesses with very low investments. In fact, a cluster can also be set up easily using local machines on a local area network on the premises of a company, and MPI can run on this cluster.