1 Introduction

Ethereum is the second most valuable cryptocurrency and is widely used in many areas. On the Ethereum blockchain, transaction data is publicly stored on every full node, so any node can access the complete information on the chain. Unfortunately, malicious users can also analyze this data and, with the help of background knowledge, infer correlations between addresses and even the identities of other users.

With the increasing attention to privacy protection, more and more users are trying to hide the association between their own addresses using privacy preservation mechanisms such as coin mixing [3, 8,9,10,11,12, 15].

Tornado Cash is one of the most popular coin mixing tools on Ethereum: nearly 1.6 million Ether, worth over 2.4 billion USD, has flowed into its coin mixing contracts, and its handling fees have reached a high of $2.82 million (Footnote 1). Users of Tornado Cash only need to invoke the relevant smart contract to complete the coin mixing process, which makes it very convenient to reduce the linkability of their addresses. However, inappropriate use makes the service much less effective and exposes security concerns.

This paper presents the first heuristic address correlation clustering approach based on the user’s behaviour in the Tornado Cash coin mixing scenario, analyses the vulnerability of Tornado Cash, and performs experimental analysis on the transaction set.

In summary, the contributions of this paper include:

  • We first formally analyze the correlation of transactions on the Tornado Cash coin mixing service and systematically summarize the behavior patterns of users in Tornado.

  • We propose three heuristic clustering rules to achieve address correlation for Tornado coin mixing transactions based on the time interval features behind the proposed two types of transaction patterns.

  • We perform an experimental analysis on a real-world transaction dataset from Tornado Cash. The results demonstrate the feasibility and effectiveness of the proposed heuristic clustering rules.

The remainder of the paper is organized as follows: In Sect. 2, we briefly review the related work on Ethereum address clustering. Section 3 presents the background knowledge of the proposed approach. Section 4 gives formal definitions of transactions and summarises transaction patterns based on the analytics of data. Section 5 proposes heuristic clustering rules and discusses the experiment results. Section 6 concludes this paper and describes the future work.

2 Related Work

In recent years, researchers have paid increasing attention to the analysis of Ethereum privacy. However, existing analyses are still limited and mainly explore the privacy of Ethereum users in terms of address correlation.

In general, address correlation methods on Ethereum fall into two major categories. The first uses machine learning and node embedding methods to cluster transaction behaviour patterns or user accounts with similar characteristics. Sun et al. [13] first applied a node embedding algorithm to the clustering of Ethereum accounts. Hu et al. [7] designed a transaction-based classification and detection method for Ethereum smart contracts by summarizing contract transaction behavior patterns. Bhargavi et al. [2] analyzed Ethereum transaction information to infer behaviour characteristics in supervised and unsupervised settings.

The other category uses heuristic or graph-based clustering algorithms to link addresses that participated in certain transactions. Chan et al. [4] first explored the feasibility of using graph analysis to link Ethereum addresses. Chen et al. [6] proposed an entity-finding algorithm across tokens based on graph analysis. Chen et al. [5] analyzed the contract graph to cluster multiple smart contract accounts controlled by the same entity using weakly connected components. Victor [14] proposed heuristic address clustering rules based on users’ behaviors in airdrops, ICOs and exchanges.

The existing work on analyzing the vulnerability of coin mixing services on Ethereum is that of Béres et al. [1]. They used heuristic rules for coin mixing services to obtain multiple ground truth sets for evaluating other clustering algorithms. Nevertheless, in actual transactions, identical custom-set gas prices whose last 9 digits are non-zero generally come from transactions initiated by the same address. Hence, their proposed rules based on gas prices are not of practical significance.

Based on the above survey, there is no systematic study of address clustering that takes coin mixing services on Ethereum into account.

3 Preliminaries

This section introduces the basics of Tornado Cash and outlines the coin mixing principles and processes in Tornado Cash.

3.1 Basics of Tornado Cash

Tornado Cash is a smart contract on Ethereum that uses Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARKs) to achieve unlinkability between addresses that belong to the same user and to protect user privacy in a trustless manner. This paper takes ETH as an example to analyse the Ethereum transactions on Tornado Cash. To avoid address linking through unique transaction values, Tornado deployed four smart contracts with different denominations, each of which provides coin mixing for a fixed value. The details are shown in Table 1.

Table 1. Tornado Cash about ETH.

3.2 Coin Mixing Process in Tornado Cash

Users are required to complete the coin mixing in two steps: deposit and withdraw. As shown in Fig. 1, the dotted line denotes the contract invocation, and the solid line denotes the funds flow.

Fig. 1. The process of the Tornado Cash coin mixing contract. (Color figure online)

The user invokes the contract \(SC_{N.ETH}\) from address a to create an N.ETH deposit transaction and obtains a deposit note. After a period of time, the user uses this deposit note to create an N.ETH withdraw transaction, and the contract returns the previously deposited cryptocurrency to address b. The withdrawal address can be any address the user wants to remit funds to, even an unused address that has never been funded. Such an unused address does not hold enough Ether to cover the gas fee. In this case, the user can send a request with the required parameters to a relayer r, and the relayer creates the withdraw transaction upon the request, as shown by the red connecting line in Fig. 1. The relayer r uses a fraction of the fee f to pay the gas fee. When the contract successfully validates the parameters in the withdraw transaction, it sends \((N-f)\) Ether to address b and f to relayer r.

Because of the zero-knowledge property of zk-SNARK, Tornado guarantees that the two transfers are completely independent. Furthermore, a deposit transaction corresponds to only one withdraw transaction.

4 Analysis of Tornado Cash

An overview of our methodology is shown in Fig. 2. It is divided into three steps: data acquisition, data analysis and clustering, and data presentation. Data acquisition covers obtaining the related Ethereum transactions, decoding individual fields and deleting useless fields; data analysis covers the statistical analysis of the transaction data, followed by experiments with the proposed clustering rules; data presentation covers the address clustering results.

Fig. 2. Overview of our methodology architecture.

4.1 Definitions

This section gives our formal definitions for Tornado Cash transactions.

Definition 1

(Transaction). A transaction is represented as \(\mathbf {tx}= \{\mathbf {hash}, \mathbf {from}, \mathbf {to}, \mathbf {value}, \mathbf {input}, \mathbf {ts}\}\), where:

  • \(\mathbf {hash}\) is the hash value of \(\mathbf {tx}\);

  • \(\mathbf {from}\) is the address that creates \(\mathbf {tx}\);

  • \(\mathbf {to}\) is the target address of the transaction \(\mathbf {tx}\). In particular, in a smart contract invocation transaction, \(\mathbf {to}\) is the address of the smart contract;

  • \(\mathbf {value}\) is the value of \(\mathbf {tx}\);

  • \(\mathbf {input}\) is the invocation parameter when \(\mathbf {tx}\) is a smart contract invocation transaction;

  • \(\mathbf {ts}\) is the timestamp of the block in which \(\mathbf {tx}\) was packaged.

The set of transactions is represented as \(\mathcal {TX} = \mathbf {\{tx_1, tx_2,\ldots ,tx_n\}}\).
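
As a minimal sketch, Definition 1 can be mirrored by a small Python record type. The class name `Tx`, the field name `sender` (used because `from` is a reserved word in Python) and the sample values are illustrative assumptions, not part of the original definition.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Tx:
    """One transaction tx = {hash, from, to, value, input, ts}."""
    hash: str              # hash value of the transaction
    sender: str            # `from`: address that creates the transaction
    to: str                # target address (the contract address for invocations)
    value: int             # value of the transaction, here assumed to be in wei
    input: Dict[str, Any]  # decoded invocation parameters
    ts: int                # timestamp of the block containing the transaction


# Hypothetical deposit transaction whose input carries only the commitment.
d = Tx(hash="0xd1", sender="0xA", to="0xMixer", value=10**18,
       input={"commitment": "0xc0"}, ts=1_600_000_000)
```

A set of transactions \(\mathcal {TX}\) is then simply a list of such records.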

Definition 2

(Deposit Transaction). The deposit transaction set is represented as \(\mathcal {D}=\{\mathbf {d_1},\mathbf {d_2},\ldots ,\mathbf {d_n}\}\subseteq \mathcal {TX}\), where \(\forall \mathbf {d_i}\in \mathcal D\), \(\mathbf {d_i.input}=(\mathbf {commitment})\), and the \(\mathbf {commitment}\) field is the parameter used for the zk-SNARK proof.

Definition 3

(Withdraw Transaction). The withdraw transaction set is represented as \(\mathcal {W}=\{\mathbf {w_1},\mathbf {w_2},\ldots ,\mathbf {w_n}\}\subseteq \mathcal {TX}\), where \(\forall \mathbf {w_i}\in \mathcal W\), \(\mathbf {w_i.input}=(\mathbf {proof}, \mathbf {nullifierHash}, \mathbf {recipient}, \mathbf {relayer}, \mathbf {fee}, \mathbf {refund})\). Here \(\mathbf {proof}\) and \(\mathbf {nullifierHash}\) are the parameters used for the zk-SNARK, \(\mathbf {recipient}\) is the target address for receiving the withdrawn funds, \(\mathbf {relayer}\) is the address of the relayer, \(\mathbf {fee}\) is the transaction fee given to the relayer, and \(\mathbf {refund}\) is the parameter related to the refund.

  • if \(\mathbf {w.input.fee\not =0}\) and \(\mathbf {w.input.recipient} \not = \mathbf {w.from}\), then \(\mathbf {w}\) is a withdraw transaction using the relayer, and the true withdraw target address is \(\mathbf {w.input.recipient}\).

  • if \(\mathbf {w.input.fee = 0}\) and \(\mathbf {w.input.recipient} =\mathbf {w.from}\), then \(\mathbf{w}\) is a withdraw transaction not using the relayer.
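
The two cases of Definition 3 can be sketched as a small classifier. The function name `withdraw_kind` and the label strings are hypothetical; the third label, `"other"`, is an assumption added here for combinations that Definition 3 does not cover.

```python
def withdraw_kind(sender: str, recipient: str, fee: int) -> str:
    """Classify a withdraw transaction by its fee and recipient fields.

    sender    -- w.from
    recipient -- w.input.recipient
    fee       -- w.input.fee
    """
    if fee != 0 and recipient != sender:
        return "relayer"   # forwarded by a relayer; true target is `recipient`
    if fee == 0 and recipient == sender:
        return "direct"    # created by the recipient itself, no relayer
    return "other"         # combination not covered by Definition 3


print(withdraw_kind("0xRelayer", "0xB", 5))  # relayer
print(withdraw_kind("0xB", "0xB", 0))        # direct
```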

4.2 Data Acquisition

Transaction data related to the four ETH denominations of the Tornado Cash coin mixer contracts are obtained through the Etherscan API (Footnote 2).

We use the Error field to classify each transaction as a Success Transaction or an Error Transaction. Error Transactions are further categorized as Out of gas or Reverted according to the error type. For Success Transactions, we decode the input field using the contract ABI provided by Etherscan and categorize the transactions as Deposit, Withdraw or Other. On the basis of this classification, we remove the useless fields and store each transaction in the form of tx together with its type.
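
The classification step can be sketched as follows. This is only an illustration under stated assumptions: each record is taken to be a dictionary that already holds a decoded error type under the hypothetical key `error` and the decoded function name under the hypothetical key `method`; the actual field names returned by the Etherscan API and the ABI decoding itself are not shown.

```python
from collections import Counter

ERROR_TYPES = {"Out of gas", "Reverted"}


def categorize(record: dict) -> str:
    """Return the category of one pre-decoded transaction record."""
    error = record.get("error")
    if error in ERROR_TYPES:
        return error                 # Error Transaction, split by error type
    method = record.get("method")    # decoded function name (assumed field)
    if method == "deposit":
        return "Deposit"
    if method == "withdraw":
        return "Withdraw"
    return "Other"                   # e.g. contract creation


# Hypothetical sample records.
records = [
    {"error": None, "method": "deposit"},
    {"error": None, "method": "withdraw"},
    {"error": "Reverted", "method": "withdraw"},
    {"error": None, "method": None},
]
counts = Counter(categorize(r) for r in records)
```

Counting the categories per denomination in this way yields the kind of summary reported in Table 2.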

Table 2 shows the categories and the corresponding number of transactions for each Ether denomination of the Tornado mixer after our processing, covering the period from the deployment of Tornado Cash in 2019 to May 17, 2021. In addition to the deposit and withdraw transactions for coin mixing, there are a small number of failed transactions, as well as individual transactions related to contract creation, etc.

Table 2. Tornado Cash transactions details about ETH.

Figure 3 shows the percentage of transactions for each denomination of Tornado mixer. As is illustrated in Fig. 3, the 10ETH mixer has the largest number of transactions, while the 100ETH mixer has the least.

4.3 Transaction Patterns

From the analysis of the transaction data, we found that several transactions are often created within a short period of time \(\delta \); we call such a group a small transaction set \(\mathcal {TX}_i\). The time interval \(\varDelta \) between different sets \(\mathcal {TX}_i\) is much larger than the time interval \(\delta \) within a set. Besides, during data processing, we discovered that several users in Tornado used the same address to deposit and withdraw. Comparing their transaction time intervals shows that the transactions in a small set are created by the same user. In other words, a user tends to create transactions within a short period of time. This suggests that the different addresses appearing in the transactions of a small transaction set may be controlled by the same user.
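
The notion of a small transaction set can be illustrated by splitting a list of timestamps wherever the gap between consecutive transactions exceeds a chosen bound. The function name `split_into_bursts` and the parameter `max_gap` are hypothetical, and the value 180 in the example is only a placeholder for \(\delta \).

```python
from typing import List


def split_into_bursts(timestamps: List[int], max_gap: int) -> List[List[int]]:
    """Group timestamps so that consecutive ones in a group differ by <= max_gap."""
    bursts: List[List[int]] = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # still inside the current small set
        else:
            bursts.append([ts])     # large gap: start a new small set
    return bursts


bursts = split_into_bursts([0, 60, 100, 10000, 10050], max_gap=180)
print(bursts)  # [[0, 60, 100], [10000, 10050]]
```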

Fig. 3. Proportion of Tornado mixer transactions by denomination.

From our analysis, we summarize two kinds of user transaction behavior patterns, as defined below:

Definition 4

(Single Deposit-Withdraw Coin Mixing Pattern). The user initiates a deposit transaction \(\mathbf {d}\) of N.ETH using address a. After an interval \(\delta \), the user creates a withdraw transaction \(\mathbf{w}\) of N.ETH using address b. The above pattern is defined as \(\mathbf {pattern}\; \mathbf {I}\): \(\langle \mathbf{d},\mathbf{w},\delta \rangle \), where \(\delta =\mathbf {w.ts}-\mathbf {d.ts}\), and \(\mathbf{d}\), \(\mathbf{w}\) satisfy the following conditions:

  • \(\mathbf {d.from}=a\);

  • \(\mathbf {w.input.recipient}=b\);

  • \(\mathbf {d.to}=\mathbf {w.to}\).

Definition 5

(Multi-Deposit and Multi-Withdraw Coin Mixing Pattern). The user creates n (\(n \ge 2\)) deposit transactions \(\mathcal {D}=\{\mathbf {d_1},\mathbf {d_2},\ldots ,\mathbf {d_n}\}\) of N.ETH using the address set \(\mathcal {A}=\{a_1,a_2,\ldots ,a_n\}\). After an interval \(\varDelta \), the user creates n withdraw transactions \(\mathcal W = \{\mathbf {w_1},\mathbf {w_2},\ldots ,\mathbf {w_n}\}\) of N.ETH using the address set \(\mathcal {B}=\{ b_1,b_2,\ldots ,b_n\}\). The above pattern is defined as \(\mathbf {pattern} \; \mathbf {II}\): \(\langle \delta _d, \mathcal D, \delta _w, \mathcal W, \varDelta , n\rangle \), where \(\delta _d = \max \{ \mathbf {d_{i+1}.ts}-\mathbf {d_i.ts} \mid \mathbf {d_i}, \mathbf {d_{i+1}}\in \mathcal D\}\), \(\delta _w=\max \{ \mathbf {w_{i+1}.ts}-\mathbf {w_i.ts} \mid \mathbf {w_i},\mathbf {w_{i+1}}\in \mathcal W\}\), \(\varDelta = \mathbf {w_1.ts}-\mathbf {d_n.ts}\), and the transactions in \(\mathcal D\), \(\mathcal W\) satisfy the following conditions:

  • \(\forall \mathbf {d_i}\in \mathcal D\), \(\mathbf {d_i.from}\in \mathcal A\);

  • \(\forall \mathbf {w_i}\in \mathcal W\), \(\mathbf {w_i.input.recipient}\in \mathcal B\);

  • \(\forall \mathbf {tx_i},\mathbf {tx_j} \in \mathcal D\cup \mathcal W\), and \(i \not = j\), \(\mathbf {tx_i.to}=\mathbf {tx_j.to}\).
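
To make the three intervals of Definition 5 concrete, the following sketch computes \(\delta _d\), \(\delta _w\) and \(\varDelta \) from the timestamps of n deposits and n withdrawals in the same mixer. The function name and the sample timestamps are hypothetical.

```python
from typing import List, Tuple


def pattern_two_intervals(deposit_ts: List[int],
                          withdraw_ts: List[int]) -> Tuple[int, int, int]:
    """Return (delta_d, delta_w, Delta) for n >= 2 deposits and n withdrawals."""
    d, w = sorted(deposit_ts), sorted(withdraw_ts)
    if len(d) < 2 or len(d) != len(w):
        raise ValueError("pattern II needs n >= 2 deposits and n withdrawals")
    delta_d = max(b - a for a, b in zip(d, d[1:]))   # largest gap between deposits
    delta_w = max(b - a for a, b in zip(w, w[1:]))   # largest gap between withdrawals
    big_delta = w[0] - d[-1]                         # w_1.ts - d_n.ts
    return delta_d, delta_w, big_delta


print(pattern_two_intervals([0, 30, 90], [4000, 4020, 4100]))  # (60, 80, 3910)
```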

These transaction patterns reflect the fact that some users rely entirely on the tool: they believe that Tornado Cash achieves address unlinkability without any further consideration on their part, and are therefore eager to withdraw immediately after depositing. We can thus analyze the time intervals among coin mixer transactions to link the different addresses owned by the same user.

5 Heuristic Cluster Rules

In this section, we use statistical analysis to derive heuristic cluster rules that link the addresses belonging to the same user.

5.1 Heuristics

Figure 4 shows the statistics of the time intervals for the previously identified users who used the same address to initiate coin mixing transactions, grouped by the transaction patterns defined in Sect. 4.

Fig. 4. The interval time statistic based on the transaction patterns.

As shown in the green block in Fig. 4, the time interval \(\delta \) in pattern \(\mathbf {I}\), where \(\mathbf {d}\) and \(\mathbf {w}\) are created by the same user, is generally no more than 180 s. This indicates that a subset of users of the Tornado mixer prefer to deposit and withdraw within a short period of time. This interval \(\delta \) is much smaller than the average transaction interval of about 2 h that is common in the Tornado transaction set.
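
The observation above suggests linking a depositing address to a withdrawal recipient when the withdrawal follows the deposit within about 180 s in the same mixer. The sketch below is only an illustration of that idea and not a restatement of the clustering rule itself: the function name is hypothetical, and the choice to emit a link only when exactly one deposit falls inside the window is an assumption made here to avoid ambiguous matches.

```python
from typing import List, Tuple


def link_quick_withdrawals(deposits: List[Tuple[str, int]],
                           withdrawals: List[Tuple[str, int]],
                           max_delta: int = 180) -> List[Tuple[str, str]]:
    """Link (deposit address, recipient) pairs separated by at most max_delta seconds.

    deposits    -- (d.from, d.ts) pairs of one mixer contract
    withdrawals -- (w.input.recipient, w.ts) pairs of the same contract
    """
    links = []
    for recipient, w_ts in withdrawals:
        candidates = [addr for addr, d_ts in deposits
                      if 0 < w_ts - d_ts <= max_delta]
        # Assumption: only an unambiguous candidate produces a link.
        if len(candidates) == 1 and candidates[0] != recipient:
            links.append((candidates[0], recipient))
    return links


links = link_quick_withdrawals([("a1", 0), ("a2", 5000)],
                               [("b1", 120), ("b2", 9000)])
print(links)  # [('a1', 'b1')]
```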

Also, when analyzing the time interval in pattern \(\mathbf {I}\), we found that the same user may create multiple tuples \(\langle \mathbf{d},\mathbf{w},\delta \rangle \) successively. Their interval distribution is shown in the yellow blocks in Fig. 4.

\(\delta _{dw}\) denotes the maximum \(\delta _i\) among the multiple transaction tuples \(\{\langle \mathbf{d}_1,\mathbf{w}_1,\delta _1\rangle , \langle \mathbf{d}_2, \mathbf{w}_2,\delta _2\rangle , \ldots , \langle \mathbf{d}_n,\mathbf{w}_n,\delta _n\rangle \}\), which is presented as the pure yellow block in Fig. 4; \(\delta _{wd}\) denotes the maximum time interval between successive \(\langle \mathbf{d},\mathbf{w},\delta \rangle \) tuples, which is presented as the dotted yellow block in Fig. 4.

It can be seen that \(\delta _{dw}\) and \(\delta _{wd}\) of multiple \(\langle \mathbf{d},\mathbf {w},\delta \rangle \) tuples are larger than the time interval \(\delta \) of a single \(\langle \mathbf{d},\mathbf{w},\delta \rangle \) tuple, at about 20 min.
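
The two quantities can be computed as in the sketch below. It assumes that each tuple is given as a pair of timestamps `(d.ts, w.ts)` and reads \(\delta _{wd}\) as the gap between the withdrawal of one tuple and the deposit of the next; both the function name and this reading of "between tuples" are assumptions.

```python
from typing import List, Tuple


def successive_tuple_intervals(pairs: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (delta_dw, delta_wd) for successive (d.ts, w.ts) pairs of one user."""
    pairs = sorted(pairs)
    if len(pairs) < 2:
        raise ValueError("at least two deposit-withdraw tuples are required")
    # Largest deposit-to-withdraw interval inside any tuple.
    delta_dw = max(w_ts - d_ts for d_ts, w_ts in pairs)
    # Largest gap from one tuple's withdrawal to the next tuple's deposit.
    delta_wd = max(next_d - w_ts
                   for (_, w_ts), (next_d, _) in zip(pairs, pairs[1:]))
    return delta_dw, delta_wd


print(successive_tuple_intervals([(0, 100), (400, 520), (1500, 1560)]))  # (120, 980)
```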

The distribution of the time intervals \(\delta _d\), \(\delta _w\) and \(\varDelta \) for the same user in pattern \(\mathbf {II}\) \(\langle \delta _d , \mathcal D , \delta _w, \mathcal W, \varDelta , n\rangle \) is shown in the pink blocks in Fig. 4. The pink slashed blocks and vertical blocks are the maximum time intervals between deposit transactions, \(\delta _d\), and between withdraw transactions, \(\delta _w\), respectively; the pure pink block is the time interval \(\varDelta \). Users tend to create a series of \(\mathbf{d}\) in \(\mathcal D\) and the same number of \(\mathbf{w}\) in \(\mathcal W\), with small \(\delta _d\) and \(\delta _w\). However, the time interval \(\varDelta \) between \(\mathcal D\) and \(\mathcal W\) is generally longer and more irregular.

In particular, the interval threshold \(\varDelta \) between \(\mathcal D\) and \(\mathcal W\) is related to the number of transactions n. As n increases, the number of pattern \(\mathbf {II}\) instances decreases, so the interval threshold between \(\mathcal D\) and \(\mathcal W\) should be raised appropriately. In this sense, the threshold \(\varDelta \) is proportional to the number of transactions n.

5.2 Evaluation

We implement a proof of concept of the three proposed clustering rules on the Tornado coin mixer and conduct experiments to verify the effectiveness of our rules. The program is written in Python and run in a Python 3.6 environment on Windows 10, with a 2.5 GHz Intel Core i5-7200U CPU and 12 GB of RAM.

The results of the experiment are shown in Table 3, where userNum and addrNum represent the number of clustered user entities and the total number of clustered addresses, respectively.

Table 3. Result of Heuristic 1–3.

As can be seen in Table 3, Heuristic 1 yields the largest number of clusters and clustered addresses, with the 0.1ETH denomination having the highest number, reaching 1073. After combining all the clustered results, we obtain 2734 addresses related to 1168 user entities. For Heuristic 2 and Heuristic 3, the highest numbers of clusters occur in the 1ETH and 10ETH mixers. The 100ETH mixer has the highest degree of clustering under Heuristic 2, with an average of 3.6 addresses per user entity.
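
Combining the results of several rules amounts to merging pairwise address links into connected groups. One common way to do this is a union-find structure, sketched below; this is an assumption about how such a merge could be done, not a description of the implementation used in the experiments, and the names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def merge_links(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Merge pairwise address links into clusters (one cluster per user entity)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # union the two groups

    clusters: Dict[str, Set[str]] = {}
    for addr in list(parent):
        clusters.setdefault(find(addr), set()).add(addr)
    return list(clusters.values())


clusters = merge_links([("a", "b"), ("b", "c"), ("x", "y")])
user_num = len(clusters)                    # number of user entities
addr_num = sum(len(c) for c in clusters)    # number of clustered addresses
```

With the three sample links, the addresses a, b and c end up in one cluster and x and y in another, so `user_num` is 2 and `addr_num` is 5.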

In the experiments, we notice a particular case in which the destination address satisfies \(\mathbf {w.input.recipient} \not = \mathbf {w.from}\) while \(\mathbf {w.input.fee} = 0\). If a relayer had forwarded the withdraw transaction, it would be unreasonable for it to pay the gas fee in advance without receiving any forwarding fee. Thus we can infer that the two addresses are likely to be controlled by the same entity. In total, 95 user entities across the four mixers exhibit this case, involving 566 related addresses.
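
This case can be detected with a simple filter over the decoded withdraw transactions. The sketch assumes each withdrawal is a dictionary with the hypothetical keys `from`, `recipient` and `fee`, holding \(\mathbf {w.from}\), \(\mathbf {w.input.recipient}\) and \(\mathbf {w.input.fee}\).

```python
from typing import Dict, List, Tuple


def zero_fee_links(withdrawals: List[Dict]) -> List[Tuple[str, str]]:
    """Return (w.from, recipient) pairs where the fee is 0 but the addresses differ."""
    return [(w["from"], w["recipient"])
            for w in withdrawals
            if w["fee"] == 0 and w["recipient"] != w["from"]]


# Hypothetical sample withdrawals.
sample = [
    {"from": "0xS1", "recipient": "0xR1", "fee": 0},   # suspicious: linked
    {"from": "0xRelayer", "recipient": "0xR2", "fee": 7},  # ordinary relayer use
    {"from": "0xR3", "recipient": "0xR3", "fee": 0},   # direct withdrawal
]
print(zero_fee_links(sample))  # [('0xS1', '0xR1')]
```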

The experiment reveals that users’ own behavior hinders the privacy protection they desire. Indeed, a large proportion of users think that using coin mixing tools unconditionally protects their privacy. Unfortunately, a short time interval between depositing and withdrawing the coins exposes the users’ transaction patterns, which leaks the linkability of the addresses they control. Therefore, for better privacy, we suggest that users avoid withdrawing immediately after depositing their funds and avoid making multiple deposits and withdrawals of the same size with one address.

6 Conclusion and Future Work

This paper presents the first systematic analysis of privacy issues in Tornado Cash. A macro analysis of the transactions in the Tornado Cash ETH coin mixer is performed. Based on the transaction time intervals, two transaction patterns are formalized and three heuristic address clustering rules are proposed. The experimental results indicate that the presented methodology can reveal address linkability in the Tornado Cash ETH coin mixer.

In future work, we can also apply the proposed methodology to other tokens with different transaction patterns.