1 Introduction

A blockchain provides a secure and decentralized infrastructure for the execution and record-keeping of digital transactions. A programmable blockchain platform supports smart contracts, which are stateful, general-purpose computer programs that enable the development of blockchain-powered applications. Ethereum is currently the most popular programmable blockchain platform. In Ethereum, these blockchain-powered applications are known as decentralized applications (ÐApps).

Transactions are the means through which one interacts with Ethereum. Ethereum defines two types of transactions: regular transactions and contract transactions. Regular transactions enable transfers of cryptocurrency (Ether, or simply ETH) between end-user accounts. In turn, contract transactions invoke functions defined in the interface of a smart contract. When an end-user triggers a certain functionality on the frontend of a ÐApp (e.g., checking out the items in the shopping cart of an online shopping application), such a request is translated into one or more blockchain contract transactions. ÐApps themselves facilitate these transactions by submitting them to the blockchain (e.g., Ethereum). That is, end-users are completely unaware that a blockchain is in use, as transactions are handled entirely by the ÐApp.

In Ethereum, all transactions must be paid for, including contract transactions. However, transaction prices are not prespecified. Transaction issuers need to decide how much they wish to pay for each transaction by assigning the gas price parameter, paid in Ether (ETH). The more one pays for a transaction, the faster it is generally processed. This is because there is a financial incentive that mining nodes (i.e., those who choose, prioritize, and process all transactions on Ethereum) receive for transactions that they process and verify. This incentive is a function of the gas prices that are associated with the transactions they choose to process. As a result, the execution of code on programmable blockchains such as Ethereum takes time, as the corresponding transaction(s) must first be chosen by the mining node that will append the next block to the blockchain. Consistently setting up transactions with high gas prices is not an option, since it would render the ÐApp economically unviable. Hence, for a ÐApp to be profitable, it needs to issue transactions with the lowest gas price that fulfills a predefined Quality of Service (QoS) (e.g., transactions processed within 1 minute in 95% of the cases). To achieve such a goal, ÐApp developers make use of processing time estimation services.

Although existing processing time estimation services are useful, they offer little to no insight into what drives transaction processing times in Ethereum (except for gas prices). This happens either because the underlying machine learning model that is used by these services is a complete black-box (e.g., Etherscan’s Gas Tracker) or because the service does not clearly indicate the features that drive the model’s decisions (e.g., EthGasStation). Ultimately, the processing time estimations made by these services impact the gas price choices made by ÐApp developers for their transactions. We thus argue that these services should be interpretable. Without explanations or insights into the underlying models, the estimations provided by these services lack reliability, robustness, and trustworthiness. In practice, this means ÐApp developers struggle to define proper gas prices and manage transaction submission workloads (e.g., decide the best time to submit a given set of transactions). An in-depth investigation of the performance of these popular estimation services can be found in our prior work (Pacheco et al. 2022).

Building an explainable transaction processing time model entails discovering and understanding the factors that influence (or are at least correlated with) processing times. As a first step towards addressing this challenge, in this study we set out to determine and analyze the factors that are associated with transaction processing times in Ethereum. In the following, we list our research questions and the key results that we obtained:

RQ1: How well do blockchain internal factors characterize transaction processing times? To answer this RQ, we first build a random forest model with an extensive set of features that capture internal factors of the Ethereum blockchain (e.g., the difficulty to mine a block, or the characteristics of contracts and transactions). These features are derived directly from the contextual, behavioral, and historical characteristics associated with 1.8M transactions. Next, we interpret the model using the adjusted R2 and SHAP values.

Our model achieves an adjusted R2 of 0.16, indicating that blockchain internal factors explain little of the variance in processing times.

RQ2: How well does gas pricing behavior characterize transaction processing times? Similar to the stock market, the Ethereum blockchain ecosystem is impacted by a plethora of external factors, including speculation, the global geopolitical situation, and cryptocurrency regulations. Many of these factors are largely unpredictable and difficult to capture in the form of engineered features. Yet, all these factors are known to impact the gas pricing behavior of a large portion of transaction issuers. Hence, in RQ2 we build a model that aims to account for the gas pricing behavior of transaction issuers (i.e., the mass behavior). We do so by engineering additional features that indicate how the price of a given transaction compares to that of recently submitted transactions.

Our model achieves an adjusted R2 score of 0.53 (2.3x higher than the baseline model), indicating that it explains a substantially larger amount of the variance in processing times. The most important feature of our model is the median percentage of transactions processed per block, in the previous 120 blocks (~30min ago), with a gas price below that of the current transaction.

Paper organization

The remainder of this paper is organized as follows. Section 2 defines the key concepts employed throughout this study. Section 3 provides a motivating example that emphasizes the importance of having explainable transaction processing time predictions. Section 4 describes our study methodology, including the data collection, our engineered feature-set, and the model construction. Sections 5 and 6 show the motivation, approach, and findings associated with the research questions addressed in this paper. Section 7 discusses the implications of our findings. Section 8 discusses related work. Section 9 discusses the threats to the validity of our study. Finally, Section 10 states our concluding remarks.

2 Background

In this section, we define the key concepts of the Ethereum blockchain that are employed throughout our study.

Blockchain

A blockchain is commonly conceptualized as a ledger which records all transaction-related activity that occurs between peers on the network. This activity data is stored in blocks, a data structure that holds a set of transactions. Each of these transactions can be identified by an identifier called the transaction hash. Blocks are appended to one another to form a linked-list-like structure. A block stored in the blockchain becomes immutable once a specific number of blocks has been appended after it. This is because changing the information within a specific block requires changing all blocks appended after it as well.

Transactions

Transactions allow peers to interact with each other in the context of a blockchain. There are two types of transactions used in the Ethereum blockchain: user transactions and contract transactions. User transactions are employed to transfer cryptocurrency (known as Ether in Ethereum) to another user, who is identified by a unique address. Contract transactions allow users (e.g., ÐApp developers) to execute functions defined in smart contracts. Smart contracts are immutable, general-purpose computer programs deployed onto the Ethereum blockchain.

Gas and transaction fees

Transactions sent within the Ethereum blockchain require a transaction fee to be paid before being sent. This fee is calculated as gas usage × gas price. The gas usage term refers to the amount of computing power consumed to process a transaction, measured in gas units. This value is constant for user transactions, which consume 21,000 gas units. For contract transactions, this value depends on the bytecode executed within the smart contract, as each instruction has a specific amount of gas units required for it to execute (Wood 2019). The gas price term defines the amount of Ether the sender is prepared to pay per unit of gas consumed to process their transaction. This parameter is set by the sender, and allows users to incentivize the processing of their transaction at a higher priority (Signer 2018), as miners who process the transaction are rewarded based on the transaction fee value.
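To make the fee formula concrete, the following minimal Python sketch computes a fee in ETH from a transaction's gas usage and gas price. The gas price value used is illustrative; only the 21,000-gas figure for a plain Ether transfer comes from the text above.

```python
# Minimal sketch: Ethereum transaction fee = gas usage x gas price.
GWEI_PER_ETH = 1_000_000_000  # 1 ETH = 1e9 GWEI

def transaction_fee_eth(gas_used: int, gas_price_gwei: float) -> float:
    """Return the fee in ETH, given gas usage (in gas units) and gas price (in GWEI per unit)."""
    fee_gwei = gas_used * gas_price_gwei
    return fee_gwei / GWEI_PER_ETH

# A plain user transaction consumes 21,000 gas; at an (illustrative) gas price of 50 GWEI:
print(transaction_fee_eth(21_000, 50))  # 0.00105 ETH
```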

Transaction processing lifecycle

The lifecycle of a transaction transitions through various states, which are illustrated in Fig. 1. The first state of the transaction processing lifecycle is when a user first submits their transaction t to the Ethereum blockchain (submitted state). Next, this transaction t is broadcasted through the network by a peer-to-peer based communication method. Once the first node discovers transaction t, the transaction enters the pending state as it enters the pending pool of transactions. As time passes, this transaction t is continuously discovered by an increasing number of nodes deployed on the network, which track and record all transaction-related activity, including transactions that are pending. Mining nodes will compete to win the Proof-of-Work (PoW) (Jakobsson and Juels 1999), which will allow the winner to append their new block b to the blockchain. If transaction t has been processed and recorded in block b, transaction t then transitions into the processed state (i.e., at the block timestamp of b). Transaction t will transition into the next state, confirmed, once a specific number of blocks n has been appended after b. There is no universal agreement on what the value of n should be. For instance, n = 7 is stated in the Ethereum whitepaper (Buterin 2014). Considering that blocks are appended every 15s on average, n = 7 translates to approximately 1m 45s. Alternatively, some cryptocurrency exchanges define n as a higher value. For instance, CoinCentral uses n = 250, which translates to approximately 1h 2m 30s (Comben 2018).

Fig. 1: Transaction lifecycle in the Ethereum blockchain

Proof-of-work

Proof-of-Work (PoW) is a consensus protocol that requires nodes to solve a difficult mathematical puzzle. The protocol is designed such that no strategy for solving the puzzle is better than brute force (i.e., the only viable approach is to exhaustively try candidate solutions). Fortunately, verifying a solution is much cheaper in terms of computation than finding one. Ultimately, this allows nodes, entities within the network that have no prior knowledge of or trust in one another, to trust each other and build a valid transaction ledger without requiring a centralized third-party authority (e.g., a bank).

Next-generation ÐApps

Commonly, ÐApps expose the internal blockchain complexities to end-users, ultimately requiring that they install a wallet (e.g., Metamask). A wallet is a computer program (commonly a web browser extension) that enables one to submit transactions in a blockchain without having to set up their own node. However, submitting transactions with a wallet still requires end-users to set up transaction parameters. Such a task is error-prone by construction (Oliva and Hassan 2021), as not all end-users are familiar with blockchain concepts and their intricacies. Other problems include end-users getting confused with the user interface of wallets or even making typos while inputting the transaction parameter values. As an illustrative example, an Ethereum user called Proudbitcoiner spent 9,500 USD in transaction fees to send just 120 USD to a smart contract. The user said the following (gas limit and gas price are two transaction parameters and 1 GWEI = 1E-9 ETH):

“Metamask didn’t populate the ’gas limit’ field with the correct amount in my previous transaction and that transaction failed, so I decided to change it manually in the next transaction (this one), but instead of typing 200000 in “gas limit” input field, I wrote it on the “Gas Price” input field, so I paid 200000 Gwei for this transaction and destroyed my life.”

Due to incidents such as the one above, alongside end-user onboarding barriers, more recently ÐApp developers have started to engineer their applications using a new paradigm, which hides all blockchain complexities from end-users. In this new paradigm, the burden of setting up transaction parameters is on the ÐApp developers. ÐApp developers, in turn, can charge end-users with a flat fee (e.g., payable via credit card), such that they do not need to hold Ether or use wallets, ultimately easing their onboarding process. We refer to these ÐApps as the next-generation ÐApps (Oliva and Hassan 2021). Next-generation ÐApps are a reality and include ÐApps such as MyEtherWallet.

3 Motivating Example

Assume that some ÐApp development organization (e.g., a company) wants to convert their web application into an Ethereum ÐApp. In order to ensure a high quality end-user experience, the company will submit contract transactions itself (instead of pushing this burden to end-users) (Oliva and Hassan 2021). Since end-users will be unhappy if transactions take too long to be processed, the company defines the following quality of service (QoS): 90% of all transactions shall be processed within 5 minutes. Considering that transactions cost ETH to be sent, this QoS would be easily achievable if the company’s budget were infinite. Unfortunately, this is not the case for any real-world ÐApp development organization. In a real-world scenario, we assume that organizations aim to provide such a QoS while also ensuring that the gas prices for their transactions are as low as possible. We refer to this balance as the time-cost balance.

Organizations can leverage processing time estimation services to help them achieve a proper time-cost balance. For example, assume a developer (John) is chosen to implement the code that is responsible for submitting the ÐApp’s transactions. Just before submitting a transaction, John requests a lookup table from his favorite estimation service (e.g., EthGasStation). A lookup table contains processing time predictions for every possible value of gas price. This table is calculated by the estimation service based on recently processed transactions. John then attempts to implement an algorithm that processes the lookup table and selects the <gas price, estimated processing time> pair that most closely aligns with the predefined QoS (given the actual processing times of his previously submitted transactions). John quickly realizes that implementing such an algorithm is very hard and error-prone because estimation services offer little to no explanation for their predictions. For instance, assume that John needs to submit transactions shortly, but gas prices are high overall. Should John wait a little in the hope that prices will come down or should he submit the transactions right away? It is difficult to answer this question without knowing the explanations behind the predicted processing times. Assume that John knew that network congestion level was the key factor driving the predicted processing times shown in his current lookup table. In that case, John would be able to analyze the duration of past network congestions to make a more informed decision regarding the best time to submit his transactions. However, in reality, John does not even know if one or more factors are influencing transaction processing times. As a result, it becomes nearly impossible for John to design more advanced algorithms for transaction pricing. John actually struggles to manage transaction submission workloads and meet the predefined QoS. Ultimately, the lack of explainable estimation services prevents John from finding the optimal time-cost balance for the company’s ÐApp.
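As a purely hypothetical illustration of the selection logic that John attempts to implement (the function name and lookup-table values below are ours, not those of any real estimation service), such an algorithm might look as follows:

```python
# Hypothetical sketch: pick the cheapest gas price whose estimated processing time meets the QoS.
from typing import Dict, Optional

def pick_gas_price(lookup_table: Dict[float, float], qos_minutes: float) -> Optional[float]:
    """lookup_table maps gas price (GWEI) -> estimated processing time (minutes)."""
    candidates = [price for price, est_time in lookup_table.items() if est_time <= qos_minutes]
    return min(candidates) if candidates else None  # None: no price in the table meets the QoS

lookup = {30.0: 12.0, 45.0: 6.5, 60.0: 4.0, 90.0: 1.5}  # illustrative values only
print(pick_gas_price(lookup, qos_minutes=5.0))  # 60.0
```

Even with such a selector in place, John's core problem remains: the table offers no explanation of why the estimates are what they are, so he cannot reason about whether waiting would yield cheaper prices.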

The aforementioned motivating example shows that there is a need for some unified, free, and open-source alternative to the existing estimation services that would provide ÐApp developers such as John with explainable transaction processing time predictions. As a first step towards addressing this need, in this paper we study the factors that are associated with transaction processing times in Ethereum.

4 Methodology

We build a machine learning model with a comprehensive feature-set in order to determine the factors that are strongly associated with the processing times of transactions in Ethereum. In this section, we explain our approaches to data collection (Section 4.1), feature engineering (Section 4.2), and model construction (Section 4.3).

4.1 Data Collection

In this section, we describe the data sources that we used (Section 4.1.1) and the approach that we followed to collect the data (Section 4.1.2).

4.1.1 Data Sources

Our study relies on two data sources, namely Etherscan and Google BigQuery:

Etherscan

Etherscan is a popular real-time dashboard for the Ethereum blockchain platform. Etherscan has its own nodes in the blockchain, which are used to actively monitor the activity of the network in real time. This includes monitoring various details regarding transactions, blocks, and smart contracts. We use the data from this website to obtain transaction processing times (our dependent feature) and to engineer several features used in our model (our independent features, such as pending pool size and network utilization level). Data from Etherscan has been used in several blockchain empirical studies (Oliva et al. 2020; Oliveira et al. 2021; Pacheco et al. 2022; Pierro and Rocha 2019).

Google BigQuery

Google BigQuery is an online data warehouse that supports querying through SQL. In particular, Google also actively maintains several public datasets, including an Ethereum dataset. This dataset is updated daily and contains metadata for several blockchain elements, including transactions, blocks, and smart contracts. Most of the features that we engineer rely on data collected from this dataset.

4.1.2 Approach

Figure 2 provides an overview of our data collection approach. In the following, we discuss each of the data collection steps.

(Step-1): Draw a statistically representative sample of blocks for each day within our data collection period. To bring the data size to more manageable levels, we draw a statistically representative sample of mined blocks for each day within a one-month period. To draw the samples, we use a 95% confidence level and a ±5% confidence interval, similarly to several prior studies (Tagra et al. 2022; Zhang et al. 2019; Boslaugh and Watters 2008). Then, for each day in our time period, we calculate the statistically representative sample size and sample the corresponding amount of blocks. This approach results in a sample of 362 blocks per day on average, and a total of 10,865 blocks.
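As a sketch of how such a daily sample size can be derived, we assume the common Cochran formula with a finite population correction; the ~5,760 blocks/day figure simply follows from the ~15s average block interval mentioned in Section 2.

```python
# Sketch: statistically representative sample size at a 95% confidence level and +-5% margin.
import math

def sample_size(population: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)      # infinite-population estimate (~384)
    n = n0 / (1 + (n0 - 1) / population)             # finite population correction
    return math.ceil(n)

# Roughly 5,760 blocks are mined per day at ~15s per block:
print(sample_size(5_760))  # ~361 blocks, in line with the ~362 blocks/day reported above
```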

(Step-2): Retrieve the hashes of each transaction included in the sampled blocks. We proceed by collecting and saving the transaction hashes of all transactions that have been included within the sampled blocks. The transaction hash is a unique identifier used to differentiate between transactions. We use these transaction identifiers in several of the data collection steps to collect various metadata associated with each of these transactions. We collect the transaction hash of all 1,825,042 (1.8M) transactions included in the 10,865 blocks sampled in the previous step.

(Step-3): Collect the processing time for each transaction. Next, we collect the processing time provided by the Etherscan platform. For a given transaction t, the processing time is the estimated difference between the time at which t was included in a block appended to the blockchain and the time at which the user submitted t to the blockchain. We retrieve this data by using the transaction hash of each transaction in our set to access the corresponding processing time details provided by Etherscan (referred to as confirmation time by Etherscan).

(Step-4): Retrieve additional metadata for transactions, blocks, and contracts. We then leverage Google BigQuery to retrieve additional information regarding transactions, including information about the blocks and smart contracts they are associated with (when applicable). Examples of the metadata collected from this source include gas price, block number, and network congestion. These metadata are obtained by querying the transactions, blocks, and contracts tables.

(Step-5): Retrieve historical data regarding contextual aspects of the Ethereum blockchain. In addition to the processing time, the Etherscan platform provides historical data that describes contextual information about the Ethereum blockchain. In particular, we collect data that describes: 1) the number of transactions in the pending pool, and 2) the network utilization. We then map these data to each transaction, associating the relevant context of the blockchain with the time at which each transaction was submitted.

Fig. 2: An overview of our data collection approach. The dashed lines represent a connection to a data source

4.2 Feature Engineering

To understand the features that impact the transaction processing times, we collect and engineer a variety of features. We explain the rationale behind focusing on these dimensions, as well as the rationale for designing each of the features in our set. Features were carefully crafted based on our scientific (Oliva et al. 2020; Kondo et al. 2020; Oliva and Hassan 2021; Zarir et al. 2021) and practical knowledge of Ethereum (e.g., buying/selling/trading ETH and watching the cryptocurrency market fluctuations). We list each of the features that we use for building our models in Tables 1 and 2.

  • 1. Features that capture internal factors. We first engineer features that aim to capture internal factors of the Ethereum blockchain (RQ1), which are summarized in Table 1. These features span a few dimensions, including the contextual, behavioral, and historical dimensions. We explain these dimensions below.

  • Contextual. Contextual data describes the circumstances in which a given transaction is being processed. This information can be associated with the entire blockchain network, and also with the transactions themselves. For example, the network utilization at the time a transaction was executed is a contextual feature. The network utilization should help provide indicators of changes to transaction processing times, such as an increase in network traffic delaying processing times. Similarly, a feature that identifies the smart contract with which a transaction interacted also provides contextual insights.

  • Behavioral. Behavioral features capture information related to the user and the particular values that were chosen for each of the transaction parameters. Depending on the experience of the user who is sending the transaction, the values set for these parameters might impact transaction processing times. Most notably, miners have the freedom to prioritize the transactions they process using their own algorithms. However, it is common for miners to prioritize transactions with higher gas price values, as the reward they receive for processing a particular transaction is based on this value. An inexperienced user who is not familiar with gas price values might set too low of a value, resulting in an extremely long processing time.

  • Historical. Historical features are based on historical data from transactions that have already been processed in the blockchain. Such metrics are widely used in machine learning for predictive modelling tasks, as data from the past is used to predict outcomes in the future. For example, a feature that describes the average processing time of processed transactions in the previous x blocks can be used by models to understand and predict processing times for similar transactions in the future.

  • 2. Features that capture changes in gas pricing behavior. We also engineer features that aim to track pricing behavior over time (RQ2). Our features primarily involve comparing the gas prices of transactions processed in recent blocks with that of the current transaction at hand. These features and their associated rationale are summarized in Table 2.

Table 1 The list of independent features used to capture internal factors of the Ethereum blockchain, along with their rationale. These features are used to build our models in Section 5 in order to predict transaction (tx) processing times. The ∙ symbol represents data retrieved from Etherscan, while the ⋆ symbol represents data retrieved from Google BigQuery. Braces ({}) group related features
Table 2 The list of independent features used to capture gas pricing behavior, along with their rationale. These features are used to build our models in Section 6 in order to predict transaction (tx) processing times. The ∙ symbol represents data retrieved from Etherscan, while the ⋆ symbol represents data retrieved from Google BigQuery. Braces ({}) group a set of related features

4.3 Model Construction

In this section, we describe our model construction steps. The goal of our constructed model is to explain how features related to internal factors (RQ1) and features related to gas pricing behavior (RQ2) influence transaction processing times. We also aim to determine which of the studied features influence transaction processing times the most. To do this, we design and follow a generalizable and extensible model construction pipeline to generate interpretable models, as outlined in prior studies (Lee et al. 2020; Mcintosh et al. 2016; Pacheco et al. 2022; Thongtanunam and Hassan 2020). Rather than focusing on generating models that are meant to predict transaction processing times with high to perfect levels of accuracy, we instead leverage machine learning as a tool to understand the impact of these features. Our model construction pipeline is summarized in Fig. 3. We explain each step in more detail below.

(Step-1): Log normalization. We observe that the distribution of transaction processing time is right skewed. Therefore, we log transform the data with the function log(x + 1) to reduce the skew. Furthermore, we also log transform all the independent features of the type numeric with an identical function to reduce the bias caused by outliers, similar to prior studies (Lee et al. 2020; Mcintosh et al. 2016; Menzies et al. 2007; Tantithamthavorn et al. 2016).
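A minimal sketch of this transformation (column names are illustrative):

```python
# Sketch: apply log(x + 1) to the dependent feature and to all numeric independent features.
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    out = df.copy()
    for col in numeric_cols:
        out[col] = np.log1p(out[col])  # log(x + 1) reduces right skew and the influence of outliers
    return out

# df = log_transform(df, ["processing_time", "gas_price_gwei", "tx_gas_limit"])
```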

(Step-2): Correlation analysis. Several prior studies show that the presence of correlated features in the data used to construct a machine learning model leads to unreliable and spurious insights (Midi et al. 2013; Mcintosh et al. 2016; Tantithamthavorn and Hassan 2018; Lee et al. 2020). Therefore, in this step we outline our process to remove correlated and redundant features from our data. To remove correlated features, we use the varclus function from the Hmisc package in R, which generates a hierarchical clustering of the provided features based on the correlation between them. We choose Spearman’s rank correlation (ρ) to determine the level of correlation between pairs, as rank correlation can assess monotonic relationships without assuming that they are linear. Similar to prior studies (Lee et al. 2020; Mcintosh et al. 2016; Tantithamthavorn and Hassan 2018), we use a threshold of |ρ| > 0.7 to deem a pair of features as correlated. For each pair of correlated features, we choose the feature that is easier to collect or calculate in practice. We summarize our correlation analysis in Table 3 (Appendix A). As a result of such an analysis, we removed a total of 41 correlated features.
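The analysis itself is performed with varclus in R; the sketch below is only a rough Python approximation of the same idea (hierarchically clustering features on |Spearman ρ| and flagging clusters that exceed the 0.7 threshold):

```python
# Approximate varclus-style correlation clustering (Python sketch, not the original R analysis).
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlated_feature_clusters(features: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    rho = features.corr(method="spearman").abs()        # |Spearman rho| for every feature pair
    distance = 1 - rho                                   # turn similarity into a distance
    condensed = squareform(distance.values, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=1 - threshold, criterion="distance")
    return pd.Series(labels, index=features.columns)     # features sharing a label are correlated

# clusters = correlated_feature_clusters(features_df)  # keep one feature per cluster
```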

(Step-3): Redundancy analysis. In addition to correlated features, redundant features may also negatively impact the generated explanations of a machine learning model (Midi et al. 2013; Mcintosh et al. 2016; Tantithamthavorn and Hassan 2018; Lee et al. 2020). Hence, we choose to remove redundant features. To do so, we use the redun function from the Hmisc package in R. This function builds several linear regression models, each of which uses one of the independent features as its dependent feature. It then calculates how well each selected dependent feature can be explained by the remaining independent ones, through the R2 measure. It then drops the most well-explained features in a stepwise, descending order. This feature removal process is repeated until either: 1) none of the remaining features can be predicted by the models with an R2 greater than a set threshold, or 2) removing a feature would cause a previously removed feature to be explained at a level below the threshold. In our study, we use the default R2 threshold of 0.9. As a result of such an analysis, we removed a total of 1 redundant feature, namely std_pct_below_120.
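For illustration only, the sketch below re-implements the core idea of this analysis in Python. It omits the second stopping condition and other refinements of the redun function in R, so it is not a faithful reproduction:

```python
# Rough sketch of redundancy analysis: iteratively drop the feature best explained by the others.
import pandas as pd
from sklearn.linear_model import LinearRegression

def drop_redundant(features: pd.DataFrame, r2_threshold: float = 0.9):
    kept = list(features.columns)
    dropped = []
    while len(kept) > 1:
        r2 = {}
        for target in kept:
            predictors = [f for f in kept if f != target]
            model = LinearRegression().fit(features[predictors], features[target])
            r2[target] = model.score(features[predictors], features[target])
        worst = max(r2, key=r2.get)                 # feature most explained by the remaining ones
        if r2[worst] <= r2_threshold:
            break
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped
```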

(Step-4): Model construction. At this step, we fit our random forest model using the RandomForestRegressor function from the scikit-learn package in Python. We choose a random forest model as it provides an optimal balance between predictive power and explainability. Such models are equipped to fit data better than simple regression models, potentially leading to more robust and reliable insights (Lyu et al. 2021). Although they are not inherently explainable, several feature attribution methods can and have been used to derive explanations from these models (Breiman 2001; Lundberg et al. 2018; Tantithamthavorn and Jiarpakdee 2021). Random forests have also been widely used in empirical software engineering research to derive insights about the data (Rajbahadur et al. 2021; Tantithamthavorn and Jiarpakdee 2021; Tantithamthavorn et al. 2020). Historically, these models have shown good performance results for software engineering tasks (Yatish et al. 2019; Bao et al. 2021; Fan et al. 2020; Aniche et al. 2020). Finally, the specific random forest implementation that we choose is natively compatible with the shap Python package. The latter implements the SHAP interpretation technique, which we employ to address our research questions (Sections 5 and 6).
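A minimal sketch of the model fit (the hyperparameter values shown are placeholders; the actual values are tuned as described in Section 4.4):

```python
# Sketch: fit the random forest regressor used throughout our study.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,       # illustrative; tuned via random search (Section 4.4)
    max_depth=None,         # illustrative; tuned via random search (Section 4.4)
    max_features="sqrt",    # illustrative; tuned via random search (Section 4.4)
    random_state=42,
    n_jobs=-1,
)
# model.fit(X_train, y_train)  # X_train: engineered features, y_train: log-transformed times
```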

Fig. 3: A summary of our model construction approach

4.4 Model Validation

We use the adjusted R2 to evaluate how well our models can predict transaction processing times. To compute the adjusted R2 of our constructed random forest model, we set up a model validation pipeline that splits data into train, test, and validation sets (Fig. 4). Our train dataset needs to be as large as possible, otherwise our model cannot identify possible seasonalities in the data (e.g., evaluate the influence of the day of the week on transaction processing times) nor changes in the blockchain context (e.g., network conditions are unlikely to change in small periods of time). We thus define the training set as the first 28 days of data, the validation set as the 29th day, and the test set as the final day. We split the data along its temporal nature to avoid data leakage (Lyu et al. 2021). We then use the training set to generate 100 bootstrap samples (i.e., samples with replacement and of the same length as the training set). For each bootstrap sample, we first fit a random forest model on that sample. Next, we use the validation set to hyperparameter-tune that model, as recommended by prior studies, to ensure that our constructed model fits the data optimally. To do this we utilize a Random Search algorithm, choosing the RandomizedSearchCV implementation (Bergstra and Bengio 2012) from scikit-learn. We tune the following hyperparameters: max_depth, max_features, and n_estimators. Once we discover the optimal hyperparameters, we retrain the model using the chosen parameters (on the same bootstrap sample) and evaluate it on the test set using the adjusted R2 measure. While prior studies typically measure the explanatory power of a constructed model by measuring its R2 score on the training set, we instead measure the performance of our hyperparameter-tuned model on a hold-out test set. Our reasoning is that random forest models are by default generally constructed in a way which causes all training data points to be found in at least one terminal node of the forest. As a consequence, it is common for the random forest to predict correctly for the majority of instances of the train set, resulting in an R2 that is very close to 1 (typically in the 0.95 to 1.0 range) (Liaw 2010). In addition, we use the adjusted version of the R2 measure, which mitigates the bias introduced by the high number of features used in our model (Harrell 2015). As we generate 100 bootstrap samples, we obtain 100 adjusted R2 values. We report summary statistics for this distribution of adjusted R2 values in the research questions below.
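The sketch below condenses this validation pipeline into Python. The search spaces, the number of search iterations, and the variable names (X_train, y_train, X_val, y_val, X_test, y_test, obtained from the temporal split described above) are illustrative assumptions rather than our exact settings:

```python
# Condensed sketch: 100 bootstrap samples, random-search tuning on the validation day,
# and adjusted R^2 computed on the hold-out test day.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

def adjusted_r2(r2: float, n_rows: int, n_features: int) -> float:
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

param_space = {                       # illustrative search space
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2", 0.5],
    "n_estimators": [100, 300, 500],
}

rng = np.random.RandomState(42)
scores = []
for _ in range(100):                                           # 100 bootstrap samples
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    X_boot, y_boot = X_train.iloc[idx], y_train.iloc[idx]

    # Tune on the bootstrap sample, validating on the 29th day only (fold 0).
    fold = PredefinedSplit([-1] * len(X_boot) + [0] * len(X_val))
    search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_space,
                                n_iter=10, cv=fold, n_jobs=-1)
    search.fit(np.vstack([X_boot, X_val]), np.concatenate([y_boot, y_val]))

    # Retrain with the chosen hyperparameters on the bootstrap sample, evaluate on the test day.
    best = RandomForestRegressor(random_state=42, **search.best_params_).fit(X_boot, y_boot)
    scores.append(adjusted_r2(best.score(X_test, y_test), len(X_test), X_test.shape[1]))

print(np.mean(scores), np.median(scores))                      # summary of the 100 adjusted R^2
```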

Fig. 4: Our model validation approach. RF stands for Random Forest and HP stands for hyperparameter


5 RQ1: How Well do Blockchain Internal Factors Characterize Transaction Processing Times?

Motivation

It is currently unknown which factors from the internal workings of the Ethereum blockchain best characterize transaction processing times. There exists an exhaustive set of information recorded on the blockchain at any given moment in time. For ÐApp developers, it is unclear which of these factors are useful in predicting processing times, and thus which should be monitored and analyzed.

Approach

We analyze the model’s explanatory power and feature importances. The details are shown below:

  • Analysis of a model’s explanatory power. To determine the explanatory power of blockchain factors, we engineer features based on the internal data recorded and available in the Ethereum blockchain (Table 1). Next, we use these features to construct (Section 4.3) and validate (Section 4.4) our random forest model. As a result of applying our model validation approach, we obtain a distribution of 100 adjusted R2 scores. This score distribution determines how much of the total variability in processing times is explained by our random forest model, thus serving as a good proxy for the explanatory power of our model. We judge the explanatory power of our model by analyzing the mean and median adjusted R2 scores.

  • Analysis of feature importance. To determine what individual features are most strongly associated with processing times, we compute and analyze feature importance using SHAP (Lundberg and Lee 2017; Molnar 2020). SHAP is a model-agnostic interpretation technique based on the game theoretically optimal Shapley values. SHAP first determines an average prediction value known as the base rate. Next, for each dataset instance, SHAP determines the Shapley value of each feature. The Shapley value informs the degree to which each feature moves the base rate (positively or negatively), in the same scale as the dependent feature. The final prediction is explained as: \(\text{base rate} + \sum_{i=1}^{n} Shapley(f_i)\), where \(f_i\) is one of the n features of the model. The importance of a feature \(f_i\) corresponds to its distribution of Shapley values across all dataset instances. More details about SHAP can be found in Appendix B.
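For concreteness, the sketch below (assuming a fitted random forest model and a feature DataFrame X_train, as detailed in the next paragraph) shows how per-feature Shapley values can be computed with the shap package and aggregated into absolute-value distributions; note that our actual ranking uses the Scott-Knott procedure, for which sorting by the median is only a rough stand-in:

```python
# Sketch: compute Shapley values with TreeExplainer and aggregate them per feature.
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(model)          # model: the random forest from Section 4.3
shap_values = explainer.shap_values(X_train)   # shape: (n_rows, n_features)

abs_shap = pd.DataFrame(np.abs(shap_values), columns=X_train.columns)
print(abs_shap.median().sort_values(ascending=False).head(10))  # rough importance ordering
```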

To compute Shapley values, we first train a single random forest model on the entire train set (28 consecutive days) using the procedures outlined in Section 4.3. Next, we use the TreeExplainer API from the shap Python package to compute the Shapley value of each feature for each dataset instance. As a result, each feature has an associated Shapley value distribution of size N, where N is the total number of rows in our train dataset. Next, we convert these Shapley value distributions into absolute Shapley value distributions. The rationale for such a conversion is that a feature f1 with a Shapley value of x is as important as a feature f2 with a Shapley value of −x. Finally, we employ the Scott-Knott (SK) algorithm to rank the distributions of absolute Shapley values. The feature ranked first is the one with the highest Shapley values. We note that there can be a rank tie between two or more features, which indicates that their underlying Shapley distributions are indistinguishable from an effect-size standpoint. We also note that we use Menzies’ implementation (Menzies 2020) of the original SK algorithm (Scott and Knott 1974), which (i) replaces the original parametric statistical tests with non-parametric ones and (ii) takes effect size into consideration (Cliff’s Delta). Menzies’ changes make the SK algorithm suitable to be used in the context of large datasets, where p-values are close to meaningless without an accompanying effect size measure and few or even no assumptions can be made about the shape of the distributions under evaluation.

Findings. Observation 1) Internal factors provide little explanatory power in the characterization of transaction processing times. Our random forest models achieve identical mean and median adjusted R2 scores: 0.16. These scores indicate that internal factors can only explain approximately 16% of the variance in processing times.

Observation 2) Gas price, median price of pending transactions, and nonce are the top-3 most important features of the model. Feature importance ranks are summarized in Fig. 5. It is unsurprising that the transaction’s gas price (gas_price_gwei) is the most important feature, since the incentive that miners receive to process transactions is a function of the gas price of the transactions that they choose to process. Nevertheless, we reiterate that our model achieves a median adjusted R2 of only 0.16 (Observation 1). That is, despite its inherent importance, the gas price of a transaction explains less than 0.16 of the variability in transaction processing times. We conjecture that the explanation lies in the relative value of a given gas price. In a way, gas prices can be interpreted as bids in an auction system. The actual, practical value of a bid will always depend on the bids offered by others. Hence, it is entirely possible that the same gas price of x GWEI could be regarded as high in one day (i.e., when other transactions typically have a gas price that is lower than x) and low in some other day (i.e., when the opposite scenario takes place). We explore this conjecture as part of RQ2, when we evaluate the explanatory power of pricing behavior.

Fig. 5: The distribution of absolute Shapley values per feature (minutes). Features are grouped according to their importance rank (lower rank, higher importance). Due to the high skewness of the data, we apply a log(x + 1) transformation to the y-axis

The second most important feature is the median gas price of recent pending transactions from the same transaction issuer (med_pend_prices). In Ethereum, a transaction t from a given transaction issuer I can only be processed once all prior transactions from I have been processed (i.e., those transactions with nonce lower than that of t). For instance, a transaction can stay in a pending state for a long time when a prior transaction from the same issuer has a very low gas price. Hence, pending transactions penalize the processing time of the current transaction. In fact, if the value of med_pend_prices is higher than zero, then we know by construction that there exists at least one pending transaction (since gas price cannot be zero). We assume that our model has learned this rule. Interestingly, the third most important feature is the transaction’s nonce (tx_nonce). Hence, we believe that our model is combining the second and third most important features to estimate the time penalty induced by pending transactions.

There is also a complementary facet to tx_nonce, which is bound to the transaction issuer. A transaction with a high nonce indicates that its transaction issuer has submitted many transactions in the past. Experienced transaction issuers are likely to be more careful in setting up their gas prices in comparison to novice issuers (e.g., they might be more mindful about the competitiveness of their chosen gas prices compared to novices) (Oliva and Hassan 2021).

Observation 3) Contextual features have little importance. The most important contextual feature is ranked only 7th. Its median Shapley value is 0.03 minutes (1.8 seconds). In other words, in at least 50% of the cases, this feature either adds or removes only 1.8 seconds to the final prediction.


6 RQ2: How Well does Gas Pricing Behavior Characterize Transaction Processing Times?

Motivation

The results of our previous research question indicate that internal factors of the Ethereum blockchain only explain a small amount of the variability in transaction processing times.

Similar to the stock market, the Ethereum blockchain ecosystem is impacted by a plethora of external factors, including speculation (e.g., NFTs, Elon Musk Tweets), the global geopolitical situation (e.g., wars, the USA-China trade war), cryptocurrency regulations (e.g., FED interventions), and even unforeseen events (e.g., the COVID-19 pandemic). Many of these factors are largely unpredictable and difficult to capture in the form of engineered features. Yet, these factors are known to impact the gas pricing behavior of a large portion of transaction issuers (Ante 2021; Binder 2022). For instance, a cryptocurrency ban in a country can cause a massive amount of people to sell their ETH and tokens, causing many ÐApps to send transactions to facilitate those actions, thus increasing the average gas price being paid for all transactions in the platform. Since the practical value of a gas price depends on the gas price of all other transactions, we conjecture that pricing behavior (and changes thereof) influences the processing times of transactions.

Approach

The additional features that we engineer in order to capture gas pricing behaviors are described in Table 2 along with their rationale. In summary, we introduce 42 additional features to our internal feature set. These features primarily involve aggregations of the number of previously processed transactions in recent blocks and of their processing times (historical features), in conjunction with the gas prices of these transactions (a behavioral feature). For instance, one of our features indicates whether a given gas price is very low/low/median/high/very high by comparing it to the gas prices of all transactions included in the prior 120 mined blocks. In a way, our model from RQ2 provides a relative perspective on gas prices instead of an absolute one (RQ1).
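As an illustration of how features of this kind can be computed, the sketch below derives a med_pct_below_120-style value for a single transaction. The column names and the helper function are ours, assuming a table of recently processed transactions:

```python
# Sketch: median percentage of transactions per block (previous 120 blocks) with a gas price
# below that of the current transaction.
import numpy as np
import pandas as pd

def med_pct_below(tx_gas_price: float, tx_block: int,
                  history: pd.DataFrame, window: int = 120) -> float:
    """history: processed transactions with 'block_number' and 'gas_price' columns."""
    recent = history[(history["block_number"] >= tx_block - window) &
                     (history["block_number"] < tx_block)]
    pct_below_per_block = (
        recent.groupby("block_number")["gas_price"]
              .apply(lambda prices: 100.0 * np.mean(prices < tx_gas_price))
    )
    return float(pct_below_per_block.median())
```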

Our approach is similar to that of RQ1. First, we reuse the model validation approach from Section 4 to determine the explanatory power of our new model (adjusted R2). Next, we compare the difference in adjusted R2 between the models from RQ1 and RQ2. The rationale is to determine the importance of the additional features (pricing behavior). We operationalize the comparison by means of a two-tailed Mann-Whitney test (α = 0.05) followed by a Cliff’s Delta (d) calculation of effect size. We evaluate Cliff’s Delta using the following thresholds (Romano et al. 2006): negligible for |d| ≤ 0.147, small for 0.147 < |d| ≤ 0.33, medium for 0.33 < |d| ≤ 0.474, and large otherwise. Finally, to understand the importance of individual features, we plot their distribution of absolute Shapley values and compute SK.
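A minimal sketch of this statistical comparison (the two distributions of 100 adjusted R2 values, here r2_rq1 and r2_rq2, are assumed to come from the validation pipeline of Section 4.4):

```python
# Sketch: two-tailed Mann-Whitney test plus Cliff's delta effect size.
import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    greater = sum((x > b).sum() for x in a)   # pairs where a wins
    less = sum((x < b).sum() for x in a)      # pairs where b wins
    return (greater - less) / (len(a) * len(b))

# stat, p_value = mannwhitneyu(r2_rq2, r2_rq1, alternative="two-sided")
# d = cliffs_delta(r2_rq2, r2_rq1)
```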

To further understand the role of individual features, we use Partial Dependence Plots (PDP). A PDP reflects the expected output of a model if we were to intervene and modify exactly one of the features. This is different from SHAP, where the Shapley value of a feature represents the extent to which that feature impacts the prediction of a single dataset instance while accounting for interaction effects. We also compute pair-wise feature interactions to understand the interplay between specific pairs of features. We use the explainer.shap_interaction_values() function from the same shap package. This function computes the Shapley interaction index by subtracting the main effects of the features from their combined effects, resulting in the pure interaction effects (Molnar 2020).
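The sketch below shows how both analyses can be produced for a fitted model (the feature name is illustrative; interaction values are typically computed on a sample of rows because they are expensive):

```python
# Sketch: partial dependence for one feature and pairwise SHAP interaction values.
import shap
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted (log-transformed) processing time on one feature:
PartialDependenceDisplay.from_estimator(model, X_train, features=["med_pct_below_120"])

# Pairwise interaction effects (main effects removed); shape: (n_rows, n_features, n_features):
explainer = shap.TreeExplainer(model)
interaction_values = explainer.shap_interaction_values(X_train.sample(1000, random_state=42))
```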

Findings. Observation 4) Our new model explains more than half of the variance in processing times. With the new feature set, our random forest models achieve identical mean and median adjusted R2 scores: 0.53. In RQ1, the mean and median scores were also identical at 0.16. As the numbers indicate, the pricing behavior dimension leads to an absolute increase of 0.37 in the median/mean adjusted R2. Indeed, the difference between the adjusted R2 distributions is statistically significant (p-value ≤ 0.05) with a large (d = 1) effect size. These scores confirm the explanatory power of the added pricing behavior features, and the identical mean and median values indicate that our models are stable across bootstrap samples.

In particular, our scores are comparable to (and sometimes even higher than) the adjusted R2 scores reported in software engineering studies that built regression models with the purpose of understanding a certain phenomenon (Bird et al. 2011; Jiarpakdee et al. 2021; Mcintosh et al. 2016). For instance, Mcintosh et al. (2016) built sophisticated regression models using restricted cubic splines to determine whether there is a relationship between (i) code review coverage and post-release defects, (ii) code review participation and post-release defects, and (iii) reviewer expertise and post-release defects. In total, the authors built 10 models and obtained adjusted R2 scores that ranged from 0.20 to 0.67 (mean = 0.46, median = 0.41, sd = 0.17). In comparison, the median adjusted R2 score of our model is 0.53, which is higher than the median score achieved by their models. Most importantly, our score was computed on a hold-out test set, while theirs were computed on the train set. As the hold-out test set is unseen by the model during training, it is generally harder to achieve a high R2 on it compared to the train set.

Observation 5) Gas price competitiveness is a major aspect in determining processing times. Figure 6 depicts the distribution of absolute Shapley values associated with each feature. The most important feature is the median of the percentage of transactions processed per block, in the previous 120 blocks, with a gas price below that of the current transaction (med_pct_below_120). The gas price per se (gas_price_gwei) only ranks 8th, with a small median absolute Shapley value of 0.01.

Observation 6) There is a gas price at which transactions are processed closest to the fastest speed possible (and thus increasing the price beyond it is not likely to further reduce the processing time). The PDP depicted in Fig. 7 leads to several insights. First, the range of the y-axis indicates that changes in med_pct_below_120 have a considerable influence on the predicted processing times. Second, we note an inverse relationship between med_pct_below_120 and the predicted processing time up until med_pct_below_120 = 50%. That is, let p(B) be the lowest gas price used by a transaction inside a mined block B and let p(t) be the gas price of a transaction t. Setting p(t) such that p(t) > p(B) holds for at least half of the 120 most recently mined blocks already provides a speed boost in transaction processing time. From med_pct_below_120 = 50% onwards, however, the processing time does not decrease any further.

In summary, (i) increasing the gas price beyond a certain value x does not tend to reduce processing times any further and (ii) the value of x is dependent on the prices of the transaction in the prior 120 blocks (otherwise the gas_price_gwei would be more important than med_pct_below_120).

Observation 7) Information about the median price of pending transactions and nonce remain important factors. Looking at Fig. 6 again, we note that the median price of pending transactions from the same issuer (med_pend_prices) and the transaction nonce (tx_nonce) are ranked second and fourth, respectively. In RQ1, our baseline model ranked these same two features second and third, respectively (i.e., tx_nonce has dropped by one rank). In other words, these two features remain important factors.

We plotted the PDP for median price of pending transactions from the same issuer (med_pend_prices) and observed that the y-axis range is very constrained (range of 0.01 for 90% of the values of med_pend_prices). In other words, the predicted processing time changes very little when med_pend_prices is changed and everything else kept constant. Nevertheless, such a feature ranks second according to SHAP (Fig. 6). Hence, we conjecture that med_pend_prices has strong interactions with some other feature. In fact, in RQ1 we conjectured that the model could be using both med_pend_prices and tx_nonce to penalize processing time due to pending transactions. To investigate our conjecture, we calculated feature interactions between med_pend_prices and every other feature. The results indicate that the strongest interaction is indeed with tx_nonce. We then computed all pairwise feature interactions and noted that the feature pair <med_pend_prices, tx_nonce> is the second-strongest feature interaction in our model, only behind <med_pct_below_120, std_num_below_120> (i.e., the first and third most important features).

Fig. 6: The distribution of absolute Shapley values per feature in our model with additional features. Features are grouped according to their importance rank. Due to the high skewness of the data, we apply a log(x + 1) transformation to the y-axis

Fig. 7: A partial dependence plot for the med_pct_below_120 feature and transaction processing time

7 Discussion

In the following, we discuss the implications drawn from our findings in the previous sections.

Implication 1) ÐApp developers should focus on quantifying how competitive their gas prices are. Price competitiveness is the most important indicator for transaction processing times among our entire set of engineered features. In particular, it is significantly more important than internal blockchain factors that practitioners typically associate with transaction processing times, including: block difficulty (ranks 17 and 20), the number of transactions in the pending pool (rank 10), and network utilization (rank 15). This is interesting considering that it is common for other, non-blockchain-based transaction processing systems to exhibit longer processing times as their load increases (we revisit this contrast in Implication 3).

In practice, ÐApp developers should ensure that they execute transactions with competitive gas prices, and should understand that the competitiveness of a given price is likely to change over time. Given our empirical results, we believe that our metric med_pct_below_120 can serve as a starting point for ÐApp developers to quantify price competitiveness. We warn developers, however, that such a price should not be “too high”, since there is a point at which higher prices do not result in lower processing times. More generally, future research in explainable models for transaction processing times should investigate the optimal number of past blocks to consider and how frequently such a number needs to be fine-tuned over time to avoid concept drift.

Implication 2) The architecture of ÐApps should take into account the fact that processing time is only weakly associated with gas usage and gas limit. The contribution of our work lies not only in identifying the features that are most important, but also in identifying those that are not.

Figure 6 indicates that features related to gas usage, such as tx_gas_limit and std_func_gas_usage_120 (which is correlated with med_func_gas_usage_120), have little importance. This suggests that transactions with competitive gas prices are prioritized by miners over the amount of gas that they will potentially consume. Hence, if a ÐApp invokes a heavy computation function that consumes lots of gas, then such a transaction will still require a competitive gas price to be processed in a timely fashion. Such a ÐApp might be costly and potentially unfeasible to maintain. We briefly discuss two strategies that would help DApp developers achieve a better time-cost balance.

The first strategy is to optimize the source code of smart contracts to make it burn less gas. This problem is known in the literature as gas (consumption) optimization (Zou et al. 2019; Chen et al. 2017; Marchesi et al. 2020). A function that burns less gas would enable transaction issuers to submit transactions using a higher gas price while still maintaining the same overall cost. Consequently, as shown in our results, transactions with more competitive gas prices tend to get processed faster. However, we note that gas consumption optimization should be performed carefully, since such optimizations can hinder program comprehension and testability.

The second strategy refers to mitigating the risk of making a poor gas price choice in the case of transactions that perform repetitive operations (e.g., updating the balance of several user accounts). For instance, assume that the balances of 200k accounts need to be updated. Assume that the ÐApp in question has a function f that takes a list of accounts and updates their balances. Instead of submitting a single, heavy transaction with a potentially poor gas price choice (as current estimation services are far from perfect), it is better to split the workload into batches and send one transaction for each batch. Now, instead of making one single price choice, the transaction issuer needs to make several gas price choices (one for each batch). This amortizes the risk of poor price choices, analogously to the concept of dollar cost averaging as defined by Benjamin Graham in the context of stock investment strategies (Graham and Zweig 2003). Finding opportunities to migrate from a single heavy transaction to several smaller transactions that can be executed in batches often requires the ÐApp’s architecture to be planned in such a way that contract transactions are as atomic as possible.
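A purely illustrative sketch of this batching strategy follows; choose_gas_price and submit_update_balances are hypothetical placeholders for the pricing logic and the actual contract call, respectively:

```python
# Illustrative sketch: split a heavy workload into batches, one transaction (and one gas price
# decision) per batch, to amortize the risk of a single poor gas price choice.
def batches(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def choose_gas_price() -> float:
    """Placeholder: in practice, informed by a competitiveness metric such as med_pct_below_120."""
    return 50.0

def submit_update_balances(batch, gas_price_gwei):
    """Placeholder for the contract transaction that updates the balances of the given accounts."""
    pass

accounts = [f"0x{i:040x}" for i in range(200_000)]   # placeholder account addresses
for batch in batches(accounts, batch_size=500):
    submit_update_balances(batch, choose_gas_price())
```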

We acknowledge, however, that the aforementioned strategies are unfeasible in specific scenarios or use cases. Most importantly, our work highlights an important and yet not so obvious interplay between ÐApp architecture design and transaction processing times.

Implication 3) We did not observe a strong relationship between contextual features and processing time. However, future research should not discard this type of feature. We observed that features in the contextual dimension generally contribute very little to the resulting prediction. Most notably, neither network utilization (net_util) nor network congestion (pending_pool) holds a strong relationship with processing time. This is in contrast with conventional transaction processing systems (e.g., cloud-based services), which are likely to exhibit an increase in processing times when many users are simultaneously using the system. This may occur either because (i) contextual features really do not matter that much compared to pricing behavior or (ii) contextual factors are not strongly present in our dataset (e.g., seasonalities and network congestion). We thus invite future research to further explore contextual features with other datasets and timeframes. Additional contextual features should also be engineered, as they may impact processing times more heavily than those included in our study.

8 Related Work

To the best of our knowledge, no prior study has investigated the explanatory power of such an exhaustive set of features with respect to transaction processing times. We discover several insights using this approach, the majority of which suggest the importance of historical features related to gas prices. We hope our contributions prove useful for ÐApp developers, and invite future research to build upon our study.

Pierro and Rocha (2019) investigate how several factors related to transaction processing times cause or correlate with variations in gas price predictions made by Etherchain’s Gas Price oracle.

The authors conclude that there is a non-directional causality between oracle gas price predictions and the time to mine blocks, a unidirectional causality (inverse correlation) between oracle predictions and the number of transactions in the pending pool, and a unidirectional causality (inverse correlation) between oracle predictions and the number of active miners.

In a later study, the same authors (Pierro et al. 2022) evaluate the processing time and corresponding gas price predictions made by EthGasStation’s Gas Price API. To do so, the authors retrieve processed transactions whose gas price is greater than or equal to the gas prices in the processing time predictions made by EthGasStation. They then verify whether those transactions were processed within the following j blocks, as predicted by EthGasStation. The authors conclude that EthGasStation’s margin of error is higher than what it claims. Using a Poisson model that considers data from only the previous 4 blocks, the authors achieve more accurate predictions than EthGasStation. In turn, the authors suggest that such models should be constructed using data from the most recent blocks, as opposed to the 100 previous blocks considered by EthGasStation, in order to improve prediction accuracy.

de Azevedo Sousa et al. (2021) investigate the correlation between transaction fees and transaction pending time in the Ethereum blockchain. The authors calculate Pearson correlation matrices between several features across 1) the full set of transactions, 2) distinct sets of transactions based on their gas usage and gas price, and 3) clusters of transactions resulting from DBSCAN. The authors conclude that there is a negligible correlation between all features related to transaction fees and the pending time of transactions.

Singh and Hafid (2020) focus on the prediction of transaction confirmation time in the Ethereum blockchain. The study includes 1 million transactions described with non-engineered features from the blockchain. The authors discretize transactions by their confirmation times (15s, 30s, 1m, 2m, 5m, 10m, 15m, and ≥ 30m). The majority of the constructed classifiers (Naive Bayes, Random Forests, and Multi-Layer Perceptrons) achieve higher than 90% accuracy, with the Multi-Layer Perceptron classifier regularly outperforming the others.

Liu et al. (2022) investigate the impact of the transaction fee mechanism (TFM) EIP-1559 on processing times in the Ethereum blockchain since its relatively recent adoption. The authors discover that EIP-1559 makes fee estimation easier for users, ultimately reducing both transaction processing times and the gas prices paid by transactions within the same block. However, when the price of Ether is volatile, processing times become much longer. We find that features related to historical gas prices are the most important for estimating processing times; given that EIP-1559 makes fee estimation, and thus gas price assignment, easier for users, similar TFMs may help ÐApp developers avoid slow processing times altogether.

The study conducted by Kasahara and Kawahara (2019) uses queuing theory to investigate the impact of transaction fees on transaction processing time in the Bitcoin blockchain. The authors conclude that transactions associated with high transaction fees are processed more quickly than those that are not. The authors also find that network congestion strongly increases transaction processing times, regardless of whether transactions carry low or high fees and regardless of block size. Conversely, our study finds that network utilization has minimal impact on transaction processing time.

Oliva et al. (2020) analyze how smart contracts deployed on the Ethereum blockchain are commonly used. The authors conclude that 0.05% of all smart contracts were the target of 80% of the transactions executed during the time period of their study. In addition, 41.3% of these contracts were found to be deployed for token-related purposes. Our paper includes information regarding the smart contract targeted by a transaction where applicable, though we do not find that these features have a strong impact on transaction processing time.

The work of Zarir et al. (2021) investigates how developers can build cost-effective ÐApps. Similar to our study, the authors conclude that transactions are prioritized based on their gas price rather than their gas usage. In addition, the authors find that the gas usage of a function can be easily predicted. As a result, the authors suggest that ÐApp developers provide gas usage information in their smart contracts, and that platforms such as Etherscan, as well as wallets, provide users with historical gas usage information for smart contract functions.

9 Threats to Validity

Construct validity

The pending timestamp of a transaction is not publicly available information within the Ethereum blockchain. As a result, we rely on Etherscan to provide an accurate measurement of when a transaction was actually executed. Given the size, reputation, and popularity of Etherscan, we consider the data it provides to be accurate.

This paper is a first attempt toward using predictive models to derive and explain insights from data within the context of the Ethereum blockchain. We choose to leverage Random Forest models as they offer a good balance between performance and explainability. We encourage future work to build upon the approach demonstrated in our study to obtain higher-performing models, thus further contributing to the prediction of transaction processing times and to the generalizability of the model explanations in our study.

Our Random Forest models are hyperparameter-tuned in each bootstrap iteration, both to control the configuration of the model and to achieve greater performance on the held-out test set. We only consider a few hyperparameters, and there are several other hyperparameters that can be tuned which were not considered in our study. We invite future researchers to consider a more exhaustive set of hyperparameters when tuning similar Random Forest models, and to investigate their impact on performance.
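
For concreteness, the sketch below shows one way to tune a small set of Random Forest hyperparameters with scikit-learn’s GridSearchCV. The grid, the random seed, and the X_train/y_train names are illustrative assumptions and do not reproduce the exact configuration used in our study.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid: only a handful of hyperparameters are searched,
    # mirroring the limited tuning described above.
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 30],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        param_grid=param_grid,
        scoring="r2",
        cv=5,
    )
    search.fit(X_train, y_train)         # X_train/y_train: one bootstrap sample (assumed names)
    best_model = search.best_estimator_  # later evaluated on the held-out test set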

Although we opt to leverage the adjusted R² metric as a robust approach to determine the goodness of fit of the model, other performance metrics exist that can, in certain contexts, contribute toward a more complete assessment of model performance. As such, future work should assess similar models using additional performance metrics.
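
For reference, the adjusted R² penalizes the ordinary R² for the number of predictors used by the model. A minimal sketch of the computation, assuming n observations and p features, is shown below.

    from sklearn.metrics import r2_score

    def adjusted_r2(y_true, y_pred, p):
        """Adjusted R^2 given n observations and p predictors."""
        n = len(y_true)
        r2 = r2_score(y_true, y_pred)
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)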

We fully rely on SHAP values to generate explanations for the predictions of our Random Forest models. It is possible that different feature importance methods would generate different results. We thus encourage future work to explore other methods for explaining models (e.g., LIME (Ribeiro et al. 2016)).
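
As an illustration, the snippet below shows the typical way of computing SHAP values for a tree ensemble with the shap library; best_model and X_test are assumed names for a fitted Random Forest and a feature matrix.

    import shap

    explainer = shap.TreeExplainer(best_model)   # model-specific explainer for tree ensembles
    shap_values = explainer.shap_values(X_test)  # one contribution value per feature per prediction
    shap.summary_plot(shap_values, X_test)       # global view of feature contributions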

Internal validity

In our paper, we choose to study the explanatory power of a set of both internal and external features spanning various dimensions. The majority of the features in our set primarily reflect technical aspects of the Ethereum blockchain. However, other types of features exist, such as those based on or tracking market speculation (e.g., Elon Musk tweets (Ante 2021)), which might play a large role in describing the behavior of the Ethereum blockchain. As a result, we invite future work to design and investigate an even more exhaustive set of features beyond those included in our study. Additionally, we use some features as proxies for other information, such as using the transaction nonce (tx_nonce) as a proxy for user experience. Future work can instead collect such information explicitly to study its direct impact on processing times.

The collected data and concluding results naturally depend on the defined time frame of our study. We also use sampling methods to construct our results; however, we draw a representative sample of the blocks appended to the Ethereum blockchain per day in an attempt to capture the complete and natural behavior of the network during this period. Seasonal trends are known to impact the blockchain (Kaiser 2019; Werner et al. 2020) and therefore may have affected the insights and conclusions derived in our study. As a result, we suggest that future studies validate our findings by replicating them using different and longer time frames. Finally, we emphasize that the approach we demonstrate in this paper is extensible and can therefore be followed in similar contexts with similar goals.

In addition to depending on the chosen time frame, our study also relies on actively collecting data from Etherscan. As Etherscan does not offer a robust API that allows us to collect all data points used in our study, we developed other methods of data collection. Our data collection methods respect Etherscan’s platform by adhering to the standards it has set, so as not to disrupt the availability of its normal services. Because of this, we are unable to collect every transaction executed within the Ethereum blockchain, which might result in imbalanced data, such as collecting a small number of transactions that are processed extremely quickly or extremely slowly.

External validity

By construction, the conclusions that we draw in our paper are tied to the dataset and time frame that we analyzed. Since our approaches for data curation and model interpretation were designed to be robust, we consider our conclusions to be reliable. Nonetheless, due to the phenomenon of concept drift (Nishida and Yamauchi 2007), most machine learning models need to be retrained from time to time (a.k.a. periodic retraining or refreshing). In this context, it remains unclear how long it would take for our model to start suffering from concept drift. We invite future research to build models similar to ours and evaluate their resilience to concept drift.

Our study focuses solely on the Ethereum blockchain during a specific time frame. As blockchains tend to be unique in purpose and design, the conclusions resulting from our study may not hold in other blockchain environments. For example, transaction fees in the Bitcoin blockchain are calculated differently than fees in the Ethereum blockchain, and thus might not impact processing times as heavily. Additionally, blocks in the Bitcoin blockchain take about 10 minutes on average to be appended to the chain, which might cause the studied factors to have little impact on transaction processing times. We again note that we designed the approach in this paper to be as extensible as possible, so that it can be reused in similar contexts with similar goals. In turn, we encourage future research to conduct replication experiments similar to ours on other blockchain platforms and over different time frames.

10 Conclusion

ÐApps on the blockchain require code to be executed through the use of transactions, which need to be paid for. For ÐApps to be profitable, developers need to balance paying high amounts of Ether to have their applications’ transactions processed in a timely manner against providing a good end-user experience. Existing processing time estimation services aim to solve this problem; however, they offer minimal insight into which features truly impact processing times, as these platforms do not offer any interpretable information regarding their models.

With this as motivation, we collect data from Etherscan and Google BigQuery to engineer and investigate features that capture internal factors and gas pricing behaviors. We use these features to build interpretable models using a generalizable and extensible model construction approach, in order to discover which features best characterize transaction processing times in Ethereum, and to what extent. In particular, we discover that metrics regarding the gas prices of transactions processed in the recent past hold the most explanatory power of all features. The studied features that capture gas pricing behaviors hold more explanatory power than the internal-factor dimensions, both individually and combined. Based on our findings, ÐApp developers should primarily focus on monitoring recent gas pricing behaviors when choosing gas prices for the transactions in their applications. Ensuring that the chosen gas prices are competitive with recent gas pricing will ultimately help developers achieve a desired level of QoS. At the same time, ÐApp developers should avoid setting gas prices that deviate too far above the recent past, since doing so yields diminishing returns. We encourage future work to engineer and investigate other factors that capture gas pricing behaviors, such as those proposed in this study, to see how they impact model predictions.

As gas price is much more important than gas limit, ÐApp developers should avoid designing contract functions that consume large amounts of gas. Instead, developers should consider performing gas optimizations on their contracts’ source code and adjusting transaction workloads to mitigate poor gas price choices. We invite future work to directly focus on the relationship between smart contract gas consumption, ÐApp architecture design, and transaction processing times.

Our results show that blockchain internal factors are only loosely associated with transaction processing times. Nevertheless, there were specific times in history when Ethereum substantially slowed down due to a flood of transactions (e.g., the CryptoKitties incident (BBC 2017)). Therefore, future work should reinvestigate the factors impacting processing times in the context of seasonalities and heavy workloads.

ÐApp developers can potentially reduce the number of internal factors they actively monitor, as we found internal factors to generally contribute very little to the resulting prediction. More specifically, the internal factors that we studied in our contextual dimension do not impact the predictions of the models at all. In turn, ÐApp developers can focus more on gas pricing behaviors and a smaller subset of internal factors when aiming to achieve their desired QoS. However, we suggest that future work should not exclude such factors. Instead, future work should focus on these factors explicitly, for example by engineering new contextual features or including those that our analysis did not consider, as such features may have a stronger impact on processing times.

We also encourage and invite future work to investigate other explainable models for transaction processing times that include additional features and refine the ones we propose, in order to achieve higher performance. Our most robust models achieve an adjusted R² of 0.53, meaning that almost half of the variance is not explained by the many features we collect and engineer.

Finally, we encourage future research to use the results of our experiments to motivate the continuation of empirical studies within the area of blockchain. In relation to our study specifically, possible topics for future research include: i) the construction and interpretation of other types of models that can more accurately predict processing times, ii) similar experiments conducted in blockchain environments other than Ethereum, and iii) a similar or new set of experiments conducted for comparison, using parameters that differ from those in our study, such as time frames, statistical methods and techniques, and additional features.

11 Disclaimer

Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of Huawei.