Evolution of automated weakness detection in Ethereum bytecode: a comprehensive study

Blockchain programs (also known as smart contracts) manage valuable assets like cryptocurrencies and tokens, and implement protocols in domains like decentralized finance (DeFi) and supply-chain management. These types of applications require a high level of security that is hard to achieve due to the transparency of public blockchains. Numerous tools support developers and auditors in the task of detecting weaknesses. As a young technology, blockchains and their utilities evolve fast, making it challenging for tools and developers to keep up with the pace. In this work, we study the robustness of code analysis tools and the evolution of weakness detection on a dataset representing six years of blockchain activity. We focus on Ethereum as the crypto ecosystem with the largest number of developers and deployed programs. We investigate the behavior of single tools as well as the agreement of several tools addressing similar weaknesses. Our study is the first that is based on the entire body of deployed bytecode on Ethereum's main chain. We achieve this coverage by considering bytecodes as equivalent if they share the same skeleton. The skeleton of a bytecode is obtained by omitting functionally irrelevant parts. This reduces the 48 million contracts deployed on Ethereum up to January 2022 to 248 328 contracts with distinct skeletons. For bulk execution, we utilize the open-source framework SmartBugs, which facilitates the analysis of Solidity smart contracts, and enhance it to accept bytecode as the only input. Moreover, we integrate six further tools for bytecode analysis. The execution of the 12 tools included in our study on the dataset took 30 CPU years. While the tools report a total of 1 307 486 potential weaknesses, we observe a decrease in reported weaknesses over time, as well as a degradation of tools to varying degrees.


Introduction
Smart contracts are event-driven programs running on the nodes of decentralized networks known as blockchains. Specific transactions, once included in the blockchain, trigger the execution of these blockchain programs. Every node executes the code locally within a virtual machine and updates its state of the blockchain. The computations are deterministic, ensuring that all nodes arrive at the same state. The flexibility of smart contracts and the unique properties of blockchains, most notably decentralization and immutability, gave rise to innovative applications in areas like decentralized finance and supply chain management. Their potential has led to ecosystems with large numbers of start-ups and market caps of hundreds of billions of USD.
Against this background, weaknesses in smart contracts can lead, and have led, to costly disruptions and losses. Early on, academia and industry focused on methods and tools for developing secure smart contracts. In a survey on automated vulnerability detection conducted in mid-2021, Rameder et al. [2022] identified 140 tools for Ethereum, the major smart contract platform. The sheer number makes it hard to decide which tools may be suited for the task at hand and calls for regular tool evaluations and comparisons.
The work presented in this paper is unique in several respects. First, we analyze the evolution of weakness detection over time with a focus on the quality of tools.
Second, we aim at a complete coverage of the Ethereum main chain, which is a formidable endeavor in light of 48 million deployments of smart contracts (up to Jan 2022). This enables us to investigate the evolution of weakness detection over a period of more than six years. We select one contract per skeleton of bytecode (cf. Section 3.2), which reduces the number of objects to analyze to 248 328.
Third, we concentrate on the runtime bytecode as input to the tools. In fact, surveys usually evaluate tools based on benchmarks of Solidity source code (cf. Section 5.2). However, many tools, in particular those considered here, actually analyze the bytecode. If possible, any findings are later attributed to the source line closest to where the bytecode originated from. Moreover, for many contracts on the blockchain, the source code is not available. By choosing runtime bytecode as the least common denominator, we can include tools that do not contextualize their findings, and we are able to consider all smart contracts deployed so far.
Finally, to perform our study, we extended SmartBugs [Ferreira et al., 2020], a framework for executing analysis tools in a unified manner, which is used by developers to analyze smart contracts routinely with several tools at once. Integrating new tools into the framework makes them available for future evaluations by others.
With 13 tools, 15 weakness classes, 248 328 runtime bytecodes of smart contracts, and an execution time of 31 years, our evaluation is more comprehensive than previous studies. In summary, the contributions of this paper are:

• A method for selecting a feasible number of smart contracts that are representative of the 48 M blockchain programs deployed on Ethereum in the course of six years.
• An extension of the framework SmartBugs to include 13 tools for vulnerability detection with bytecode-only input.
• A portrait of the evolution of tool behavior and weakness detection on the 248 328 representative smart contracts.

The tools in our study report findings with varying degrees of certainty, from warnings about potential weaknesses to exploits demonstrating the existence of a vulnerability or, more rarely, proofs guaranteeing their absence. As proving the absence or presence of software properties is difficult, most tools employ heuristics, usually favoring a higher number of false positives over the possibility of overlooking an actual vulnerability. Such tools issue warnings and leave the final assessment to the user.
Deployment vs. runtime code. To deploy a contract on an Ethereum chain, an external user submits a create transaction, or the Ethereum Virtual Machine (EVM) executes a create instruction. The transaction/instruction contains the deployment code. It consists of an active part, D, which typically sets up the environment for the new contract. At its end, D returns a pointer to a memory area with the actual runtime code, which the EVM then stores at the address of the new contract. The deployment code is free to assemble the runtime code arbitrarily, but typically just copies code following D.
Source code vs. bytecode. The majority of Ethereum contracts are written in Solidity, a programming language inspired by C++. The so-called constructor and any global initializations compile to the active part of the deployment code, D, whereas all other parts of the source file compile to the runtime code proper, R, which is appended to D. After R, the compiler appends meta-data, M, which contains a hash identifying the original source code. Changing any character in the Solidity file, including comments, alters M and leads to a superficially different deployment and runtime code.
Listing 1 shows the Solidity code of a contract C1 that deploys a contract of type C2 as part of its own deployment. At runtime, each call to function f deploys a contract C3. The compiler thus generates a nested bytecode: the deployment code of C2 is embedded in the deployment part of C1, while the deployment code of C3 (with the runtime code of C3 inside) becomes part of the runtime code of C1.

Skeletons. The skeleton of a contract is obtained by removing meta-data, the arguments of PUSH operations, constructor arguments, and trailing zeros. The rationale is to remove parts that contribute little to the functionality of the contract, with the aim to equate contracts with the same skeleton.

Table 1 gives an overview of the deployment activities on Ethereum's main chain up to block 14 M (Jan 2022). The 48.3 M contract creations involved 2.2 M different deployment codes, generating a total of 0.5 M distinct runtime codes. The removal of meta-data reduces the number of distinct codes by 29 %, the removal of PUSH constants by another 22 %.
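To make the construction concrete, the following sketch computes a skeleton from a runtime bytecode. It is our illustration of the idea, not the authors' implementation; it assumes the usual Solidity layout, where the last two bytes of the bytecode encode the length of the meta-data trailer, and it simply drops PUSH operands.

    PUSH1, PUSH32 = 0x60, 0x7F  # opcode range of EVM PUSH1..PUSH32

    def strip_metadata(code: bytes) -> bytes:
        # The Solidity compiler appends CBOR-encoded meta-data whose length
        # is stored big-endian in the final two bytes of the bytecode.
        if len(code) >= 2:
            mlen = int.from_bytes(code[-2:], "big")
            if 0 < mlen + 2 <= len(code):
                return code[: len(code) - mlen - 2]
        return code

    def skeleton(code: bytes) -> bytes:
        # Omit meta-data, PUSH operands, and trailing zeros.
        code = strip_metadata(code)
        out, i = bytearray(), 0
        while i < len(code):
            op = code[i]
            out.append(op)
            i += 1
            if PUSH1 <= op <= PUSH32:  # skip the 1..32 operand bytes
                i += op - PUSH1 + 1
        return bytes(out).rstrip(b"\x00")

Bytecodes embedding further contracts carry several meta-data sections, so a full implementation would have to handle those cases as well.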
A family of codes is a collection of codes with the same skeleton. The size of the families seems to follow a Pareto principle: 84 % of the families are singletons (the skeleton is uniquely associated with a single runtime code), 15 % of the families consist of 2 to 10 codes, whereas at the other end of the spectrum we find a skeleton shared by 16 372 codes.

Study Design
In this section, we define the research questions, describe the data set and the selection of tools, the execution framework, the weaknesses and the taxonomy considered, and finally the mapping of the tool findings to the taxonomy.

Research Questions
RQ1: Abstraction. How well are skeletons suited as an abstraction of functionally similar bytecode in the context of weakness analysis? We investigate whether and how contracts with the same skeleton differ for a weakness analysis with bytecode input.
RQ2: Weakness Detection. Which weaknesses do the tools report for the contracts on Ethereum's main chain? We are interested in the evolution of types and numbers of weaknesses reported for the deployments up to early 2022.
RQ3: Tool Quality. How do analysis tools behave in a weakness analysis with bytecode input? We investigate the tool quality with respect to maintenance aspects, execution time, errors, and failures.

RQ4: Overlap Analysis. To what extent do the tools agree on the findings? We determine the amount of tool agreement per weakness on a timeline.

Data
As we strive for a complete coverage of Ethereum's main chain, we collect the runtime codes of all contracts (including the self-destructed ones) that were successfully deployed up to block 14 M. For each family of codes, i.e., for each collection of codes sharing the same skeleton (see Section 2), we pick a single representative and omit the others. For practical purposes, we prefer deployments where Etherscan lists the corresponding source code. We obtain a dataset of 248 328 runtime codes with distinct skeletons that represent all deployments until January 13, 2022. 99.0 % of these codes originate from the Solidity compiler (as determined by characteristic byte sequences), with the source code for 46.5 % actually available on Etherscan. For our temporal analyses, we associate each code with the block number where the first member of the family was deployed. The longest-lived family consists of two codes implementing an ERC20 token; the codes were deployed 17 333 times over a range of almost 12 million blocks. The most prolific family consists of 20 codes deployed over 12 million times.

Not all bytecodes are proper contracts. In particular in the early days of the main chain, during an attack, a number of large contracts were deployed that served as data repositories for other contracts. For some tools, this leads to a noticeable spike in the error rate around block 2.3 M.
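A hypothetical sketch of this selection step, reusing the skeleton() function from the sketch in Section 2: group all runtime codes into families and pick one representative per family, preferring codes whose source is available.

    from collections import defaultdict

    def pick_representatives(codes, has_source):
        # codes: iterable of runtime bytecodes;
        # has_source: predicate telling whether Etherscan lists the source
        families = defaultdict(list)
        for code in codes:
            families[skeleton(code)].append(code)
        reps = []
        for members in families.values():
            with_source = [c for c in members if has_source(c)]
            reps.append((with_source or members)[0])
        return reps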

Tools
Our study aims to integrate tools into a common framework and to perform a bulk analysis, with large numbers of runtime bytecodes automatically checked for weaknesses. This imposes the following constraints on the tools.

1. Availability: The tool needs to be publicly available with its source open.
2. Interface: The tool can be controlled via a command-line interface.
3. Input: The tool is able to analyze contracts based on their runtime bytecode alone.
4. Findings: The tool offers an automated mode to report weaknesses.
5. Documentation: There is sufficient documentation to operate the tool.

These criteria exclude tools that need access to the application binary interface (ABI) or to source code. Likewise, tools that expect an additional setup, like an external blockchain with information on the environment, do not fit our setting. Starting from the 140 tools identified by Rameder et al. [2022], the criteria above leave us with the 13 tools in Table 2.

Execution Framework
For the large-scale execution of our study, we had the choice between two frameworks (cf. Section 5.2): SmartBugs [Ferreira et al., 2020] and USCV [Ji et al., 2021], both operating on Solidity level. We decided on the former, as SmartBugs is better maintained and already contained more of the tools we were interested in.
First, we adapted SmartBugs to accept bytecode as input, and updated the Docker images of the tools accordingly. Second, we integrated six further tools (set in boldface in Table 2). The most laborious part was writing the output parsers. For each tool, a dedicated parser scans the output of the tool to identify the result of the analysis, to detect anomalies, and to discard irrelevant messages.
For each run of a tool on a bytecode, the parser reports a list of findings (tags identifying the detected properties), a list of errors (conditions checked for and reported by the tool), a list of fails (low-level exceptions not adequately handled by the tool), and a list of messages (any other noteworthy information issued by the tool).
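The per-run record could be pictured as follows; this is a sketch with field names of our choosing, not the actual SmartBugs data structures.

    from dataclasses import dataclass, field

    @dataclass
    class ParseResult:
        findings: list = field(default_factory=list)  # tags of detected properties
        errors:   list = field(default_factory=list)  # conditions reported by the tool
        fails:    list = field(default_factory=list)  # unhandled low-level exceptions
        messages: list = field(default_factory=list)  # other noteworthy output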
Choice of Parameters. Ren et al. [2021] show that the choice of parameters strongly affects the results, especially when the timeout is below 30 minutes per contract. We set the maximal runtime to 1 800 s wall time, with 1.5 CPUs assigned to each run. If a tool offers a timeout parameter, we communicate the runtime minus a grace period to allow the tool to terminate properly. Conkas, eThor, Maian, Securify, teEther and Vandal offer no such parameter and are stopped by the external timer.
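This timeout regime can be sketched as follows; the '--timeout' flag stands in for whatever parameter a particular tool offers, and the length of the grace period is our assumption, as the text does not state it.

    import subprocess

    WALL_LIMIT = 1800  # seconds of wall time per run
    GRACE = 60         # assumed grace period for orderly termination

    def run_tool(cmd, supports_timeout):
        if supports_timeout:
            # let the tool wind down on its own before the hard limit
            cmd = cmd + ["--timeout", str(WALL_LIMIT - GRACE)]
        try:
            return subprocess.run(cmd, capture_output=True, timeout=WALL_LIMIT)
        except subprocess.TimeoutExpired:
            return None  # killed by the external timer, counted as a timeout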
As there is a tradeoff between the memory limit per process and the number of processes run in parallel, we aimed at providing sufficient but not excessive memory.Based on an initial test with 500 contracts, we set the memory limit to 20 GB for eThor, Pakala, Securify and teEther, and to 4 GB for all other tools.We reran tasks with a limit of 32 GB if they had failed with a segmentation fault or a memory problem.
Machine. We used a server with an AMD EPYC 7742 64-core CPU and 512 GB of RAM. Table 3 gives an overview of the computation time, memory usage, and memory fails before and after the rerun with 32 GB.

Mapping of Tool Findings
Taxonomy. To compare the tools regarding their ability to detect weaknesses, we need a taxonomy with an adequate granularity. Since there is no established taxonomy of weaknesses for smart contracts, previous studies [Chen et al., 2020, Tang et al., 2021, Wang et al., 2021, Kushwaha et al., 2022b, Rameder et al., 2022, Tolmach et al., 2022, Zhou et al., 2022] not only summarize potential issues, but also structure them with respect to their own taxonomies, none of which is compelling or widely used.
Among the community projects, there are two popular taxonomies: the DASP TOP 10 from 2018 with 10 categories, and the SWC registry with 37 classes, last updated in 2020. As for DASP, two categories, Access Control (2) and Other (10), are quite broad, while Short Address (9) is checked by hardly any tool. Moreover, DOS (5) and Bad Randomness (6) are effects that may be the result of various causes, and most tools detect causes rather than consequences.
The SWC registry is more granular as it offers several classes for the broad categories Access Control and DOS.Moreover, most of its categories match relevant findings of the tools.Therefore, we select this taxonomy as the basis of our comparison.
Findings mapped. The tools report 82 different findings, of which we can map 56 to one of the 37 classes of the SWC taxonomy (see Table 5). In total, the tools cover the 15 weakness classes listed in Table 4, together with the number of findings and of reporting tools per class. When a tool reports a finding, we assume that it is not invalidated by an accompanying error condition, a low coverage of the bytecode, or a timeout. However, we note errors, timeouts, and unhandled conditions (fails).
Findings omitted.We omit the nine findings of HoneyBadger, as it does not detect weaknesses, but patterns characteristic of honeypots, i.e., of contracts that pretend to be vulnerable (as an incentive for an exploit attempt) but keep any Ether transferred.
Moreover, we omit seven redundant, intermediate, or positive findings: four findings of Maian (accepts Ether, no Ether leak, no Ether lock, not destructible), one of Osiris (arithmetic bug), one of Vandal (checked call state update), and one of eThor (secure).
Finally, 10 findings do not match any SWC class and are omitted as well: one finding of Ethainter (unchecked tainted static call), one of Securify (missing input validation), two of Maian (Ether lock, Ether lock with Ether accepted without send), five of Osiris (Callstack bug, Division bugs, Modulo bugs, Signedness bugs, Truncation bugs), and one of Oyente (Callstack Depth Attack Vulnerability).
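Conceptually, the mapping is a lookup table from (tool, finding) pairs to SWC classes, with omitted and unmapped findings yielding no class. The excerpt below is a sketch of such a table; the entries are taken from the Mythril rows of Table 5, but the dictionary structure is ours.

    FINDING_TO_SWC = {
        ("Mythril", "Integer Arithmetic Bugs"): "SWC 101",
        ("Mythril", "Unchecked return value from external call"): "SWC 104",
        ("Mythril", "Unprotected Ether Withdrawal"): "SWC 105",
        ("Mythril", "Unprotected Selfdestruct"): "SWC 106",
        ("Mythril", "State access after external call"): "SWC 107",
        ("Mythril", "Dependence on tx origin"): "SWC 115",
    }

    def to_swc(tool, finding):
        # returns None for omitted or unmapped findings
        return FINDING_TO_SWC.get((tool, finding))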

Abstraction
To validate our hypothesis that bytecodes with the same skeleton (i.e., members of the same code family) behave similarly regarding bytecode analysis, we randomly select 1 000 bytecodes from all runtime bytecodes not in our data set. By construction, these codes belong to families with at least two members. The selected bytecodes happen to belong to 620 families. We add the corresponding 620 representatives from our data set, obtaining a dataset with 1 620 bytecodes and 620 families with 2 to 64 members per family.
When running analysis tools on different members of the same family, we expect nearly identical results, with small variations due to differences in runtimes (e.g., one run timing out while the other one finishes just in time with some finding) or due to the effect of different constants when solving constraints. In particular, we do not expect the meta-data injected by the Solidity compiler to affect the result, as it is interpreted neither as code nor as data during execution. To confirm this, we also consider a copy of our 1 620 bytecodes, where we replace all meta-data sections with zeros.
Table 6 shows the result of running all tools on the bytecodes with and without meta-data. Columns two and four give the percentage of the 620 families for which the findings differ within the family, whereas columns three and five consider all data collected by the output parsers, including errors, fails, and messages. If we assume that the various effects influencing the output give rise to a normal distribution, then for a confidence level of 95 %, the sample size of 620 yields a margin of error of 1.5 % for the smaller values in the table and of 3.2 % for the larger ones.
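For reference, the normal-approximation margin of error for a proportion p estimated from n samples is

    e = z · sqrt(p (1 − p) / n),  with z = 1.96 at a confidence level of 95 %.

With n = 620, a proportion around 4 % yields e ≈ 1.5 %, and one around 20 % yields e ≈ 3.1 %, consistent with the margins quoted above; the exact proportions behind the quoted figures are our back-calculation, not stated in the text.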
The seven tools on top behave essentially as predicted. For Conkas, the rate of 1.5 % corresponds to 9 families with divergent findings. These differences are related to warnings about integer under- and overflows, and may indeed be the result of different constants in the codes of a family. Observe that for these seven tools, there is hardly any difference between the two datasets, with and without meta-data.
HoneyBadger, Osiris, and Oyente stand out, as we find 20 % discrepancies in the output. Oyente starts its analysis by disassembling the entire bytecode. It issues the warning 'incomplete push instruction' when stumbling upon a supposed PUSH instruction near the end of the meta-data that is followed by too few operand bytes. These spurious messages disappear when removing the meta-data, but otherwise do not affect the analysis. HoneyBadger and Osiris reuse Oyente's code and inherit this anomaly.
eThor also scans the entire bytecode. When encountering an unknown instruction, it issues a warning and ignores the remaining code. As with Oyente, these messages mostly disappear when removing the meta-data. However, unlike with Oyente, the meta-data influences the result of the analysis, as can be observed from the 2.9 % vs. 1.0 % differences in the findings for code with vs. without meta-data. In each of these cases, the analysis times out for some member(s) of the family but terminates with identical results for the others. We did not investigate the cause of these discrepancies but suspect that it may be comparable to the situation of Vandal.
In a tour de force, Vandal constructs a control flow graph for the entire bytecode and decompiles it to an intermediate representation. Vandal sometimes gets lost during this initial phase and times out. The situation improves when removing dead code like the meta-data. However, as Vandal interprets the addresses of all code sections relative to the beginning of the bytecode, even if they belong to a different contract (see the discussion on the structure of bytecode in Section 2), we still see differences regarding errors and fails.
Maian starts by scanning the entire bytecode for certain instructions, like SELFDESTRUCT. Not detecting the opcode anywhere lets Maian immediately conclude certain properties, whereas finding the opcode triggers a reachability analysis that may remain inconclusive. This sensitivity to single bytes yields divergent results for 70 families. For example, Maian may detect non-destructibility for one code and fail to do so for another one in the same family. Removing the meta-data gets rid of these divergences almost entirely.
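The pitfall of such byte-level scans is that a byte with the value of an opcode (SELFDESTRUCT is 0xFF, CALL is 0xF1) may occur inside PUSH operands or the meta-data without being an instruction. A minimal sketch of the difference, under the assumption that a proper scan walks the instruction stream:

    SELFDESTRUCT = 0xFF

    def naive_contains(code: bytes, opcode: int) -> bool:
        return opcode in code  # also matches data bytes

    def walk_contains(code: bytes, opcode: int) -> bool:
        # walk the instruction stream, skipping PUSH operands;
        # assumes the meta-data has been stripped beforehand (see Section 2)
        i = 0
        while i < len(code):
            op = code[i]
            if op == opcode:
                return True
            i += 1 + (op - 0x5F if 0x60 <= op <= 0x7F else 0)
        return False

The same effect resurfaces for Vandal in Section 4.2, which flags contracts without any CALL instruction.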
Observation 1. Treating bytecodes with the same skeleton as equivalent works for 10 out of 13 tools without reservations. Three tools unexpectedly analyze the meta-data, leading to minor output variations. Therefore, skeletons can be regarded as a suitable abstraction for large-scale analyses aimed at the big picture. Removing the meta-data prior to analysis may improve the performance of some tools (while not harming others).

Weakness Detection
In this section, we take a look at the overall findings reported by the single tools, and the numbers per weakness class on a timeline of blocks to portray the evolution.
Tool reports. Figure 1 depicts the reporting rate of the single tools over the range of 14 M blocks. Each data point represents the percentage of bytecodes in a bin of 100 k blocks that were marked with at least one finding by the respective tool. The gray vertical lines indicate forks that added EVM opcodes that may have affected weakness detection.
Overall, the share of contracts flagged by the tools diminishes over time. This can be interpreted as newer contracts being less vulnerable than older ones, or as the tools becoming less effective in detecting weaknesses. The latter may apply to unmaintained tools that cannot cope with the code generated by more recent versions of the Solidity compiler.
There are some exceptions to the general trend. In the upper plot of Figure 1, eThor (red) flags an increasing number of contracts near the end of the timeline. eThor reports contracts as insecure regarding a reentrancy attack or as provably secure from such an attack. The second type of finding is responsible for the rise from block 11.5 M onward, while the number of contracts flagged as insecure declines like the orange line. Hence, the rise in eThor's findings actually confirms that reentrancy issues become less common. For our further discussions on weaknesses, we will consider the finding 'insecure' only.
Another exception is Vandal (third plot in Figure 1, orange). It flags 76 % of the contracts even though it only checks for five weaknesses (SWC 104, 105, 106, 107, 115). 97 % of the contracts with a CALL instruction are reported to contain an unchecked call (SWC 104), in most cases also a reentrant call (SWC 107). This is surprising, as the return value of most calls is indeed checked, in particular for method calls, where the Solidity compiler automatically inserts appropriate instructions. Vandal even flags contracts without a CALL instruction, probably irritated by a byte among the meta-data with the same value as CALL. We omit Vandal from the comparison below, since its overreporting distorts the overall picture.

SWC classes detected. In Table 4, we give an overview of the weaknesses reported by the tools that we are able to map to a suitable SWC class. The column frequency counts the number of unique skeleton bytecodes where at least one tool reports the respective weakness. As the tools tackle differing subsets of the SWC classes, the number of tools addressing a specific weakness varies from one to seven. Due to our cumulative counting, the frequency of a weakness increases with the number of tools claiming to detect it, especially with overreporting tools.
Figure 2 depicts the 15 SWC classes on the timeline of 14 M blocks. For every SWC class, a data point represents the percentage of skeleton bytecodes in a bin of 100 k blocks that were marked with the respective weakness by at least one tool (Vandal excluded). The top plot shows the classes detected by four or more tools (SWC 101, 105, 107, 114, 116), the middle one those handled by two or three tools (SWC 104, 106, 112, 113, 124), and the third one those addressed by just one (SWC 110, 115, 120, 127, 128).
Comparison by SWC class. We see five weaknesses decreasing over time from a high (≥ 50 %) or medium (20 %) level to a medium or low (≤ 10 %) level: the findings of classes 101, 104, 107, 110, and 114 start falling from about block 4 M onward. The other 10 weaknesses stay on a steady but low level after block 4 M, except for 113 (middle plot), which fluctuates around 10 %, and 116 (top plot), which fluctuates around 20 %.
The decrease of potential integer overflows (101) seems natural: since version 0.8.0, the Solidity compiler adds appropriate checks automatically, and already some time before, the use of math libraries with the same effect had become quasi-standard. Reentrancy (107) is probably the most (in)famous vulnerability. The decrease in detection can be attributed at least partially to developers taking adequate precautions.

Observation 2. Of the 37 SWC classes, 15 are covered by at least one tool, and 8 by at least three tools. For all weaknesses but one, the number of flagged contracts decreases over time or stagnates on a low level. Reasons for the decrease in detected weaknesses are unmaintained tools that do not adequately cope with newer EVM instructions, as well as compilers and programmers taking counter-measures. At the end of the timeline, integer bugs (SWC 101), reentrancy (SWC 107), and block values as a proxy for time (SWC 116) are the most frequent weaknesses, with a share of about 20 % each. The overall picture indicates the need for further work on methods and tools.

Tool Quality
To assess the quality of the tools, we consider aspects related to maintenance, execution time, errors, and failures.
Maintenance. Table 7 gives an overview of the maintenance aspects of the tools (checked in February 2023).

Execution time. Table 8 gives the average runtimes in seconds for each tool. The column Overall averages over all 248 328 runs, whereas Success picks only the successful runs.

Errors. Several tools report errors for a large share of the contracts (marked red in Table 9), with the majority related to unknown instructions (mainly the shift operations SHR and SHL). This is also reflected in Figure 4, where virtually all errors occur after the Constantinople fork (block 7.28 M), when the shift instructions were added (EIP 145). Interestingly, even though Osiris and HoneyBadger are based on Oyente, the latter tool does not report any such errors. While most reported findings are not accompanied by any errors or failures, there are three notable exceptions. Maian detects numerous occurrences of Ether lock in spite of encountering unknown instructions. The same accounts for Osiris when it reports the Callstack bug. This is due to the fact that the tools apply local pattern matching instead of symbolic execution. Pakala reports a timeout for almost half of its analyses with findings.
eThor, Pakala, and teEther show a large number of timeouts (marked red in Table 9), which results in high average runtimes (marked red in Table 8). While Mythril shows a similarly high average runtime, it only has a low number of timeouts. In contrast to the other three tools, it offers a parameter for getting notified about the external timeout and so is able to finish in time.
Regarding out-of-memory exceptions, only teEther sticks out. Even with 32 GB of memory, it still fails for 16 % of the inputs.
Observation 3. Regarding resource consumption, a few tools require less than 60 s per contract with just a few GB of memory, whereas others regularly approach the limits of 30 min and 32 GB. The rate of tool-reported errors varies between 0 % and 60 %, with the high rates resulting from tools operating outside of their specification. Suspiciously, there are tools with similar limitations but without any reported error at all. Regarding robustness, eight tools throw an exception for less than 1 % of the contracts, as opposed to one tool with 25 % fails. Program issues like type exceptions may be a consequence of using the dynamically typed language Python.

Overlap Analysis
In this section, we investigate to what extent the tools agree in their judgments. We use the SWC registry as a common frame of reference and map all findings to an appropriate SWC class, if any. Clearly, this excludes weaknesses that do not fit any SWC class, as well as properties other than weaknesses. Most notably, the comparison excludes HoneyBadger, as it detects properties characteristic of honeypots.
To determine the degree of overlap, we use the following measure. For a tool t, let Swc(t) be the set of SWC classes that t is able to detect, and let Flagged(t, s) be the set of contracts that t flags for having a weakness of class s. We define the overlap between two tools t1 and t2 as

    Overlap(t1, t2) = ( Σ_{s ∈ S} |Flagged(t1, s) ∩ Flagged(t2, s)| ) / ( Σ_{s ∈ S} |Flagged(t1, s)| ),  where S = Swc(t1) ∩ Swc(t2).

The numerator counts, per weakness, the contracts flagged by both tools, while the denominator gives the number of all contracts flagged by the first tool. This measure is not symmetric. Overlap(t1, t2) = 100 % means that for the SWC classes in common, t1 flags a subset of the contracts flagged by t2. If additionally Overlap(t2, t1) = 100 % holds, then the two tools are in perfect agreement, something to be expected for t1 = t2 only.
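In code, the measure reads as follows; this is a direct transcription of the formula, with dictionaries standing in for Swc and Flagged.

    def overlap(swc, flagged, t1, t2):
        # swc[t]: set of SWC classes tool t detects
        # flagged[t, s]: set of contracts t flags for class s
        common = swc[t1] & swc[t2]
        both = sum(len(flagged[t1, s] & flagged[t2, s]) for s in common)
        first = sum(len(flagged[t1, s]) for s in common)
        return 100 * both / first if first else 0.0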
Table 10 shows the overlap between any two tools. Since eThor detects reentrancy only, its row and column in the table give us an impression of how differently a weakness may be assessed by the tools. Regarding Vandal, we find the highest values in its column and the lowest in its row. This means that most weaknesses it reports are not backed by other tools, a sign of overreporting. Another observation concerns Osiris and Oyente. We expect a high overlap, as Osiris extends Oyente. In fact, 90.2 % of Oyente's findings are backed by Osiris, while Oyente covers 58.5 % of Osiris' findings. Osiris not only detects additional weaknesses, which are not considered in the comparison, but also flags additional contracts with weaknesses the tools have in common.

Figure 5 shows the overlap in more detail. We exclude Vandal (due to its overreporting) and Oyente (as Osiris extends it) to avoid an inflation of overlaps. Each row gives a breakdown of the contracts flagged by a specific tool, for each SWC class covered by at least two tools. Blue identifies the share of contracts flagged exclusively by the tool, whereas red, green, and purple indicate the share also flagged by one, two, or more other tools. A good agreement shows as purple where four or more tools check for the SWC class (101, 105, 107), green where three tools detect it (104, 106, 112, 114, 116), and red where two tools do (113, 124).

SWC 101 - Integer Overflow and Underflow: We find hardly any agreement among the four tools. MadMax, by construction, checks for a subcase of 101 that is not covered by the other tools, but even green (overlap of three) is rare.
SWC 104 -Unchecked Call Return Value: The three tools show some agreement, as red and green dominate blue.
SWC 105 - Unprotected Ether Withdrawal: For this class, detected by six tools, we see the highest amount of purple among all classes.
SWC 106 - Unprotected SELFDESTRUCT Instruction: Virtually all of Maian's findings coincide with at least one other tool, while Ethainter and Mythril show a fair amount of blue. The top plot of Figure 4 provides an explanation: in the second half of the timeline, the error rate of Maian increases, as the tool fails to handle more recent contracts with new types of instructions, so Maian stops reporting weaknesses.
SWC 107 - Reentrancy: Even though reentrancy is one of the best-researched weaknesses and is detected by five tools, agreement of more than three tools is rare.

SWC 112 - Delegatecall to Untrusted Callee: This weakness is detected by three tools, hence the large amount of green actually indicates the best agreement in the chart. Ethainter seems to implement a more liberal definition of the vulnerability, as it flags many additional contracts (blue).
SWC 113 - DoS with Failed Call: MadMax has been designed to detect specific gas-related issues, which partly map to this class. There is some overlap with Mythril, but since the latter flags many more contracts under this label, the red share is not visible in Mythril's bar.
SWC 114 - Transaction Order Dependence: The bars are mainly blue and red, indicating little agreement between the three tools.

SWC 116 - Block Values as a Proxy for Time: Virtually all contracts flagged by Osiris are also flagged by one of the other tools, in most cases by both. The other tools, however, flag many more contracts, as the comparatively small size of the green part (representing the same group of contracts in all three bars) shows. As in the case of Maian and SWC 106 above, the error rate of Osiris increases in the second half of the study period, as new instructions prevent it from reporting weaknesses (Figure 4).

SWC 124 - Write to Arbitrary Storage Location: The contracts flagged by Mythril are essentially a subset of those flagged by Ethainter, but a small one, as the blue part of Ethainter's bar dominates.

Observation 4. There is little agreement between the tools regarding the findings, even for well-researched and frequently analyzed weaknesses such as reentrancy. Contributing factors are the lack of commonly accepted, precise definitions for the weaknesses as well as diverging approaches to detect them. A mutually low agreement suggests that the tools are rather complementary.

Relation between Findings, Errors/Failures, and Overlap
Taking a closer look at the evolution of two SWC classes, we discuss connections between the findings and their overlap in the context of the errors and failures that the tools report.

SWC 101 - Integer Overflow and Underflow: In Figure 6, the upper middle plot depicts the percentage of bytecodes flagged by the tools. Starting from different levels around 70 % and 40 %, Conkas and Mythril converge at 10 % at the end of the timeline. Osiris shows a weakness level comparable to these tools for most of the timeline but falls to 0 % towards the end. MadMax reports hardly any cases throughout the whole timeline. For Osiris, the drop to 0 % is related to the rise of errors (lower middle plot). The tool was not designed for the instructions introduced at later forks, so recent contracts lead to an error and the analysis aborts without findings. MadMax specializes in gas rather than arithmetic issues, with its one finding constituting a very specific type of overflow, a small subclass of SWC-101. Therefore, it is not surprising that it reports very low numbers.

The top plot in Figure 6 depicts the percentage of agreement between the tools reporting on SWC-101 on a timeline of blocks, in bins of 100 k blocks. The blue line shows the total percentage of bytecodes flagged for SWC-101 by at least one tool, while the black line shows the total absolute number of flagged bytecodes (scale on the right). While the absolute number of flagged bytecodes fluctuates (as it depends on the number of deployments), the percentage steadily drops from as high as 90 % to less than 20 %.
Regarding the overlap, the four tools hardly ever agree (top plot: purple spots on top). As MadMax does not report much by design, we focus on the other three tools. The green area mostly reflects the agreement between Conkas, Osiris, and Mythril, while the red area represents agreement between two tools. We observe that both the agreement between three tools and between two tools decreases over time. Simultaneously, the percentage of bytecodes increases where just a single tool reports a finding for SWC-101 (dark pink for MadMax, light pink for Osiris, light gray for Conkas, and medium gray for Mythril). Moreover, we see that even though the numbers of Mythril and Conkas converge towards the end (top: same size of light and medium gray areas; upper middle: blue and green lines), their overlap is close to zero (top plot: small red area, large light and medium gray areas).
SWC 107 - Reentrancy: In Figure 7, the upper middle plot depicts the percentage of bytecodes flagged by the tools. Except for the first 2 M blocks, Mythril, Osiris, Oyente, and Securify report consistently far fewer findings than Conkas and eThor. From block 3.5 M onward, Conkas and eThor report similar rates of reentrant contracts, but the Jaccard similarity of the flagged contracts (number of contracts flagged by both tools divided by the number of contracts flagged by at least one tool) is only 45 %, and drops to 28 % for the last part, where the graphs coincide. This is also reflected in the top plot, where the overlap between the two (red and green areas) decreases steadily from block 4.5 M onward, while the sole reporting (blue for eThor and light pink for Conkas) increases.
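For reference, the Jaccard similarity of two sets A and B of flagged contracts is

    J(A, B) = |A ∩ B| / |A ∪ B|.

If both tools flag roughly equally many contracts (|A| = |B| = n), then J = 0.28 implies |A ∩ B| = 2Jn/(1 + J) ≈ 0.44 n, i.e., each tool shares only about 44 % of its flagged contracts with the other one; this back-calculation is ours, not part of the study.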
The error rates (lower middle plot) stay low for all tools except Osiris. While the failure rate (bottom plot) increases for Conkas up to 70 %, it fluctuates between 20 % and 40 % for eThor. Hence, the decrease in findings for Conkas is related to the increase in its failure rate, while for eThor we cannot establish such a relation. On the contrary, for the last 2 M blocks, both rates decrease for eThor.
The top plot in Figure 7 depicts the percentage of agreement between the tools reporting on SWC-107 on a timeline of blocks, in bins of 100 k blocks. The blue line shows the total percentage of bytecodes flagged for SWC-107 by at least one tool, while the black line shows the total absolute number of flagged bytecodes (scale on the right). While the absolute number of flagged bytecodes fluctuates (as it depends on the number of deployments), the percentage steadily drops from as high as 80 % to less than 30 %.
Regarding the overlap, the five tools hardly ever agree (top plot: smallish purple area on top). Even though Mythril does not report high numbers (upper middle plot: green line), they are not backed by the other tools (top plot: medium gray area indicating sole reporting by Mythril).

Related Work
Recent Systematic Reviews on Analysis Tools. Two studies from early 2022 show that the automated analysis of Ethereum smart contracts still has room for improvement. Rameder et al. [2022] describe the functionalities and methods of 140 tools (83 open source) for automated vulnerability analysis of Ethereum smart contracts. Their literature review identifies 54 vulnerabilities, with some not addressed by any of the tools. Moreover, the authors find many tools to be unmaintained. Kushwaha et al. [2022a] provide a systematic review of 86 analysis tools with a focus on 13 common vulnerabilities. For quality assessment, they select 16 tools, which they test on five vulnerabilities using a ground truth of 30 contracts.

Tool Evaluations without Test Sets. In 2019, two surveys evaluated tools for vulnerability detection by installing them and working through the documentation: Di Angelo and Salzer [2019] investigated 27 tools with respect to availability, maturity, methods employed, and security issues detected. López Vivar et al. [2020] evaluated 18 tools regarding the ease of installation, usefulness, and updates. Neither study assesses the detection capabilities of the examined tools.
Benchmarked Evaluations. Most closely related are evaluations of tools that actually test them on a set of contracts (benchmark set). Authors of tools, however, tend to compare their own artifact to a few similar and/or popular ones. We do not consider those works, since they are intrinsically biased. Among the independent evaluations, we find 11 related works [Dika, 2017, Parizi et al., 2018, Gupta, 2019, Durieux et al., 2020, Ghaleb and Pattabiraman, 2020, Leid et al., 2020, Zhang et al., 2020, Dias et al., 2021, Ji et al., 2021, Ren et al., 2021, Kushwaha et al., 2022a], of which we give an overview in Table 11. The first two rows indicate the respective reference and the year when the evaluation was carried out. Rows three to five list the size of the benchmark set, separated into vulnerable and non-vulnerable contracts, or the total where the number of vulnerable contracts is unknown. All references use Solidity files as benchmarks. Row six indicates the number of different vulnerabilities tested. We highlight low numbers in red and commendably high numbers in green. We also list for each tool which evaluations it was part of. We highlight the five tools most often used in light blue. The last row gives the total number of tools used in each study. We highlight the five references using the most tools in mid blue.
The earliest evaluation dates back to 2017 and covers four tools tested on five vulnerabilities with a benchmark set of 23 vulnerable and 21 non-vulnerable contracts. Regarding the benchmark sets, the number of contracts varies widely, from only 10 to almost 50 000. The number of vulnerable contracts in the benchmark sets also varies widely, from 10 to 9 369. It should be noted that for benchmark sets from the wild, i.e., from the contracts actually deployed on the main chain, the true number of vulnerable contracts and of vulnerabilities contained in the contracts is unknown. The number of different vulnerabilities varies from 4 to 131. Most evaluations use their own taxonomy of vulnerabilities. This may be due to the lack of an established taxonomy [Rameder et al., 2022]. We find a total of 20 tools mentioned in the evaluations, while each work selects a subset thereof for its tests. The number of tools tested varies from three to a maximum of 16. The tools most often included in a comparison are Mythril, Oyente, Securify, Slither, and SmartCheck.
We note that our study considers the largest number of contracts from the wild (248 328 unique contracts) and it is the only one that considers the tools Conkas, Ethainter, eThor, MadMax, Pakala, teEther, and Vandal.
Open Source Frameworks. For a large-scale evaluation, we need an analysis framework that (i) facilitates the control of multiple tools via a uniform interface, (ii) allows for bulk operation, and (iii) is open source and usable. SmartBugs [Ferreira et al., 2020] is such an execution framework, released in 2019. It is still being maintained, with 13 contributors and over 70 resolved issues. The framework USCV [Ji et al., 2021] implemented similar ideas in mid-2020. It comprises an overlapping set of tools and an extension of the ground truth set. With a total of 10 commits (the latest in mid-2021) and no issues filed, it seems to be neither widely used nor maintained. Both frameworks target Solidity source code, and thus need to be expanded to work with bytecode.

Combining or Comparing Tool Results
When comparing or combining tool results, we have to deal with two challenges: (i) the different aims of the tools, which are reflected in the way their findings are reported, and (ii) differing definitions of the weaknesses (that are associated with the findings), which make it hard to map a finding to a class (within a common frame of reference) for comparison or combination.
The tools can be divided into four groups with respect to their aim (for a specific weakness): (i) proving the absence of a property that is regarded as a weakness or vulnerability, (ii) over-reporting so as not to overlook a potential weakness (aka issuing warnings), (iii) under-reporting by reporting only those weaknesses for which a verification could be found (avoiding false alarms), (iv) reporting properties that are hardly a weakness (e.g., honeypots) or not necessarily one (e.g., gas issues).
This distinction is important when comparing tools. It strongly affects the number of agreements. As we have seen in Section 4.4 and Section 5.1, the overall agreement is very low, which is partly due to the fact that tools address different versions, subsets, or supersets of a weakness class. Considering the different aims of the tools, the low general agreement is not surprising. However, it is even low for tools with similar aims.
The aims of the tools also impact voting schemes that combine the results of several tools to 'determine' whether a contract is actually vulnerable. For over-reporting tools, it may make sense to have a majority vote. However, the results of under-reporting tools should rather be joined than intersected.
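A sketch of such a combination, under the simplifying assumption that each tool contributes one set of flagged contracts per weakness class:

    def combine(over_reports, under_reports):
        # over_reports / under_reports: lists of sets of flagged contracts
        # majority vote among the over-reporting tools ...
        candidates = set().union(*over_reports) if over_reports else set()
        majority = {c for c in candidates
                    if sum(c in r for r in over_reports) > len(over_reports) / 2}
        # ... but a plain union for the under-reporting tools
        union = set().union(*under_reports) if under_reports else set()
        return majority | union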

Comparison to Source Code as Input
We selected the tools in our study for their ability to process runtime code, as our goal was to analyze contracts deployed on the main chain, for which Solidity source code is often unavailable. Moreover, this allowed us to include Ethainter, eThor, MadMax, Pakala, teEther and Vandal, which require runtime code. The other selected tools accept both bytecode and Solidity source code. In this section, we discuss the effect of using source code as the input.
Conkas, HoneyBadger, Osiris, Oyente and Securify compile the Solidity source to runtime code and then perform the same analysis as if the latter had been the input. There are two differences, though. First, the tools are able to report the location of weaknesses within the source, as they use a mapping provided by the compiler to translate bytecode addresses back to line numbers. Second, for Solidity sources with more than one contract, the tools compile and analyze each one separately. As complex contracts are structured into several layers of intermediate contracts using inheritance, this leads to redundant work. While compilation and address mapping incur a negligible overhead, the additional contracts may lead to fewer or more findings within a fixed time budget, depending on whether there is less time for the main contract or whether other contracts contribute additional findings.

Maian and Mythril compile the Solidity source as well but proceed with the deployment code, which also covers contract initialization. Maian deploys the contract on a local chain and checks some properties live, like whether the contract accepts Ether. Moreover, the findings are filtered for false positives by trying to exploit the contract on the chain. Mythril, on the other hand, uses the deployment code to also analyze the constructor. For both tools, resource requirements and results will vary with the chosen form of input.

Threats to Validity
Internal validity is threatened by the integration of the new tools into SmartBugs. We mitigated this threat by carefully following the SmartBugs instructions for tool integration and by consulting the documentation and the source code of the respective tools. Multiple authors manually analyzed all execution errors to ensure that we had configured the tools adequately. Moreover, we make the implementation and the results accessible for public inspection.
External validity is threatened by the use of single bytecodes as proxies for the code families identified by the same skeleton. These representatives may not accurately reflect those code properties of all family members that are relevant to weakness detection. We mitigated this threat by RQ1. However, the random sample of 1 000 bytecodes (620 families) may be too small, so our answer to RQ1 may not generalize to all bytecodes.
Construct validity is threatened by our mapping between the detected weaknesses and the SWC classes. The mapping reflects our understanding of the weaknesses and of what the tools actually detect, which may be incorrect. We mitigated this risk by involving all authors during the mapping phase and by discussing disagreements until we reached a consensus. Another potential threat is the resource limits of 30 minutes and up to 32 GB per tool and bytecode. This configuration is in line with, or surpasses, related work.

Conclusion
In this work, we study the evolution of both analysis tools and the findings they report. Regarding tools, we select open-source tools that take runtime bytecode as input to detect weaknesses. As for the tested blockchain programs, we aim at a full coverage of the smart contracts deployed on the Ethereum main chain up to block 14 M (48 M contracts). By considering only smart contracts with different skeletons, we manage to scale to the entire blockchain. In total, we run 13 tools on 248 328 contracts with an execution time of 31 years.
We show that skeletons are a suitable abstraction for performing bytecode analysis, particularly for tools that do not require any dead code or meta-data. Moreover, we analyze the evolution of detected weaknesses as well as tool failures and errors, and investigate agreements between the tool findings. We detect a total of 1 307 484 weaknesses, with most of them related to Reentrancy (14.1 %), Unchecked Call Return Value (14.0 %), and Integer Overflow and Underflow (9.4 %). Interestingly, the frequency of these weaknesses is declining over time, partly due to better developer awareness and compiler improvements, but also due to increasing tool failures.
We observe that type errors are a frequent cause of tool failure. Additionally, we noticed that the execution times for failed executions increased significantly, suggesting that reasonable timeouts can be set while still obtaining useful findings (this is particularly important for CI/CD). Our study indicates that there is still room for improvement regarding automated weakness detection.
The overlap analysis revealed a low agreement between the tools.This is mainly due to their differing definitions of the weaknesses.Therefore, it is beneficial to use a range of tools even for the same weakness class, since the tools will deliver complementary results in many cases.
Our work has already contributed to the community. We contacted several tool developers by filing issues (Conkas, Maian, Mythril) as well as by exchanging emails and engaging in discussions (eThor, Ethainter, MadMax, Mythril, Osiris, Vandal). The extension of SmartBugs to also accept bytecode as input has already been taken up by the framework Centaur.
Listing 1: Solidity contract creating contracts during deployment (C2) as well as during runtime (C3).

Figure 1: Accumulated findings per tool over time. Each data point shows the percentage of bytecodes for which the tool reports a finding, in bins of 100 k blocks.

Figure 2: SWC classes over time. Percentage of bytecodes flagged with a specific weakness, in bins of 100 k blocks.

Figure 3: Tool failures over time. Percentage of failures encountered by the tools, in bins of 100 k blocks. Note the stretched vertical scale of the lower plots. Ethainter and MadMax had no failures and therefore do not show up.

Figure 4: Tool errors over time. Percentage of errors reported by the tools, in bins of 100 k blocks. Note the stretched vertical scale of the lower two plots. Mythril, Oyente and Vandal had no errors and thus are not depicted.

Figure 5: Agreement of the tools on the SWC classes. Each bar shows the proportion of weaknesses identified by one, two, three, and more tools.

Figure 6: SWC-101 Integer Overflow and Underflow on a timeline of blocks, in bins of 100 k blocks. Top: percentage of overlaps. Upper middle: percentage of bytecodes flagged by the tools. Lower middle: error rate of tools. Bottom: failure rate of tools.

Figure 7: SWC-107 Reentrancy on a timeline of blocks, in bins of 100 k blocks. Top: percentage of overlaps. Upper middle: percentage of bytecodes flagged by the tools. Lower middle: error rate of tools. Bottom: failure rate of tools.
According to the Common Weakness Enumeration (cwe.mitre.org), weaknesses are flaws, faults, bugs, or other errors in software or hardware implementation, code, design, or architecture that, if left unaddressed, could result in systems, networks, or hardware being vulnerable to attack.

Table 2: Tools Selected for Study

Table 3: Resource Consumption and Out-of-Memory (OOM) Conditions

Table 4: Weakness Classes and Reports

Table 5: Mapping of Tool Findings to SWC Classes (excerpt)

Mythril  Delegatecall to user supplied address           SWC 112
Mythril  Dependence on predictable environment variable  SWC 116
Mythril  Dependence on predictable environment variable  SWC 120
Mythril  Dependence on tx origin                         SWC 115
Mythril  Exception State                                 SWC 110
Mythril  External Call To User Supplied Address          SWC 107
Mythril  Integer Arithmetic Bugs                         SWC 101
Mythril  Jump to an arbitrary instruction                SWC 127
Mythril  Multiple Calls in a Single Transaction          SWC 113
Mythril  State access after external call                SWC 107
Mythril  Unchecked return value from external call       SWC 104
Mythril  Unprotected Ether Withdrawal                    SWC 105
Mythril  Unprotected Selfdestruct                        SWC 106
Mythril  Write to an arbitrary storage location          SWC 124

Table 7: Maintenance Aspects of Tools (checked in February 2023)


Table 9: Findings, Errors and Failures of Tools

Table 10: Overlap of Tool Findings [%]

Table 11: Overview of Evaluations with Benchmarks