Android decompiler performance on benign and malicious apps: an empirical study

Kargén, Ulf; Mauthe, Noah; Shahmehri, Nahid

doi:10.1007/s10664-022-10281-9

Android decompiler performance on benign and malicious apps: an empirical study

Open access
Published: 20 February 2023

Volume 28, article number 48, (2023)
Cite this article

Download PDF

You have full access to this open access article

Empirical Software Engineering Aims and scope Submit manuscript

Android decompiler performance on benign and malicious apps: an empirical study

Download PDF

2270 Accesses
Explore all metrics

Abstract

Decompilers are indispensable tools in Android malware analysis and app security auditing. Numerous academic works also employ an Android decompiler as the first step in a program analysis pipeline. In such settings, decompilation is frequently regarded as a “solved” problem, in that it is simply expected that source code can be accurately recovered from an app. On the other hand, it is known that, e.g, obfuscation can negatively impact a decompiler’s effectiveness. Therefore, in order to better understand potential failure modes of, e.g., automated analysis pipelines involving decompilation, it is important to characterize the performance of decompilers on both benign and malicious apps. To this end, we have performed what is, to the best of our knowledge, the first large-scale study of Android decompilation failure rates, using three sets of apps; namely, 3,018 open-source apps, 13,601 apps crawled from Google Play, and an existing collection of 24,553 malware samples. In addition to the state-of-the-art Dalvik bytecode decompiler Jadx, we also studied the performance of three popular Java decompilers. Furthermore, this paper also presents the findings from a follow-up study on 54,945 malware apps, where we additionally performed an analysis of the reasons for decompilation failures. Our study revealed that decompilers generally have very low failure rates, and that few failures on benign apps appear to be related to obfuscation. On malware, however, obfuscation appears to be a more prominent cause of failures, although the vast majority of malicious apps could still be fully decompiled by an ensemble of decompilers.

Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the Wild

Testing android malware detectors against code obfuscation: a systematization of knowledge and unified methodology

Article 20 August 2016

AndroClonium: Bytecode-Level Code Clone Detection for Obfuscated Android Apps

1 Introduction

Decompilers, i.e., tools that can reconstruct the source code from a program binary, are ubiquitous aids when reverse-engineering malware or performing software security auditing. On the Android platform, decompilers are also used extensively to lift the Dalvik bytecode of Android apps into Java source code, prior to manual or automated inspection (Gamba et al. 2020; Chen et al. 2019; Shan et al. 2018; Tian et al. 2018; Pauck et al. 2018; Xue et al. 2017; Cen et al. 2015).

Compared to native-code decompilation, reconstructing source code from Dalvik bytecode is, generally speaking, significantly less challenging. This is because many of the hurdles of decompiling native machine code are eliminated due to the constraints placed on Dalvik bytecode by the Android system. For example, simply recovering a correct assembly-code listing from a native binary is an undecidable problem in itself, since code and data can be interspersed (Linn and Debray 2003). Moreover, as a consequence of the relatively loose structure of common file formats for executables, it can be challenging to identify all functions (as well as their bounds) in a native binary (Pang et al. 2021). Neither of these problems exist in the Dalvik bytecode setting, since all methods are encapsulated in a highly-structured container format called DEX (Dalvik Executable format). An additional challenge often faced when attempting to decompile native code is control-flow obfuscation, which is frequently used by both malware authors and legitimate software developers to prevent decompilation or disassembly (Roundy and Miller 2013; Junod et al. 2015). Similarly, several obfuscation techniques exist for the Java virtual machine (Chan and Yang 2004; Hou et al. 2006), which are able to prevent decompilation of Java bytecode back into legible source code. Such obfuscation techniques rely on introducing “fake” branches, which do not affect runtime semantics, but which prevent reconstruction of high-level control-flow constructs (e.g., by making the static control-flow graph irreducible) (Collberg et al. 1997). Applying such obfuscation techniques on Dalvik bytecode, however, is technically much more challenging (albeit not impossible), due to the so-called register-type conflict problem (Balachandran et al. 2016), which we briefly describe in Section 2. For this reason, it is more common to instead apply data-obfuscation techniques, such as identifier renaming or string encryption, to Android apps. Such techniques primarily aim to hide clues about program semantics from human analysts, rather than preventing decompilation per se.

As a consequence of the differences discussed above, the typical usage models of decompilers differ for native code and Dalvik bytecode: while it is generally recognized that native-code decompilers in many cases fail to reconstruct syntactically and semantically correct source code, and are therefore best used as an aid for manual reverse engineering, Dalvik bytecode decompilation is frequently regarded as a “solved” problem, where it can simply be expected that correct Java source code can be fully reconstructed from a DEX file. This sentiment is often reflected in the Android literature, where, for example, many works (Chen et al. 2019; Li et al. 2017; Wang et al. 2015; Gibler et al. 2012; Martín et al. 2017; Enck et al. 2011) use decompilation as the first step in an automated analysis pipeline. However, previous work (Harrand et al. 2019) has shown that decompilers for the Java virtual machine (JVM) frequently produce code with subtle syntactic or semantic errors, raising the concern that such problems also exist when decompiling Dalvik bytecode. While systems that use decompilation to extract features for approximate app similarity metrics (e.g., Li et al. (2017), Wang et al. (2015), Martín et al. (2017), and Cen et al. (2015)) might be able to tolerate minor correctness errors without critical degradation of functionality^{Footnote 1}, another potential cause for concern is the completeness of decompilation results. For example, a small-scale preliminary study on 151 open-source Android apps by Jang et al. (2019) indicated that popular decompilers frequently fail altogether to decompile a significant portion of methods in an app. Such completeness errors could potentially be even more detrimental to the reliability of automated analysis methods than minor syntactic or semantic decompilation errors.

It is clear from the above discussion that both the correctness and completeness of Android decompilation must be studied further. In this work, we have focused on the latter. As such, the first and primary research question (RQ1) that we have sought to answer is To what degree can we expect decompilers to successfully recover source code from Android apps?

Moreover, while control-flow obfuscation is presumably more rarely encountered in Dalvik bytecode than in native code, due to the aforementioned register-type conflict problem, the question remains: to what degree is decompilation-breaking obfuscation a concern when analyzing malware or commercial apps for the Android platform? We address this as our second research question (RQ2).

Here, it should be noted that Android apps can also contain native code components, whose decompilation are subject to the same challenges as with other native-code binaries, and which can also be subjected to control-flow obfuscation. However, because the limitations of native-code decompilation has already been well-studied, and because of the very different usage model for native-code decompilation, we have chosen to limit our focus to Dalvik bytecode decompilation in this study.

Our third research question concerns the performance of individual decompilers. The study by Harrand et al. (2019) showed that various idiosyncrasies of JVM decompilers can cause significant differences in relative performance between decompilers, depending on the program being analyzed. In a follow-up study (Harrand et al. 2020), they also showed that decompilation results can be combined to improve the overall correctness of recovered code. Similarly, the small-scale study by Jang et al. (2019) indicated that the same also holds true for Android decompilers. To determine if these preliminary results can be generalized, we have sought to answer the question: Do different Android decompilers tend to systematically fail on the same methods, or do their results complement each other? (RQ3)

We have addressed the three research questions above in a previous study (Mauthe et al. 2021). In addition to providing an extended presentation of the findings from that study, this paper also presents the results from a follow-up study on a large set of Android malware samples. Since our original results indicated that many decompilation failures appeared to be caused by implementation-level deficiencies, rather than fundamental limitations of the decompilation algorithms, we wanted to further study the reasons why decompilers fail. Therefore, in addition to analyzing the new dataset in the context of our original research questions, we also introduced a fourth research question: To what degree does implementation-level limitations, in contrast to fundamental algorithmic limitations, contribute to decompilation failures? (RQ4) Below, we summarize the contributions of our original study, as well as the new contributions presented in this paper.

Original Contributions

We have performed a large-scale study of the decompilation success rate (i.e., the ratio of methods for which the decompiler reports successful decompilation) for Android apps using four different decompilers. Our original evaluation was performed on three datasets, consisting of, respectively: 3,018 open-source apps from the F-Droid repository, 13,601 apps from a recent crawl of Google Play, and a collection of 24,553 Android malware samples collected between 2010 and 2016.
We have characterized the differences in decompilation success rate between the datasets, and performed a preliminary analysis of potential causes of these differences.
Furthermore, our statistical analysis was complemented with a manual analysis of a number of Android apps.

New Contributions

We recognized, as a threat to the validity of our original study, that many samples in the original malware set was quite old. Therefore, we have repeated the statistical analysis for research questions RQ1–RQ3 on an additional large set of more up-to-date Android malware, consisting of 54,945 apps, and report on how the results differ from those of the old malware set.
We have complemented the results for RQ3 with a more in-depth analysis of decompiler co-failure rates.
Finally, as the largest new contribution of this work, we have performed data-mining on all error messages emitted by decompilers, when run on the new malware dataset, in order to gain better insights into the reasons for decompilation failures (RQ4).

Additionally, we make our implementation and collected data available in the interest of open science^{Footnote 2}.

The rest of the paper is structured in the following way: In Section 2, we provide some background on Android decompilation and obfuscation techniques. We outline the methodology for our study in Section 3, and present our results in Section 4. The results of our follow-up study on reasons for failures are presented in Section 5. We discuss the findings and potential threats to validity in Section 6, and survey related work in Section 7. Finally, Section 8 concludes the paper.

2 Background

In order to make the paper self-contained, we will start by providing some brief background information on a few important concepts.

Android App Runtime Model

Android apps are developed in the Java or Kotlin languages, and compiled to Dalvik bytecode. Apps are distributed in the form of Android Application Packages (APKs), which contain one or more files of the DEX format. DEX files in turn contain a number of classes, including Dalvik bytecode for each method of a class. On Android versions prior to 5.0, Dalvik bytecode was interpreted by a virtual machine. Modern versions of Android instead use the Android Runtime (ART), which avoids the overhead of interpretation by pre-compiling the Dalvik bytecode to native code when an app is first installed.

Android Decompilation

In addition to native Dalvik decompilers^{Footnote 3}, Java decompilers can often also be used on Android apps by first converting the Dalvik bytecode into equivalent bytecode for the JVM, using a tool such as ded (Enck et al. 2011) or dex2jar^{Footnote 4}. Since the Kotlin language is designed to be fully interoperable with Java, apps written in Kotlin can generally also be decompiled into Java source code.

Android Obfuscation

Android apps frequently make use of obfuscation to prevent intellectual property theft, such as redistribution of paid apps, or ad-fraud. (The latter implies repackaging apps with modified identifier tokens for ad services, in order to gain ad revenue based on other developers’ work.) One of the most common types of obfuscation is identifier renaming, wherein human-readable identifiers for, e.g., methods or variables, are replaced with meaningless strings. This obfuscation is sometimes also applied to open-source apps, since it tends to make the final APK smaller. Another common obfuscation method is string encryption, which works by removing strings from a DEX file and replacing them with an encrypted variant. Decryption routines are then injected at the places where strings are used in the code, so that the strings can be decrypted on-the-fly during runtime. A more advanced form of obfuscation is class encryption, whereby an entire class is stored in encrypted form and reconstructed at runtime using Java’s reflection API. Packing is a similar approach to obfuscation, where all the Dalvik code of an app is stored in encrypted form, and decrypted at runtime using a wrapper program.

A common form of control-flow obfuscation works by inserting “fake” branches to random or invalid code locations, where the branches are guarded by so-called opaque predicates (see for example (Collberg et al. 1997; Linn and Debray 2003)). Such predicates are hard to evaluate statically, but always give the same outcome at runtime. This kind of obfuscation is applicable both to native code and to bytecode for the JVM, and provides a strong defense against decompilation, as it often cannot be automatically broken without resorting to prohibitively expensive methods, such as symbolic execution (Ming et al. 2015). On Android, however, this technique is considerably harder to implement due to the register-type conflict problem. While the JVM is stack based, the Dalvik virtual machine is register based. During compilation to native code, the ART compiler will check that there are no instances where a register holds data of conflicting types along any control-flow path in a method. (For example, if an integer is written to a register at some point, and at a later point that register is read as a floating point number, a register-type conflict is reported, and compilation is aborted.) “Fake” branches stemming from control-flow obfuscation frequently cause this type of conflict. While methods for partially overcoming this problem have been described by Balachandran et al. (2016), it is unclear to what degree, if any, this type of anti-decompilation technique is used in the wild for Dalvik bytecode.

3 Methodology

In this section, we outline the methodology of our work. We begin with a detailed description of the approach used in our original study, followed by a discussion of some of its limitations. Finally, we describe the methodology used in our follow-up study.

3.1 Original Study

As depicted in Fig. 1, we begin by gathering APKs from three different sources, in order to study decompilation characteristics of different kinds of apps. We collected 3,018 open-source apps from the F-Droid repository^{Footnote 5} and 13,601 apps from the Google Play store. Finally, we used the existing Android Malware Dataset (AMD) compiled by by Wei et al. (2017), consisting of 24,553 Android malware samples collected between 2010 and 2016. While the samples in this dataset are quite old, a benefit of the AMD dataset is that each sample is labeled with its family, allowing us to compensate for bias due to some families being over-represented in the dataset.

3.1.1 Gathering Apps

Retrieving apps from the F-Droid repository is quite straightforward, as all apps can simply be enumerated and downloaded. The Google Play store, however, does not allow downloading apps in bulk. Therefore, similarly to previous works, we had to implement a custom crawler by partially reverse-engineering the internal Google Play API. As our aim was to collect the most popular applications in the store (i.e., the ones with the largest user-exposure), we used an approach similar to, e.g., Backes et al. (2016) and crawled Google Play by category. Our crawler first retrieves the current set of thematic categories present in Google Play and then goes on to query each of those for their respective subcategories. These subcategories are not thematic, but instead are of a commercial nature, displaying the highest grossing, highest selling and most popular applications. As we only want to include free applications in our dataset^{Footnote 6}, we omit crawling the highest selling applications and focus on the other two subcategories. The crawler then queries the store API for all applications contained in each subcategory, and downloads all of them. This way, our set of apps will consist of the most popular apps in each category.

As the top grossing categories may still contain paid apps and some applications are present in multiple subcategories, we needed to do further pruning of duplicates and apps that failed to download as we did not purchase them. After pruning, we ended up with the aforementioned number of unique apps from 34 categories.

3.1.2 Measuring Decompiler Success Rate

In the next step, each app is decompiled with four different decompilers. In addition to the state-of-the art native Android decompiler Jadx^{Footnote 7}, we also used the three popular Java decompilers CFR^{Footnote 8}, Fernflower^{Footnote 9} and Procyon^{Footnote 10}. Before invoking the Java decompilers, we convert each app’s Dalvik bytecode to JVM bytecode using dex2jar. In case of failures, the error messages from each decompiler are fed to a custom parser that records the methods that failed to decompile. When the analysis of one app is complete, all output artifacts, such as log files and decompiled source code, are discarded in order to avoid excessive disk usage. Since decompilation sometimes takes a very long time for some apps, it was necessary to implement timeouts. We used a timeout of 5 minutes for dex2jar, and also set the timeout for each decompiler to 5 minutes.

Since packing effectively hides an app’s code from static analysis, decompilation is of little use for packed apps, unless the app is first unpacked by manual analysis. For this reason, we also wanted to detect if an app had been obfuscated with a packer. To this end, we use the APKiD tool^{Footnote 11}, which can detect signatures of many popular packers.

In order to compare the per-method performance of the decompilers, the final step of our approach is to unify decompiler outputs. We first extract signatures for every method in an app, using apkanalyzer^{Footnote 12} from the Android SDK. We use this list of method signatures as a reference point, and match these signatures with the failed methods of each decompiler. The total number of methods per app, and the size of each method (i.e., the size of the method’s bytecode) is also determined using apkanalyzer. Since all decompilers use slightly different formats for method signatures in their error reporting, we first preprocess the failed signatures to have a unified format. We also had to modify CFR somewhat, so that it outputs sufficient information about methods that it failed to decompile. Finally, we perform a simple textual matching of the unified signatures.

Our analysis platform was implemented in around 3,800 lines of Python. Crawling the datasets took about one week, and performing the analysis of all apps required around 4 weeks when running in parallel on three machines, each fitted with an 8-core Intel 9700K CPU.

3.1.3 Limitations

One general limitation of our approach is that we only match failed methods between the decompilers. In other words, we assume that a decompiler will always either successfully decompile a method, or emit an error message in a predictable format. If there are corner cases where this assumption does not hold, i.e., where decompilers silently “ignore” methods, we would not detect this as a failure, but would simply assume that the method (as reported by apkanalyzer) was successfully decompiled. The reason why we did not opt for the opposite approach of matching successfully decompiled methods is that this would be considerably more technically challenging, as it would require parsing the decompiled source code. Apart from substantially increasing the processing time required for each app, accurately recovering method signatures from the reconstructed source code could also potentially prove challenging, since the decompilation output would likely not always be fully compliant Java code.

There are also some problems that stem from limitations in the tools we use. These are summarized below.

Challenging Java Language Features

The way decompilers handle some specific features of Java reduces the accuracy of our signature matching. Inner classes is one such case. While apkanalyzer reports the fully quantified names of inner classes, some of the decompilers only report the method name and containing source code file of a failed method in an inner class. Therefore, we are forced to over-simplify in these cases, and consider all methods of inner classes with the same name in one file as matching, by omitting the inner class quantifier reported by apkanalyzer. This sometimes leads to an over-approximation of failures, namely when there are methods in multiple inner classes whose signatures match a decompilation failure. For example, consider a class A with two inner classes 1 and 2 where all three classes define a method void m(boolean). This might seem like an artificial case, but it often happens if classes 1 and 2 extend class A. In this case, apkanalyzer would output three different quantified method signatures:

Table 1

Android decompiler performance on benign and malicious apps: an empirical study

Abstract

Similar content being viewed by others

Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the Wild

Testing android malware detectors against code obfuscation: a systematization of knowledge and unified methodology

AndroClonium: Bytecode-Level Code Clone Detection for Obfuscated Android Apps

1 Introduction

Original Contributions

New Contributions

2 Background

Android App Runtime Model

Android Decompilation

Android Obfuscation

3 Methodology

3.1 Original Study

3.1.1 Gathering Apps

3.1.2 Measuring Decompiler Success Rate

3.1.3 Limitations

Challenging Java Language Features

Other Tool Limitations

3.2 Follow-Up Study

4 Results

4.1 Basic Dataset Statistics

4.2 Decompiler Performance

4.3 Failure Rate Diversity

4.4 Differences Between Datasets

4.5 Exploring Reasons for Differences

4.6 Manual Analysis

F-Droid

Google Play

AMD

5 Reasons for Failures

5.1 Classification of Exceptions

5.2 Differences in Co-Failure Rates

5.3 Resource-Exhaustion Failures

6 Summary and Discussion

6.1 Summary of Results

6.2 Threats to Validity

6.3 Future Work

7 Related Work

8 Conclusion

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of Interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation