Effects of Program Representation on Pointer Analyses — An Empirical Study

Static analysis frameworks, such as Soot and Wala, are used by researchers to prototype and compare program analyses. These frameworks differ in their heap abstraction, their modeling of library classes, and their underlying intermediate program representation (IR). Often, these variations pose a threat to the validity of the results, as the implications of comparing the same analysis implementation in different frameworks are still unexplored. Earlier studies have focused on the precision, soundness, and recall of the algorithms implemented in these frameworks; however, little to no work has been done to evaluate the effects of program representation. In this work, we fill this gap and study the impact of program representation on pointer analysis. Unfortunately, existing metrics are insufficient for such a comparison due to their inability to isolate each aspect of the program representation. Therefore, we define two novel metrics that measure these analyses’ precision after isolating the influence of class hierarchy and intermediate representation. Our results establish that the minor differences in the class hierarchy and IR do not impact program analysis significantly. They also reveal sources of unsoundness, which can aid researchers in developing program analyses.


Introduction
Researchers have proposed various approaches to enhance the precision and soundness of static analyses [6,9,10,14,17,26,30,31]. They use program analysis frameworks to prototype and evaluate their algorithms. A program analysis based on declarative specifications (an increasingly popular implementation paradigm) uses these frameworks to extract fundamental dataflow relations and feeds them as the ground facts to a Datalog engine.
Program analysis frameworks, primarily Soot and Wala, are being increasingly adopted in program analysis [11,31,40]. These frameworks provide APIs, which abstract internal program representation. However, program representation in these frameworks is heterogeneous in many aspects. A few of those are:
- Intermediate Representation (IR). The intermediate language for program representation is an abstraction of the object code (bytecode) or source code. It removes syntactic sugar from the language and transforms it into a (minimal) core language. Thus, analysis developers can focus on the core language features to define their analysis.
- Modeling of libraries in analysis scope. Real-life applications are seldom developed from scratch; instead, they reuse library modules. Whole-program analyses consider these libraries for soundness in terms of the class hierarchy, which forms the analyses' scope. Users can tune the scope to favor scalability over soundness.
- Heap Modeling. Heap modeling is the technique to model dynamic heap allocation statically. Precise heap modeling is undecidable; therefore, analyses use approximations to keep it decidable [20]. Apart from these approximations, optimizations may keep a low memory footprint at the cost of precision and soundness.
These factors influence the precision, scalability, and soundness of the analyses and, at the same time, impede a fair comparison of analyses. Earlier research (Späth et al. [29]) was concerned about the validity of results when comparing two analysis frameworks. Reif et al. consider the comparison of different frameworks "bogus" [21] at the outset. Although many earlier works have proposed techniques to enhance scalability and precision, little to no work was done on how program representation influences program analyses. As a result, a comparison of new analyses with existing analyses suffers from a threat to validity that might have been overlooked.
In this work, we fill the gap with an empirical study of these aspects of program analysis frameworks.
We choose pointer analysis for this study. Pointer analysis computes the heap locations referred to by program variables and builds the foundation for many other analyses, such as alias analysis, type-state analysis, or program slicing. To evaluate intermediate representation and library modeling, we choose Doop, a state-of-the-art pointer analysis framework, and compare its analysis for different frontends. For the third aspect, heap modeling, we compare the pointer analysis of Wala (a state-of-the-art program analysis framework) with Doop using Wala's frontend, i.e., leveraging the identical intermediate representation.
A challenging aspect of this work is that the existing notions of precision for pointer analysis are insufficient. The computation of these metrics does not isolate single aspects of pointer analysis but rather combines all effects. For example, the average points-to set size is influenced by all three of the aforementioned aspects, and it is difficult to determine the effect of each aspect by only looking at the score. In this work, we counteract this problem by introducing metrics that isolate a particular aspect under study and nullify the effect of the others. To this end, we define two novel metrics in Section 3.1, one for measuring the effects of library modeling and one for the effects of the IR, to enable a fair comparison among frameworks. To the best of our knowledge, this is the first study that evaluates the impact of program representation on pointer analysis. Precisely, in this paper, we make the following contributions:
- We defined two metrics for evaluating each aspect in isolation, one for the modeling of library classes, the other for the IR.
- We evaluated the differences in library modeling and found that these have little influence on program analyses. Additionally, we discovered sources of unsoundness in these frameworks.
- We evaluated the precision for different IRs and found that they have no impact on the precision of virtual method call elimination.
- We empirically found differences in heap abstractions even for analyses claiming the same levels of context-sensitivity regarding the types of heap objects.
In summary, our empirical study dispels the threats to the validity of the results of existing works posed by these differences of frameworks. It also discovers novel sources of unsoundness and imprecision in existing frameworks that provide suggestions that users/developers of these frameworks could incorporate into their analyses. Although we focus on pointer analysis in the paper, our results are, in principle, generalizable to many other static analyses, as the findings presented in this paper also hold for these. We have made the artifacts available on https://github.com/jpksh90/pointeval to facilitate reproduction.

Background and Motivation
The goal of pointer analysis is to determine which objects a variable may refer (point) to at runtime. A points-to set is a static approximation of the answer to this question; the points-to relation maps variables to objects that are allocated on the heap (heap objects). More precisely, if V is the set of variables in a program, and H is the set of heap objects, then points-to : V → P(H). points-to(v) returns the set of heap objects in H referred to by v.
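To make the signature points-to : V → P(H) concrete, the following Python sketch computes such a map for a toy program consisting only of allocations and copy assignments. It is a generic flow- and context-insensitive, inclusion-based fixpoint, not the Datalog rules of Doop or the solver of Wala; the statement encoding and the names `solve_points_to`, `allocs`, and `assigns` are assumptions made for this illustration.

```python
from collections import defaultdict


def solve_points_to(allocs, assigns):
    """Flow- and context-insensitive, inclusion-based points-to sketch.

    allocs:  iterable of (variable, heap_object), one per `v = new ...`
    assigns: iterable of (target, source), one per `target = source`
    Returns a dict mapping each variable to its set of heap objects.
    """
    points_to = defaultdict(set)
    for var, obj in allocs:
        points_to[var].add(obj)

    changed = True
    while changed:  # iterate until no points-to set grows any more
        changed = False
        for target, source in assigns:
            before = len(points_to[target])
            points_to[target] |= points_to[source]
            if len(points_to[target]) != before:
                changed = True
    return dict(points_to)


# Hypothetical toy program: x = new A(); y = new B(); z = x; z = y; w = z
allocs = [("x", "h1"), ("y", "h2")]
assigns = [("z", "x"), ("z", "y"), ("w", "z")]
pts = solve_points_to(allocs, assigns)
```

In this toy program `z` and `w` both end up with the set {h1, h2}, since a flow-insensitive analysis merges the two assignments to `z`.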
Doop is a framework that exclusively focuses on pointer analysis, defines the analysis' inference rules in Datalog [41], and is in active development. It supports tuning of the analysis to adapt for various factors of precision (and scalability). Doop leverages the program synthesizer Soufflé [12,22] to resolve points-to according to the inference rules and the ground facts, which are derived directly from the program.
Wala [37] and Soot [28] are general-purpose program analyzers providing some pre-defined analyses and APIs for the development of custom analyses. Wala comes with various pre-defined pointer analyses [39], some of which feature novel optimizations to enhance scalability.
A context-sensitive analysis improves a pointer analysis' precision by discerning method calls based on their calling contexts. Popular notions of contexts are based on method callsites [23] (callsite-sensitive), invoking objects [19] (object-sensitive), or hybrids thereof [13].
In the sequel, we explain the need for this study by exemplifying the three factors that influence the results of pointer analyses.

Intermediate Representation
Many program analysis tools leverage an intermediate representation (IR) instead of the actual source or bytecode for analysis. IRs remove syntactic sugar from the source code and make it amenable to analysis by focussing on the fundamental operations. Popular strategies for IR generation are based on three-address code or static single assignment (SSA) form [4]. By default, the Soot framework uses a three-address-based IR (Jimple) [35], while Wala uses an SSA-based IR [38]. Both IRs are register-based [36,38], and hence introduce synthetic variables to mimic the stack-based Java bytecode. Doop can be configured to leverage either Jimple or Wala's IR as a frontend for program representation.
Consider the code example in Listing 1.1 and its Jimple IR depicted in Listing 1.2. The main method declaration (line 2) translates to the almost identical line 3 in the IR, whose parameter is translated to the variable @parameter0 (line 6). Due to the additional local variable r0 (line 4), the single main method argument translates to two variables in the IR. The invocations of the static method getInstance (lines 3 and 4 of Listing 1.1) are translated to the corresponding operation code staticinvoke with the method name and arguments. The newly allocated objects returned from these factory method invocations are stored in the variables r1 and r2.
Wala's IR generation differs significantly from Soot's (see Listing 1.3). As an SSA-based IR, it does not assign names to method parameters and variables but ordinal numbers (starting from '1') called variable numbers (we prepend 'v' to these numbers for clarity). Thus, the receiver object (the this reference in Java), or the first parameter in the case of a static method, is (silently) assigned the number v1. Further method parameters are assigned subsequent variable numbers, succeeded by local variables. Again, the static method calls to the method getInstance are translated to invokestatic, where v3 and v6 hold the (implicitly defined) constant arguments 6 and 7. The objects returned from the factory method invocations are stored in the variables v5 and v8. Potential exceptions thrown in the invoked methods are stored in v4 or v7, respectively.
The differences in program representation influence the metrics of pointer analysis. We analyzed Listing 1.1 context-insensitively with Doop, using Jimple and Wala's IR. The results are shown in Listing 1.4: the main method parameter object «main method array» is referred to by one variable in Wala (line 2) but by two variables in Soot (lines 4 and 5). Even though the average points-to set size is 1 for all variables in Listing 1.4, we found noticeable differences in the average points-to set sizes in other program's analyses: with Soot's frontend, the average points-to set size is 2.07 for 3328 variables, whereas with Wala's frontend it is 1.95 for 2298 variables. Jimple again created more variables than Wala's IR. These subtle differences in program representation affect the average points-to set size, and it is unclear whether these two numbers are in fact comparable. In this work, we aim to investigate the impact of IRs on the precision and scalability of the analysis (Section 4.3).

Static modeling of libraries
As a whole-program analysis, a pointer analysis requires knowledge not only of the program to be analyzed but also of the library classes, especially those related to the runtime. For example, a whole-program analysis of a Java application would require the runtime libraries, such as those in rt.jar, and other dependent libraries bundled with the application. Analysis frameworks such as Soot and Wala construct the class hierarchy based on all classes present in libraries and the application. They can also remove "irrelevant" classes, favoring scalability over soundness. Interestingly, we found cases where some frontends do not load all of the required classes, which induces discrepancies when comparing the analyses.
Consider the program shown in Listing 1.1. To corroborate our intuition, we analyzed this program context-insensitively with Soot's and Wala's frontends. Using the former frontend, Doop loads 3,837 classes and computes the analysis with an average points-to set size of 2.07. With Wala's frontend, it loads 19,927 (~5×) classes for analysis with an average points-to set size of 1.95. Further investigating the types of heap objects, we found that Doop with Wala's IR contains objects of the class java.security.PrivilegedActionException, which is absent in the analysis with Soot. Note that our simple program contains no instance of that type, so it must stem from analyzing libraries. In another instance, Soot loads the classes from javax.crypto, whereas Wala does not. In this research, we examine the imprecise modeling and discover possible implications on precision and soundness (Sections 4.1 and 4.2).

Heap Abstraction
Heap abstraction is an important aspect of pointer analysis and determines how object allocations are statically represented in the analysis. One simple approach is to create a unique representation for each object allocation site in the program (allocation site abstraction). However, at runtime, allocation sites can be executed more than once, creating several objects that are then represented by the same abstract value. As an example, consider the object allocation (line 8) of Listing 1.1, represented via a single abstract object, say a@8. In the main method, the newly allocated objects returned by getInstance are captured by the variables a and b, which would both refer to the abstract object a@8 in the result of the pointer analysis. Thus, a and b are spuriously considered aliases (i.e., referring to the same object). This imprecision stems from ignoring the calling context of getInstance (context-insensitive heap abstraction).
A context-sensitive heap abstraction (a.k.a. heap cloning) discerns the abstract heap objects based on the calling context, associating the calling context with the heap object to distinguish the allocations in a pair ⟨allocation site, call stack⟩. Thus, the allocation at line 8 is represented as two heap objects, ⟨a@8, 3⟩ and ⟨a@8, 4⟩. Without loss of generality, the length of the call stack can be increased to any finite number; an unbounded length would render the analysis undecidable. All state-of-the-art pointer analysis frameworks offer context-sensitive heap abstraction with a finite context length.
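The contrast between the two abstractions can be sketched in a few lines of Python. The sketch hard-codes only what the text states about Listing 1.1 (getInstance is called at lines 3 and 4 of main and allocates at line 8); the helper names `abstract_object`, `analyze`, and `may_alias` are hypothetical and do not correspond to any API of Doop or Wala.

```python
def abstract_object(alloc_site, call_stack, k):
    """Name the abstract heap object for an allocation.

    k = 0 yields the allocation-site abstraction; k > 0 keeps the last k
    call sites of the stack (heap cloning with a finite context length).
    """
    context = tuple(call_stack[-k:]) if k > 0 else ()
    return (alloc_site, context)


# Variable -> call stack (call-site line numbers) under which its object
# is allocated, following the description of Listing 1.1 in the text.
calls = {"a": [3], "b": [4]}


def analyze(k):
    """Points-to sets of a and b under a heap context of length k."""
    return {var: {abstract_object("a@8", stack, k)}
            for var, stack in calls.items()}


def may_alias(points_to, v1, v2):
    """Two variables may alias if their points-to sets intersect."""
    return bool(points_to[v1] & points_to[v2])


insensitive = analyze(k=0)  # a and b share the single object a@8
cloned = analyze(k=1)       # a -> <a@8, 3>, b -> <a@8, 4>
```

With `k=0` the two variables are reported as may-aliases, whereas with `k=1` their points-to sets are disjoint.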
The discussion above demonstrates how the choice of heap abstraction can (potentially) influence pointer analysis. Therefore, in this work, we study the frameworks' heap abstractions. We conducted a preliminary study to gain initial insights and to validate our intuition: we analyzed Listing 1.1 with one-call-site context-sensitivity in Doop with Wala's IR, and with the one-call-site sensitive analysis of the Wala framework. Both of these analyses use a context-sensitive heap abstraction with a context length of one. In spite of that, Wala creates 17 objects while Doop creates 133 objects (~7×). The average points-to set size varies between 1.55 for the analysis provided by Wala and 1.62 for Doop with Wala's IR. Thus, we can see that even with the same level of sensitivity in heap abstraction (and IR), analysis results depend on the framework used. Manual inspection revealed that Wala uses the context-sensitive heap abstraction selectively, applying it only to non-library classes while treating the library's objects context-insensitively. Out of the 17 heap objects, Wala uses context-sensitivity for only 6 objects. In contrast, Doop leverages context-sensitivity for all heap objects, including the library's objects. These initial insights motivated us to analyze the influence of heap abstraction on precision and scalability in more detail in Section 4.4.
To summarize, the parameters for program analysis such as IR (Section 2.1), static modeling of libraries (Section 2.2), and heap abstraction (Section 2.3) affect the precision and scalability of a pointer analysis. Based on initial insights, we analyze the influence of the mentioned parameters using different frameworks, frontends, and on a larger and diverse set of benchmark applications.

Metrics Used
The precision of a pointer analysis has been defined in numerous ways in the literature. Some of the metrics for precision available in the literature are the average size of the points-to sets, the number of call-graph edges, and the number of resolved virtual calls. These metrics are not clearly superior to one another but rather tailored to specific clients, for example, the latter is leveraged by compilers in devirtualization of virtual method calls.
All of these metrics reflect how precisely the analysis computes the points-to sets (sets of heap objects referred by a variable). For example, whether or not a virtual call can be resolved depends on the heap objects' types in the points-to set of the target variable. If there is only one type (or subtypes thereof that do not redefine the virtual method) then the virtual call is resolvable. Therefore, the precision of a client analysis depends on how precisely the points-to set for each variable in the program can be resolved, in other words, how low the value of the average points-to set size is. An average size close to one is considered the hallmark of pointer analysis [27].
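The resolvability criterion above can be illustrated with a short Python sketch: a virtual call is treated as resolvable when every type in the receiver's points-to set dispatches to the same method definition. The class hierarchy (`Shape`, `Circle`, `Square`) and the helper names are hypothetical and serve only as an example; real frameworks resolve dispatch on their own class-hierarchy data structures.

```python
def dispatch(cls, method, superclass, declares):
    """Walk up the hierarchy to the class whose definition of `method`
    would be invoked for a receiver of dynamic type `cls`."""
    while cls is not None:
        if method in declares.get(cls, set()):
            return cls
        cls = superclass.get(cls)
    return None


def is_resolvable(receiver_types, method, superclass, declares):
    """A call is resolvable if all receiver types share one dispatch target."""
    targets = {dispatch(t, method, superclass, declares)
               for t in receiver_types}
    return len(targets) == 1 and None not in targets


# Hypothetical hierarchy: Circle overrides area(), Square inherits it.
superclass = {"Shape": None, "Circle": "Shape", "Square": "Shape"}
declares = {"Shape": {"area"}, "Circle": {"area"}}

same_target = is_resolvable({"Shape", "Square"}, "area", superclass, declares)
two_targets = is_resolvable({"Shape", "Circle"}, "area", superclass, declares)
```

Here the first call site is resolvable although its points-to set contains two types, because `Square` does not redefine `area`; the second is not.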
Therefore, we leverage the widespread metric of average points-to set size for our evaluation, i.e., the ratio of the total size of the points-to sets to the total number of local variables [26,34]. It permits a client-agnostic comparison of the pointer analysis, which generalizes our evaluation results to any specific analysis. We refer to the average points-to set size as precision in this paper. Note that the actual precision of the analysis is inversely related to the average points-to set size: a lower precision value (i.e., average points-to set size) implies a higher precision of the computed analysis result, as precise analyses aim at excluding allocation sites that are unrealizable at runtime from the points-to sets of variables.
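As a minimal sketch of this metric, assuming the analysis result is available as a dictionary from variable names to sets of abstract heap objects (a representation chosen here for illustration, not the output format of any framework), the computation is a single ratio:

```python
def average_points_to_size(points_to):
    """Total size of all points-to sets divided by the number of variables."""
    if not points_to:
        return 0.0
    total = sum(len(objs) for objs in points_to.values())
    return total / len(points_to)


# Hypothetical analysis result with three local variables.
example = {
    "a": {"h1"},
    "b": {"h1", "h2"},
    "c": {"h1", "h2", "h3"},
}
precision = average_points_to_size(example)  # (1 + 2 + 3) / 3
```

A value closer to one indicates that, on average, each variable is resolved to fewer candidate heap objects.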
An IR may create many synthetic variables, among other reasons for method parameters or for φ-nodes at control-flow joins in SSA form. For example, three-address code re-uses the same variable in assignments in the if and else blocks of a conditional. However, SSA-based IRs insert a synthetic variable in a φ-node at the control-flow join to select one of the distinct variables of the respective blocks. The presence of synthetic variables in IRs impedes the comparison of different analyses using the average points-to set size, as averages depend on the (unequal) number of variables. Therefore, we devise heuristics to establish comparability of our metrics for different IRs.
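The effect of synthetic variables on the average can be seen in a small, hand-constructed example. It assumes a flow-insensitive analysis of a conditional that assigns a fresh object to x in each branch; the variable names and the two result dictionaries are written by hand for illustration and were not produced by Soot or Wala.

```python
# Three-address style: one variable x is assigned in both branches.
three_address = {"x": {"o1", "o2"}}

# SSA style: x1 and x2 are defined in the branches, x3 = phi(x1, x2).
ssa = {"x1": {"o1"}, "x2": {"o2"}, "x3": {"o1", "o2"}}


def average(points_to):
    """Average points-to set size over all variables."""
    return sum(len(objs) for objs in points_to.values()) / len(points_to)


def heap_objects(points_to):
    """All abstract heap objects that occur in some points-to set."""
    return set().union(*points_to.values())


avg_three_address = average(three_address)
avg_ssa = average(ssa)
```

Both representations mention the same two heap objects, yet the averages differ (2.0 versus 4/3) solely because the number of variables differs.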
Another challenge in this work is inferring the impact of each analysis parameter on its precision. Computed at the end of the analysis, the average points-to set size loses information on the contribution of a particular aspect of pointer analysis. Therefore, we require a fine-grained metric to quantify the precision for each parameter. We propose two such techniques, one for the class hierarchy and the other for the intermediate representation.
Class Hierarchy The analysis of the program's class hierarchy builds the foundation for inferring relevant variables and heap allocations. However, each framework leverages a particular strategy to infer classes that contribute to the program's semantics. Adding irrelevant classes to the class hierarchy may result in a synthetically precise analysis, as these classes add to the total number of variables (which will all be pointing to an empty set), thus potentially decreasing the average size of points-to sets. Some of these variables and heap allocations are not part of the actual code executed at runtime, but rather arise out of an imperfect model of the program analysis framework's frontend. Here, we study the variables and heap objects stemming from the additional classes exclusive to a framework.
We first instrument the Doop framework to log the class hierarchies and compare the class hierarchies obtained using Soot and Wala as frontends, which yields the classes exclusive to each of the frameworks. CH_soot and CH_wala denote the sets of classes in the class hierarchies of Soot and Wala, respectively. CH_common = CH_soot ∩ CH_wala is the set of classes common to both frameworks. We define CH-precision in terms of the average points-to set size restricted to variables defined in methods of CH_common.
If an analysis does not contain any exclusive classes or all of their variables (and corresponding heap objects) belong to the types present in the set of exclusive classes, CH-precision equals the average points-to set size.
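A possible reading of CH-precision is sketched below in Python. The sketch assumes that each variable can be mapped to its declaring class and that only variables are filtered (heap objects are left untouched); the function name `ch_precision`, the class names, and the sample data are hypothetical and are not taken from the evaluation.

```python
def ch_precision(points_to, declaring_class, ch_common):
    """Average points-to set size restricted to variables whose declaring
    class belongs to the common class hierarchy."""
    kept = [objs for var, objs in points_to.items()
            if declaring_class[var] in ch_common]
    if not kept:
        return 0.0
    return sum(len(objs) for objs in kept) / len(kept)


# Hypothetical class hierarchies reported by the two frontends.
ch_soot = {"app.Main", "java.lang.Object", "javax.crypto.Cipher"}
ch_wala = {"app.Main", "java.lang.Object", "sun.nio.cs.UTF_8"}
ch_common = ch_soot & ch_wala
exclusive_wala = ch_wala - ch_common

# Hypothetical result with Wala's frontend: one variable stems from an
# exclusive class and points to nothing.
points_to = {"main/a": {"h1", "h2"}, "main/b": {"h1"}, "clinit/tmp": set()}
declaring_class = {"main/a": "app.Main", "main/b": "app.Main",
                   "clinit/tmp": "sun.nio.cs.UTF_8"}

plain_average = sum(len(o) for o in points_to.values()) / len(points_to)
restricted = ch_precision(points_to, declaring_class, ch_common)
```

In this constructed case the empty points-to set of the exclusive class lowers the plain average to 1.0, while the restricted score is 1.5.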
Intermediate Representation (IR) The choice of IR determines a program's representation but retains the program's semantics, particularly with respect to heap allocations. Thus, different IRs can differ in the number of variables but will not introduce additional heap objects (e.g., Listing 1.4). A fundamental difference between Soot's Jimple and Wala's SSA-based IR is that SSA creates unique variables for each variable definition, while three-address code does not. Rendering our precision metric comparable for structurally different IRs is challenging, as tracking which variables correspond to each other is technically involved and the correspondence may not be unique. Therefore, we rely on a heuristic to determine comparable variables. We motivate the heuristics considering two different IRs for the main method in Listing 1. To determine the set of interesting methods (M), we leverage the logs from pointer analyses and segregate the variables in the logs according to the declaring method (m). If the sizes of the corresponding sets differ for a method m, it is considered interesting. (M is confined to the set of methods defined in CH_common to exclude the exclusive classes.) Subsequently, we determine the points-to relation for the variables in M.
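The selection of interesting methods can be sketched as follows. The sketch assumes that each log is a list of (method, variable) pairs and that only methods occurring in both logs are compared; both are assumptions, as are the class name `Main`, the variables listed for getInstance, and the helper names. The variable names for main loosely follow the Jimple and Wala descriptions in Section 2.1.

```python
from collections import defaultdict


def variables_by_method(log):
    """Group logged variables by declaring method."""
    grouped = defaultdict(set)
    for method, var in log:
        grouped[method].add(var)
    return grouped


def interesting_methods(log_a, log_b, class_of, ch_common):
    """Methods of common classes whose variable counts differ between IRs."""
    by_a, by_b = variables_by_method(log_a), variables_by_method(log_b)
    return {m for m in by_a.keys() & by_b.keys()
            if class_of[m] in ch_common and len(by_a[m]) != len(by_b[m])}


jimple_log = [("Main.main", v) for v in ("r0", "@parameter0", "r1", "r2")]
jimple_log += [("Main.getInstance", "r0")]
wala_log = [("Main.main", v) for v in ("v1", "v5", "v8")]
wala_log += [("Main.getInstance", "v2")]

class_of = {"Main.main": "Main", "Main.getInstance": "Main"}
M = interesting_methods(jimple_log, wala_log, class_of, {"Main"})
```

Only `Main.main` is selected here, since its variable count differs (four versus three) while that of `Main.getInstance` does not.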
A simple average of the heap objects and number of variables is insufficient for comparing the precision of the analysis between two IRs. Differences in class hierarchies and aliasing generate new variables, which makes the ratio incomparable if the heap objects are not the same. To circumvent this problem, we combine the average points-to set size with ideas from virtual call resolution. The number of virtual call sites in a program is identical irrespective of the differences in program representation (caused by aliasing and redundant variables). Therefore, we obtain a fair comparison if we restrict the average points-to set size to the target variables of virtual method calls. We define a new metric, average devirtualized heap objects (H f v), which is the ratio of the total size of the points-to sets of target variables at the virtual call sites to the number of virtual call sites.
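The ratio just defined can be written down directly. The Python sketch below assumes that virtual call sites are given as (call site, target variable) pairs and that points-to sets are stored in a dictionary; the function name and the two hand-written inputs, which mimic a Jimple-like and an SSA-like naming of the same program, are illustrative assumptions.

```python
def avg_devirtualized_heap_objects(virtual_calls, points_to):
    """Total points-to set size of the target variables at virtual call
    sites, divided by the number of virtual call sites."""
    if not virtual_calls:
        return 0.0
    total = sum(len(points_to.get(target, set()))
                for _, target in virtual_calls)
    return total / len(virtual_calls)


# Same two call sites, different variable names and variable counts.
calls_jimple = [("cs1", "r1"), ("cs2", "r2")]
pts_jimple = {"r0": {"h0"}, "@parameter0": {"h0"},
              "r1": {"h1"}, "r2": {"h1", "h2"}}

calls_wala = [("cs1", "v5"), ("cs2", "v8")]
pts_wala = {"v1": {"h0"}, "v5": {"h1"}, "v8": {"h1", "h2"}}

score_jimple = avg_devirtualized_heap_objects(calls_jimple, pts_jimple)
score_wala = avg_devirtualized_heap_objects(calls_wala, pts_wala)
```

Both inputs yield the same score of 1.5 although their plain averages differ (5/4 versus 4/3), because the extra parameter variable in the Jimple-like input is not the target of any virtual call.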

RQ1: Class hierarchy differences with benchmarks
We captured the class hierarchies considered by the analyses to determine the differences. We instrumented Doop to log the classes considered during a (context-insensitive) analysis, which yields the complete class hierarchy. In order to investigate whether the class hierarchy depends on the frontend, we performed this experiment with Soot and Wala as frontends. Note that Soot and Wala provide options to exclude certain classes from analysis (e.g., to exclude library classes). For a fair comparison we ignore this feature and compute the whole class hierarchy including libraries. Table 1 lists the differences in the class hierarchies using Soot and Wala. On average, Wala exclusively contains ~13,994 classes in its class hierarchy. The number of classes exclusive to Wala ranges from 12,524 (Xalan) to 16,707 (Tradebeans). Soot's class hierarchy on average contains 26 classes not present in Wala's, ranging from zero to 62.
In the case of PMD and H3, Soot's class hierarchy contains only a single additional class; Jython has 2 additional classes. Eclipse, Lusearch, and Luindex contain 62, 53, and 53 additional classes, respectively. In the remaining cases, the class hierarchy in Soot is strictly a subset of Wala's. In the next RQ, we will study the impact of these additional classes on the precision and scalability of the analysis.

RQ2: Precision differences with class hierarchy
Study Setup We have used the var-points-to relation, which maps all variable and context pairs to their resolved pairs of heap object and context. We select those variables that originate from classes common to both frameworks (Section 4.1) and query their points-to information. We then compute the CH-precision based on Definition 1.
Results Table 2 presents the results of the analysis (for one-callsite, one-object, and two-object context-sensitivity) for the objects and variables belonging to exclusive classes present in Wala (only non-zero values included). Note that the two-object sensitive analysis did not terminate for Eclipse and Jython; therefore, these are not presented in the table. For the one-callsite and one-object analyses, Table 2 shows that six out of eleven benchmarks contain variables that belong to the exclusive class hierarchy. The remaining benchmark applications show no differences in the number of variables and heap objects, despite the presence of additional classes. This demonstrates that the additional classes loaded by these frameworks have no influence on the precision of these benchmarks. The third and fourth columns of Table 2 list the number of variables (in principle, variable-context pairs) and heap objects belonging to the set of exclusive classes, respectively. In all analyses, all but one benchmark has a higher average points-to set size for exclusive variables than the general average. Tradebeans only creates 3 additional heap objects with Wala's frontend; therefore, the analyses are almost identical for both frontends. The average points-to sets for exclusive classes for bigger benchmarks such as Eclipse and Jython are outliers, showing very high averages. Still, the contribution of exclusive classes' heap objects and variables is negligible compared with the total heap objects of these benchmarks.
The eighth and ninth columns depict the CH-precision and the original precision for the analyses. We observe that the CH-precision is slightly lower than the precision for all benchmarks but tradebeans, which originates from the addi-

Table 2: Differences in precision in the presence of additional objects in class hierarchy (Wala). HO denotes the sum of number of heap objects in points-to set. CP_wala is the precision score for variables in CH_common.

(Table 3), the CH-precision differs from the precision only for the benchmark Eclipse; for the other benchmarks, the analysis does not contain any objects whose type belongs to the exclusive classes of the frontend. However, it is difficult to compare the precision of Soot vs. Wala on the CH-precision score due to differing variable numbers for the same benchmark application.

Finding 1 : Differences in class-hierarchy negligibly impact the pointer analysis precision (and thus client analyses).
Soundness In our observation, the Wala frontend takes the internal Java libraries into account. We find heap objects belonging to libraries such as sun.nio.fs, sun.util.resources, sun.security, and sun.nio.cs, which are internal libraries used by the JVM. Soot, on the other hand, does not model these libraries for analysis.
Comparing the class hierarchies of the analyses using Soot and Wala, we observed that the class hierarchy using Soot as frontend is a subset of Wala's for all However, results from the analyses with Wala contain heap objects from the internal libraries such as sun.util.*, which are not present using Soot. It shows that the class hierarchy model is unsound in both frontends, as both lack some of the classes loaded by these benchmark applications at runtime.
Our study reveals that library modeling in both Soot and Wala is unsound even for (non-native) Java objects, shown by the presence of heap-objects belonging to the exclusive classes of Soot and Wala.

RQ3: Precision for IR varies with the framework
Study Setup The study setup is similar to Section 4.2. We use the application's var-points-to sets, i.e., the relation of variables and heap objects excluding the library objects. From the results of the three analysis sensitivities, we extract the set of interesting methods (M, Def. 3) and compute the average devirtualized heap objects score for the virtual calls in interesting methods. We use the Jimple IR (--no-ssa option in Doop), and Wala's IR (--wala-fact-gen option in Doop) for evaluation.
Results Table 4 reports the number of interesting methods and total methods resolved using both frontends. Note that the number of interesting methods is identical for both frameworks for the same type of context-sensitivity. The number of reachable methods in each analysis differs, just as the number of distinct method signatures discovered in each framework (columns Soot, Wala in 1-CS, 1-OS, 2-OS). However, deriving a relationship between those is impossible, as analyses such as one-call-site and one-object are not comparable. In all cases, we observed that the majority (~90%) of the methods are interesting. Therefore, we cannot ignore the significance of this aspect.
Interesting methods are difficult to ignore because of their sheer presence in the benchmark applications. Table 5 presents the differences in the average devirtualized heap objects for Jimple and Wala IR. Although the number of variables and abstract heap locations are dependent on the IR, we did not observe many differences between those when restricting ourselves to target variables of virtual method calls, which corresponds to our intuition. The differences in the H f v values for both IRs Overall, the values from Soot's IR were smaller than those of Wala's, implying that devirtualization in Soot is either slightly more precise or slightly less sound than in Wala; however, the differences are minor in the majority of the cases. In conclusion, the choice of IR shows little to no impact on the precision of pointer analysis. In the sequel, we describe one such case study where the difference in H f v is approximately two, which is a significant figure as compared to others.

Finding 2 : IR has negligible impact on the precision of pointer analysis, at least for the devirtualization client.
Case Study-Xalan To further investigate the differences, we chose Xalan using a one-call-site analysis as the H f v values for Soot (7.45) and Wala (9.44) display the highest difference among all benchmarks. The number of heap objects in both cases differs significantly, with Soot having 43K heap objects, and Wala having 55K heap objects for a comparable number of virtual calls (5,832 vs. 5,850).
To examine the heap objects, we collected their class types. We observed that the types of some of these objects belong to the classes in CH_soot \ CH_common or CH_wala \ CH_common. Listing 1.5 depicts the differences in heap objects created by these frameworks.
We also discovered (potential) sources of imprecision and unsoundness in both analyses. Table 6 lists methods and exceptions missed by both the Soot and Wala frameworks. Note that these methods and exceptions belong to the common class hierarchy. We observed that Wala has more precise exception modeling than Soot. For other virtual method invocations, we compared the runtime call-graph to the static call-graph. In our observation, both Wala and Soot are unsound, as demonstrated by the absence of certain method calls in the call-graph for both analyses. In addition, Wala imprecisely includes xerces.xml.dtd.XMLDTDLoader() into its call-graph (which at least in our experiments was not executed at runtime).
Apart from reflection, imprecise or unsound virtual call resolution also induces imprecision or unsoundness in the analysis. In what follows, we present the results of our study. We first present the differences in the number of heap objects and, subsequently, delve into their implications.

RQ4: Heap abstractions in pointer analysis frameworks
Differences in the heap objects For evaluation, we extracted the heap objects created in Wala's and Doop's analyses and observed huge differences in their numbers. Intuitively, using the same level of heap sensitivity (heap cloning) should create the same number of heap objects. However, in certain cases, the number of heap objects in Wala is larger than that in Doop by a factor of about 14 (columns 2 and 3 in Table 7). (Note that eclipse and jython are elided, as the analyses did not terminate within the time budget owing to the large file size (~100GB).) Therefore, the heap abstractions of these analyses are not comparable, although superficially they look similar.
Subtle optimizations also lead to differing and imprecise heap models, even though the abstractions look similar at the outset.
To investigate this further, we compared the types of the heap objects. Our study shows that the sets of types are not consistent even when using the same frontend! In many cases, the number of object types analyzed by Wala is approximately four times that in Doop (columns 4 and 5 in Table 7). The differences in heap abstraction for application-level objects are the reason for this.
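To illustrate how two analyses with nominally the same heap sensitivity can report very different numbers of heap objects, consider the Python sketch below. It models an abstract heap object as an (allocation site, heap context) pair and applies an optional optimization that merges all objects of selected types into one abstract object. The merging rule, the function name, and the example data are hypothetical; they are not a description of what Wala or Doop actually implement.

```python
def count_heap_objects(allocations, merge_types=frozenset()):
    """Count abstract heap objects for (site, type, context) triples.

    Without merging, each distinct (site, context) pair is one abstract
    object (heap cloning). Types listed in merge_types are collapsed into
    a single abstract object per type (hypothetical optimization).
    """
    objects = set()
    for site, typ, ctx in allocations:
        if typ in merge_types:
            objects.add(("<merged>", typ))
        else:
            objects.add((site, ctx))
    return len(objects)


# Hypothetical allocations: (allocation site, allocated type, heap context).
allocs = [
    ("s1", "StringBuilder", "c1"),
    ("s1", "StringBuilder", "c2"),
    ("s2", "StringBuilder", "c1"),
    ("s3", "Foo", "c1"),
]

plain = count_heap_objects(allocs)
merged = count_heap_objects(allocs, merge_types={"StringBuilder"})
print(plain, merged)
```

Both configurations use the same context depth, yet the merged variant reports half as many heap objects in this toy input, which is the kind of gap that makes raw heap object counts hard to compare across frameworks.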
Application level objects Application-level objects are the heap objects created by allocations within the program (rather than in libraries). In three out of eleven benchmarks, we observe that Doop's analysis lacks application-level classes that Wala reports. We found the corresponding allocations on manual inspection of the source code. For example, in avrora, the analysis in Wala allocates heap objects of BRNE_builder [8], which are not present in Doop's. Similar cases can be found in PMD and Xalan. However, owing to the limitations of the program representation, we could not determine the precise reason for the unsoundness. Pointer analysis uses an IR based on a control flow graph (CFG) rather than source code. Being a lower-level representation of the program source code, the IR mangles variable names. Therefore, establishing a one-to-one correspondence between the IR's variables and the variables in the source code is not trivial.
Finding 3: Heap modeling is not similar even for allocations within the application scope. Wala handles application-level objects more precisely than Soot in our evaluation.

Threats to Validity
Naturally, the technique used relies on the precise handling of reflection calls and other dynamic features of the language, such as dynamic proxies. Other than that, handling of native calls could alleviate the unsoundness of the analyses: analyzing native calls could infer the native objects in the JVM that are missed by the Soot framework. Here, we have used the TamiFlex framework for handling reflection calls. Other approaches have improved reflection handling [10,15,16,17,18,25]. To convince ourselves, we experimented with one of the state-of-the-art techniques, i.e., reflection with matching substring resolution [10]. However, we did not find any significant differences in the results. Another limitation of this study is the unsoundness from ignoring native library calls in static analyses. A few of the sources of unsoundness we discovered stem from native calls. Recently, Fourtounis et al. [7] proposed a technique for resolving native calls in Java. However, at the time of writing this paper, the technique was not available. Further, our analysis in Section 4.3 is based on test cases, which may not reflect all possible executions of an application.
Our study also involves hours of manual evaluation, which can be subject to bias. To counteract it, we manually inspected the source code, especially for the sources of unsoundness. We also reran the benchmark applications with valid inputs to confirm that the objects are actually allocated at runtime.

Related Work
Pointer analysis tools Pointer analysis has garnered significant interest in the last decades, focussing on scalability, precision, and soundness. The Doop system used in this paper results from years of research on declarative-style pointer analysis [1,3,10,24,26]. Similarly, the Wala framework is the result of an industrial project and, unlike Doop, follows an imperative paradigm. The underlying program representation of each framework comes with many of the assumptions mentioned earlier. In this work, we study the effects of these assumptions on program analysis.
Empirical studies on pointer analysis Recent empirical studies focussed on the soundness limitations that dynamic language features cause in existing pointer analyses and call-graph construction; the two are closely related static analyses and are mutually dependent. Dietrich et al. [5] proposed automated and manual techniques to generate unsoundness oracles to test static analysis. Sui et al. [32] present the causes of unsoundness in static analysis frameworks (Soot, Wala, and Doop) due to the dynamic features of languages. Reif et al. [21] conducted a comprehensive study of call-graph generation algorithms, focussed on features in Java 9, and exposed problems in the state of the art, especially those related to method calls in the Java runtime. Our work is orthogonal: we evaluate the influence of program representation in static analysis frameworks on program analyses, along with the unsoundness arising out of it. Our study is also extensible to Java 9.
Sui et al. [33] evaluated the recall of call-graph construction and present how it impacts the algorithms in practice. Their evaluation exposes problems in the state of the art, especially those related to method calls in the Java runtime. Our unsoundness results concur with theirs. Here, we have focussed on program representation rather than the dynamic features of the language, which are hard for static analyzers to analyze. Further, our work features two novel metrics, apart from the standard precision and recall, to measure the impact of different aspects of program representation.

Conclusion
This paper reports the effects of program representation on program analysis. Our metrics make it possible to compare implementations leveraging different frontends. We find that differences in program representation have negligible impact on the precision of the pointer analysis. In addition, we discovered novel sources of unsoundness and imprecision in the program analysis. Our results also demonstrate that nominally identical heap abstractions are not similar in practice, even though they may appear so from a bird's-eye view. Since pointer analysis builds the foundation of many static analyses, we conjecture that the results generalize to these as well.