We evaluated the \(\Downarrow _{{ SRA}}\) model in two regards: (1) the feasibility of replacing existing manual models (RQ1 and RQ2) and (2) the effects of our heuristic H on analysis soundness (RQ3). The research questions are as follows:
- RQ1 (Analysis performance of \(\Downarrow _{{ SRA}}\)): Can \(\Downarrow _{{ SRA}}\) replace existing manual models for program analysis with decent performance in terms of soundness, precision, and runtime overhead?
- RQ2 (Applicability of \(\Downarrow _{{ SRA}}\)): Is \(\Downarrow _{{ SRA}}\) broadly applicable to various builtin functions of JavaScript?
- RQ3 (Dependence on heuristic H): How much is the performance of \(\Downarrow _{{ SRA}}\) affected by the heuristics?
After describing the experimental setup for evaluation, we present our answers to the research questions with quantitative results, and discuss the limitations of our evaluation.
5.1 Experimental Setup
To evaluate the \(\Downarrow _{{ SRA}}\) model, we compared its analysis performance and applicability with those of the existing manual models in SAFE. We used two kinds of subjects: browser benchmark programs and builtin functions. From the 34 browser benchmarks included in the test suite of SAFE, a subset of V8 Octane, we collected the 13 that invoke opaque code. Since browser benchmark programs use only a small number of opaque functions, we also generated test cases for 134 functions in the ECMAScript 5.1 specification.
Each test case contains abstract values that represent two or more possible values. Because SAFE uses a finite number of abstract domains for primitive values, we used all of them in the test cases. We also generated 10 abstract objects. Five of them were manually created to represent arbitrary objects:
- OBJ1 has an arbitrary property whose value is an arbitrary primitive.
- OBJ2 is a property descriptor whose "value" is an arbitrary primitive and whose other attributes are arbitrary booleans.
- OBJ3 has an arbitrary property whose value is OBJ2.
- OBJ4 is an empty array whose "length" is arbitrary.
- OBJ5 is an arbitrary-length array with an arbitrary property.
The other five objects were collected from the SunSpider benchmark programs by using Jalangi2 [20] to represent frequently used abstract objects. We counted the number of function calls with object arguments and joined the most frequently used object arguments in each program. Out of 10 programs that have function calls with object arguments, we discarded four programs that use the same objects for every function call and one program that uses an argument with 2500 properties, which makes manual inspection infeasible. We joined the first 10 concrete objects for each argument of the following benchmarks to obtain abstract objects: 3d-cube.js, 3d-raytrace.js, access-binary-trees.js, regexp-dna.js, and string-fasta.js. For the 134 test functions, when a function takes two or more arguments, we restricted each argument to its expected type to keep the number of test cases manageable. Also, for functions with a variable number of arguments, we used one argument or the minimum number of arguments.
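To make this joining step concrete, the following self-contained Scala sketch shows how a handful of concrete object arguments could be joined, property by property, into a single abstract object; the lattice elements, type names, and abstraction function (AbsValue, ConcObj, alpha, and so on) are simplified placeholders of our own, not SAFE's actual domains or API.

object JoinSketch {
  // Hypothetical abstract values for primitives (a simplified stand-in
  // for the SAFE primitive domains mentioned in the text).
  sealed trait AbsValue
  case object UInt     extends AbsValue // non-negative integers
  case object NUInt    extends AbsValue // numbers that are not non-negative integers
  case object NumTop   extends AbsValue // any number
  case object OtherStr extends AbsValue // strings without a more precise abstraction
  case object TopPrim  extends AbsValue // any primitive

  // A concrete JavaScript object is modeled as a map from property names to
  // primitives, and an abstract object as a map to abstract values.
  type ConcObj = Map[String, Any]
  type AbsObj  = Map[String, AbsValue]

  // Abstraction of a single concrete primitive (simplified).
  def alpha(v: Any): AbsValue = v match {
    case n: Int if n >= 0                    => UInt
    case d: Double if d >= 0 && d == d.floor => UInt
    case _: Int | _: Double                  => NUInt
    case _: String                           => OtherStr
    case _                                   => TopPrim
  }

  // Join of two abstract values in this toy lattice.
  def join(a: AbsValue, b: AbsValue): AbsValue =
    if (a == b) a
    else if (Set[AbsValue](UInt, NUInt, NumTop)(a) && Set[AbsValue](UInt, NUInt, NumTop)(b)) NumTop
    else TopPrim

  // Property-wise join of the sampled concrete objects into one abstract object.
  def joinObjects(samples: Seq[ConcObj]): AbsObj =
    samples.flatMap(_.toSeq).groupBy(_._1).map {
      case (prop, bindings) => prop -> bindings.map(b => alpha(b._2)).reduce(join)
    }
}

Under such a toy lattice, joining the first 10 concrete objects observed for one argument yields a single abstract argument object of the kind used alongside OBJ1 to OBJ5 in the test cases.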
In summary, we used 13 programs for RQ1, and 134 functions with 1565 test cases for RQ2 and RQ3. All experiments were performed on a machine with a 2.9 GHz quad-core Intel Core i7 and 16 GB of memory.
5.2 Answers to Research Questions
Answer to RQ1. We compared the precision, soundness, and analysis time of the SAFE manual models and the \(\Downarrow _{{ SRA}}\) model. Table 1 shows the precision and soundness for each opaque function call, and Table 2 presents the analysis time and the number of samples for each program.
As for precision, Table 1 shows that \(\Downarrow _{{ SRA}}\) produced more precise results than the manual models for 9 (19.6%) cases. We manually checked whether each result of a model is sound by using the partial order function (\(\sqsubseteq \)) implemented in SAFE. We found that all the results of the SAFE manual models for the benchmarks were sound. The \(\Downarrow _{{ SRA}}\) model produced an unsound result for only one function: Math.random. While Math.random returns a floating-point value in the range [0, 1), \(\Downarrow _{{ SRA}}\) modeled its result as NUInt instead of the expected Number, because the sampled values missed 0.
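The following Scala sketch illustrates this failure mode; the abstract number elements, the abstraction function, and the use of scala.util.Random in place of JavaScript's Math.random are our own simplifications rather than the paper's implementation.

object RandomSamplingSketch {
  // Hypothetical abstract number elements (simplified).
  sealed trait AbsNum
  case object UInt   extends AbsNum // non-negative integers, including 0
  case object NUInt  extends AbsNum // numbers that are not non-negative integers
  case object NumTop extends AbsNum // any number: the sound model for Math.random

  def alpha(d: Double): AbsNum =
    if (d >= 0 && d == d.floor) UInt else NUInt

  // Abstract the observed return values: a single abstract element if all
  // samples agree, otherwise their join (NumTop in this two-element sketch).
  def abstractSamples(samples: Seq[Double]): AbsNum =
    samples.map(alpha).distinct match {
      case Seq(single) => single
      case _           => NumTop
    }

  def main(args: Array[String]): Unit = {
    // Finitely many draws from [0, 1) essentially never include 0 itself,
    // so the samples abstract to NUInt even though 0 (hence UInt) is possible.
    val samples = Seq.fill(1000)(scala.util.Random.nextDouble())
    println(abstractSamples(samples)) // almost surely prints NUInt
  }
}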
As shown in Table 2, \(\Downarrow _{{ SRA}}\) took on average 1.35 times more analysis time than the SAFE models. The table also shows the number of context-sensitive opaque function calls during analysis (#Call), the maximum number of samples (#Max), and the total number of samples (#Total). To better understand the runtime overhead, we measured the proportion of elapsed time for each step: on average, \({ Sample}\) took 59%, \({ Run}\) 7%, \({ Abstract}\) 17%, and the remaining analysis 17%. The experimental results show that \(\Downarrow _{{ SRA}}\) provides high precision with modest runtime overhead while slightly sacrificing soundness.
Answer to RQ2. Because the benchmark programs use only 15 opaque functions, as shown in Table 1, we generated abstracted arguments for 134 of the 169 functions in the ECMAScript 5.1 builtin library for which SAFE has manual models. We semi-automatically checked the soundness and precision of the \(\Downarrow _{{ SRA}}\) model by comparing the analysis results with their expected results. Table 3 shows the results in terms of test cases (left half) and functions (right half). The Equal column shows the number of test cases or functions for which both models provide equal, sound results. The SRA Pre. column shows the number of cases where the \(\Downarrow _{{ SRA}}\) model provides sound and more precise results than the manual model. The Man. Uns. column presents the number of cases where \(\Downarrow _{{ SRA}}\) provides sound results but the manual model provides unsound results, and SRA Uns. shows the opposite. Finally, Not Comp. shows the number of cases where the results of \(\Downarrow _{{ SRA}}\) and the manual model are incomparable.
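As an illustration of how such a categorization could be computed, the following Scala sketch classifies a single test case from the two models' results and the expected result, assuming a decidable partial order leq that stands in for SAFE's \(\sqsubseteq \); the type and function names are hypothetical.

object ClassifySketch {
  sealed trait Outcome
  case object Equal   extends Outcome // both sound and identical
  case object SraPre  extends Outcome // SRA sound and strictly more precise than manual
  case object ManUns  extends Outcome // SRA sound, manual unsound
  case object SraUns  extends Outcome // manual sound, SRA unsound
  case object NotComp extends Outcome // the two results are incomparable

  // `leq(a, b)` is assumed to decide a ⊑ b in the abstract domain;
  // a result is sound if it over-approximates the expected result.
  def classify[A](sra: A, manual: A, expected: A)(leq: (A, A) => Boolean): Option[Outcome] = {
    val sraSound = leq(expected, sra)
    val manSound = leq(expected, manual)
    if (sra == manual && sraSound && manSound)              Some(Equal)
    else if (sraSound && sra != manual && leq(sra, manual)) Some(SraPre)
    else if (sraSound && !manSound)                         Some(ManUns)
    else if (!sraSound && manSound)                         Some(SraUns)
    else if (!leq(sra, manual) && !leq(manual, sra))        Some(NotComp)
    else None // remaining combinations are not distinguished in this sketch
  }
}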
Table 1. Precision and soundness by functions in the benchmarks
Table 2. Analysis time overhead by programs in the benchmarks
Table 3. Precision and soundness for the builtin functions
Table 4. Soundness and sampling cost for the builtin functions
The \(\Downarrow _{{ SRA}}\) model produced sound results for 99.4% of the test cases and 94.0% of the functions. Moreover, \(\Downarrow _{{ SRA}}\) produced more precise results than the manual models for 33.7% of the test cases and 50.0% of the functions. Although \(\Downarrow _{{ SRA}}\) produced unsound results for 0.6% of the test cases and 6.0% of the functions, we also found soundness bugs in the manual models for 1.3% of the test cases and 7.5% of the functions. Our experiments showed that the automatic \(\Downarrow _{{ SRA}}\) model produced fewer unsound results than the manual models. We reported the manual models that produce unsound results to the SAFE developers, together with the concrete examples generated in the \({ Run}\) step that revealed the bugs.
Answer to RQ3. The sampling strategy plays an important role in the performance of \(\Downarrow _{{ SRA}}\), especially for soundness. Our sampling strategy depends on two factors: (1) manually sampled sets via the heuristic H and (2) each-used or pair-wise selection of object samples. We used manually sampled sets for three abstract values: UInt, NUInt, and OtherStr. To sample concrete values from them, we used three methods: Base simply follows the guidelines described in Sect. 3.1, Random generates samples randomly, and Final denotes the heuristics determined by trial and error to reach the highest ratio of sound results. For object samples, we used three pair-wise options: HeapPair, ThisPair, and ArgPair. Table 4 summarizes, for various sampling configurations, the ratio of sound results and the average and maximum numbers of samples for the test cases used in RQ2.
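As a rough illustration of the difference between the two selection modes, the following self-contained Scala sketch groups object samples into independent slots (for example, the heap, the this value, and an argument): eachUsed varies one slot at a time, while pairWise additionally covers every pair of candidates drawn from two different slots. This is only our reading of the each-used and pair-wise options; the slot structure and function names are hypothetical, not the authors' implementation.

object SelectionSketch {
  type Sample = String // stands in for a sampled concrete object

  // Each candidate is used at least once: vary one slot at a time
  // while the other slots keep their first (default) candidate.
  def eachUsed(slots: Vector[Vector[Sample]]): Vector[Vector[Sample]] = {
    val defaults = slots.map(_.head)
    (for {
      (candidates, i) <- slots.zipWithIndex
      c <- candidates
    } yield defaults.updated(i, c)).distinct
  }

  // Every pair of candidates from two different slots occurs together
  // in at least one selection; the remaining slots keep their defaults.
  def pairWise(slots: Vector[Vector[Sample]]): Vector[Vector[Sample]] = {
    val defaults = slots.map(_.head)
    (for {
      i <- slots.indices.toVector
      j <- slots.indices.toVector if i < j
      a <- slots(i)
      b <- slots(j)
    } yield defaults.updated(i, a).updated(j, b)).distinct
  }

  def main(args: Array[String]): Unit = {
    val slots = Vector(
      Vector("heap0", "heap1"),          // heap samples
      Vector("this0", "this1"),          // this-value samples
      Vector("arg0a", "arg0b", "arg0c")  // argument samples
    )
    println(eachUsed(slots).size) // 5 distinct selections
    println(pairWise(slots).size) // 10 distinct selections
  }
}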
The table shows that Base and Random produced sound results for 85.0% and 84.9% (the worst case among 10 repetitions) of the test cases, respectively. Even without any sophisticated heuristics or pair-wise options, \(\Downarrow _{{ SRA}}\) achieved a decent ratio of sound results. Using the additional samples collected by trial and error with Final and all three pair-wise options, \(\Downarrow _{{ SRA}}\) observed more behaviors of opaque code and generated sound results for 99.4% of the test cases.
5.3 Limitations
A fundamental limitation of our approach is that the \(\Downarrow _{{ SRA}}\) model may produce unsound results when the behavior of opaque code depends on values that the sampling does not cover. For example, if a sampling strategy calls the Date function without sufficient time intervals, it may not be able to sample different results. Similarly, if a sampling strategy does not use 4-wise combinations for property descriptor objects, which have four components, it cannot produce all the possible combinations. At the same time, however, simply applying more complex strategies such as 4-wise combinations may lead to an explosion of samples, which does not scale.
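As a back-of-the-envelope Scala sketch of this trade-off, assume a hypothetical k candidate samples per descriptor component: exhaustive coverage of all four components needs \(k^4\) selections, whereas a pair-wise scheme of the kind sketched above needs at most \(\binom{4}{2} k^2 = 6k^2\) selections with the remaining components held at defaults.

object ExplosionSketch {
  // Number of selections for exhaustive coverage of all component combinations.
  def fullCoverage(k: Int, components: Int = 4): Long =
    BigInt(k).pow(components).toLong

  // Upper bound for the pair-wise scheme sketched above: one selection per
  // pair of components and per pair of candidate values, defaults elsewhere.
  def pairWiseBound(k: Int, components: Int = 4): Long = {
    val pairsOfComponents = components.toLong * (components - 1) / 2
    pairsOfComponents * k * k
  }

  def main(args: Array[String]): Unit = {
    println(fullCoverage(5))  // 625 selections for k = 5
    println(pairWiseBound(5)) // 150 selections for k = 5
  }
}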
Our experimental evaluation is inherently limited to a specific use case, which poses a threat to validity. While our approach itself is not dependent on a particular programming language or static analysis, the implementation of our approach depends on the abstract domains of SAFE. Although the experiments used well-known benchmark programs as analysis subjects, they may not be representative of all common uses of opaque functions in JavaScript applications.