Our goal with PhASAR is to ease the formulation of a data-flow analysis such that an analysis developer only needs to focus on implementing the problem description rather than on the details of how the problem is solved.
PhASAR achieves part of its generality through template parameters. These template parameters include, among others, N, D, and M, and are used consistently throughout PhASAR's implementation. N denotes the type of a node in the ICFG, i.e., typically an IR statement, D denotes the domain of the data-flow facts, and M is a placeholder for the type of a method/function. When analyzing LLVM IR, N is always of type const llvm::Instruction* and M is of type const llvm::Function*, whereas D depends on the specific data-flow analysis that the developer wants to encode. For our linear constant propagation example described in Sect. 3, const llvm::Value* could be used to capture the property of interest. LLVM's Value type is quite useful here as it is a super-type located high in the type hierarchy. This allows an analysis developer to use values of all of Value's subtypes in the data-flow domain, which makes the domain highly flexible.
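For illustration, the following sketch shows how these template parameters could be instantiated for an LLVM-based analysis; the alias names are chosen for this example only and are not prescribed by PhASAR.

#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Value.h"

// Illustrative aliases for the template parameters when analyzing LLVM IR.
using n_t = const llvm::Instruction *; // N: a node in the ICFG
using d_t = const llvm::Value *;       // D: the domain of data-flow facts
using m_t = const llvm::Function *;    // M: a method/function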
5.1 Encoding an IFDS Analysis
Listing 1.2 shows the interface for an IFDS problem. An analysis developer has to define a new type—the problem description—implementing the FlowFunctions interface.
The flow function factories shown in Listing 1.2 handle the different types of flows. The four factory functions each have an individual purpose:
- getNormalFlowFunction handles all intra-procedural flows.
- getCallFlowFunction handles inter-procedural flows at a call-site. Usually, the task of this flow function factory is to map the data-flow facts that hold at a given call-site into the callee method's scope.
- getRetFlowFunction handles inter-procedural flows at an exit statement (e.g., a return statement). It maps the callee's return value, as well as data-flow facts that may leave the function by reference or pointer parameters, back into the caller's context/scope.
- getCallToRetFlowFunction propagates all data-flow facts that are not involved in a call alongside the call-site, typically stack-local data not referenced by parameters.
These flow function factories are automatically queried by the solver, based on the inter-procedural control-flow graph.
The functions in Listing 1.2 are factories since they have to return small function objects of type FlowFunction, which is shown in Listing 1.3. As FlowFunction is itself an interface, an analysis developer has to provide a suitable implementation. The member function computeTargets() takes a value of a data-flow fact of type D and computes a set of new data-flow facts of the same type. It specifies how the bipartite graph that represents the flow function for the given statement is constructed and can be thought of as an answer to the question “What edges must be drawn?”.
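To make this concrete, consider the following minimal sketch, which assumes D = const llvm::Value* and the FlowFunction interface of Listing 1.3 as described above (the class names are illustrative only). It models a store instruction store v, ptr: if the incoming fact is the stored value v, an additional edge to the pointer operand is drawn; all other facts are propagated unchanged.

#include <set>
#include "llvm/IR/Instructions.h"

// Assumed shape of the FlowFunction interface from Listing 1.3 (sketch only).
template <typename D> struct FlowFunction {
  virtual ~FlowFunction() = default;
  virtual std::set<D> computeTargets(D Source) = 0;
};

// Flow function for 'store v, ptr' with D = const llvm::Value*.
struct StoreFlow : FlowFunction<const llvm::Value *> {
  const llvm::StoreInst *Store;
  explicit StoreFlow(const llvm::StoreInst *S) : Store(S) {}

  std::set<const llvm::Value *>
  computeTargets(const llvm::Value *Source) override {
    if (Source == Store->getValueOperand()) {
      // Draw the edges Source -> Source and Source -> pointer operand.
      return {Source, Store->getPointerOperand()};
    }
    // Otherwise, only draw the identity edge Source -> Source.
    return {Source};
  }
};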
As flow function implementations often follow certain patterns, we provide implementations for the most common patterns as template classes. Many useful flow functions like Gen, GenIf, Kill, KillAll, and Identity are already implemented and can be used directly. Any number of flow functions can easily be combined using our implementations of the Compose and Union flow functions. We also provide the MapFactsToCallee and MapFactsToCaller flow functions that automatically map parameters into a callee and back to a caller, since this behavior is frequently desired. Flow functions that are stateless, e.g., Identity or KillAll, are implemented as singletons.
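To illustrate two of these patterns, the sketch below re-implements a minimal Gen-style and an Identity-style flow function on top of the assumed FlowFunction interface from the previous sketch; these are illustrations of the patterns only, not PhASAR's own Gen and Identity classes, which can be used directly.

// Gen pattern: generate a new fact out of the tautological zero fact and
// keep every other fact unchanged (sketch only).
template <typename D> struct GenFact : FlowFunction<D> {
  D Fact, Zero;
  GenFact(D FactToGenerate, D ZeroFact) : Fact(FactToGenerate), Zero(ZeroFact) {}
  std::set<D> computeTargets(D Source) override {
    if (Source == Zero)
      return {Source, Fact};
    return {Source};
  }
};

// Identity pattern: every fact is mapped onto itself; being stateless, a
// single instance can be shared as a singleton.
template <typename D> struct IdentityFlow : FlowFunction<D> {
  std::set<D> computeTargets(D Source) override { return {Source}; }
};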
5.2 Encoding an IDE Analysis
If an analysis developer wishes to encode their problem within IDE, they additionally have to provide implementations for the edge functions. With the help of the edge functions, an analysis developer is able to specify a computation that is performed along the edges of the exploded super-graph leading to the queried node (cf. Fig. 1). The interface for the edge function factories and their responsibilities are analogous to the flow function factories in Listing 1.2.
Each edge function factory must return an edge function implementation: a small function object, similar to a flow function, that provides a computeTarget() function as well as a compose, a join, and an equality-check operation. The EdgeFunction interface is shown in Listing 1.4.
As this interface is more complex than the flow function interface, we explain the purpose of each function. The computeTarget() function describes a computation over the value domain V in terms of lambda calculus.
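For instance, for the linear constant propagation, the edge of a statement such as y = x + C can be annotated with the computation \(\lambda x.\, x + C\). A minimal sketch of the corresponding computeTarget() is shown below; the value-domain type lcp_v_t is an illustration and not PhASAR's own lattice implementation.

#include <cstdint>
#include <variant>

// Illustrative value domain V for linear constant propagation: Top ("no
// information yet"), Bottom ("not a constant"), or a concrete integer.
struct Top {};
struct Bottom {};
using lcp_v_t = std::variant<Top, Bottom, int64_t>;

// Edge function describing the computation "lambda x. x + C", e.g. for an
// edge belonging to the statement 'y = x + C'.
struct AddConstant {
  int64_t C;
  explicit AddConstant(int64_t Const) : C(Const) {}

  lcp_v_t computeTarget(lcp_v_t Source) const {
    if (const auto *Val = std::get_if<int64_t>(&Source))
      return *Val + C; // propagate the concrete constant
    return Source;     // Top stays Top, Bottom stays Bottom
  }
};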
The composeWith() function encodes how to compose two edge functions. In most scenarios, this function can be implemented as \((f \circ g)(x) = f(g(x))\). To avoid additional boilerplate code, we provide an EdgeFunctionComposer class that performs this job and can be used as a super class.
joinWith() encodes how to join two edge functions at statements where two control-flow edges lead to the same successor statement. Depending on whether a may- or a must-analysis is performed, implementations of this function typically check which edge function computes a value that is higher up in the lattice, i.e., a more approximate value, and return the corresponding edge function. For our linear constant propagation from Sect. 3, this function would return one of the edge functions if both describe the same value computation, the bottom edge function if both of them encode the \(\bot \) value, and the edge function encoding the top element otherwise. The intuition here is to always pick the element that is higher up in the lattice, as it safely over-approximates the information of both branches.
The equal_to() interface function has to be implemented to return true if both edge functions describe the same value computation, false otherwise.
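Continuing the AddConstant illustration from above, the remaining operations could behave as follows for linear constant propagation. The free functions are a simplification: PhASAR's actual EdgeFunction interface (Listing 1.4) works on shared_ptr-managed edge function objects, and the special all-top/all-bottom edge functions are only hinted at here.

#include <optional>

// composeWith: (f o g)(x) = f(g(x)); composing "x + C_f" with "x + C_g"
// yields another AddConstant, namely "x + (C_f + C_g)".
AddConstant compose(const AddConstant &F, const AddConstant &G) {
  return AddConstant(F.C + G.C);
}

// equal_to: both edge functions describe the same value computation iff
// they add the same constant.
bool equalTo(const AddConstant &F, const AddConstant &G) {
  return F.C == G.C;
}

// joinWith (sketch): same computation -> keep it; different computations ->
// fall back to the more approximate edge function, represented here by
// std::nullopt in place of PhASAR's all-top/all-bottom edge functions.
std::optional<AddConstant> join(const AddConstant &F, const AddConstant &G) {
  if (equalTo(F, G))
    return F;
  return std::nullopt;
}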
A complete implementation of the IDE linear constant propagation can be found along with PhASAR’s other examples at our website [23].
5.3 Encoding a Monotone Analysis
If an analysis developer wishes to encode a problem that does not satisfy the distributivity property, they have to make use of the monotone-framework implementation or its inter-procedural variant. The interface for specifying an inter-procedural monotone problem is shown in Listing 1.5. Similar to an IFDS/IDE problem, an analysis developer has to specify flow functions for intra- and inter-procedural flows. In contrast to IFDS/IDE, however, these flow functions do not operate on single, distributive data-flow facts, but on sets of data-flow facts. The solver calls the flow functions and provides the set of data-flow facts that hold right before the current statement; the flow function then computes and returns the set of data-flow facts that hold after the effects of the current statement. The join() function specifies how information is merged when two branches join at a common successor statement. This is typically implemented as set-union or set-intersection, depending on whether a may- or a must-analysis is to be solved; algorithms from C++'s STL may be used here. Finally, the sqSubSetEqual() function must be implemented to determine whether the amount of information has increased between two sets, in order to check whether a fixpoint has been reached. The context that is used for the inter-procedural analysis can be specified by the analysis developer through a template parameter. An analysis developer can use a pre-defined context class to parameterize the analysis as a call-strings or a value-based approach, or they can define their own context to be used.
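As a minimal sketch, for a may-analysis over a set domain the join() and sqSubSetEqual() functions could be realized with the STL as follows; reading sqSubSetEqual() as a subset check is one natural interpretation of the description above, and D again stands for the type of a data-flow fact.

#include <algorithm>
#include <iterator>
#include <set>

// join: merge the information of two branches; a may-analysis uses
// set-union (a must-analysis would use set-intersection instead).
template <typename D>
std::set<D> join(const std::set<D> &Lhs, const std::set<D> &Rhs) {
  std::set<D> Result;
  std::set_union(Lhs.begin(), Lhs.end(), Rhs.begin(), Rhs.end(),
                 std::inserter(Result, Result.begin()));
  return Result;
}

// sqSubSetEqual: Lhs is "less or equally informative" than Rhs if Rhs
// already contains every fact of Lhs, i.e., no new information was added
// and a fixpoint may have been reached.
template <typename D>
bool sqSubSetEqual(const std::set<D> &Lhs, const std::set<D> &Rhs) {
  return std::includes(Rhs.begin(), Rhs.end(), Lhs.begin(), Lhs.end());
}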
5.4 Handling of Intrinsic and Libc Function Calls
LLVM currently has approximately 130 intrinsic functions. These functions are used to describe semantics during analysis and optimization and do not have an actual implementation. Later on in the compiler pipeline, the back-end is free to replace a call to an intrinsic function with a software or a hardware implementation, if one exists for the target architecture. Introducing new intrinsic functions is preferred over introducing novel instructions to LLVM since, when introducing a new instruction, all optimizations, analyses, and tools built on top of LLVM have to be revisited to make them aware of the new instruction. A call to an intrinsic function can be handled as an ordinary function call.
The functions contained in the libc standard library represent special targets as well, since these functions are used by virtually all practical C and C++ programs. Moreover, the functions contained in the standard library cannot be analyzed themselves as they are mostly very thin wrappers around system calls and are often not available for the analysis. In many cases, however, it is not necessary to analyze these functions when performing a data-flow analysis. PhASAR therefore models all of them as the identity function. An analysis developer can change this default behavior and model different effects by using special summary functions. The SpecialSummaries class can be used to register flow and edge functions other than identity. This class is aware of all intrinsic and libc functions.
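For illustration, a summary for a memcpy-like routine could propagate whatever holds for the source buffer to the destination buffer as well. The sketch below reuses the assumed FlowFunction interface from the sketches in Sect. 5.1; the exact registration call on SpecialSummaries is omitted, since only the existence of that class is described here.

// Summary flow function for a memcpy-like call with D = const llvm::Value*:
// a fact holding for the source buffer also holds for the destination buffer
// after the call; all other facts pass through unchanged (sketch only).
struct MemcpySummary : FlowFunction<const llvm::Value *> {
  const llvm::Value *Dst, *Src;
  MemcpySummary(const llvm::Value *Destination, const llvm::Value *Source)
      : Dst(Destination), Src(Source) {}

  std::set<const llvm::Value *>
  computeTargets(const llvm::Value *Fact) override {
    if (Fact == Src)
      return {Fact, Dst};
    return {Fact};
  }
};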
5.5 A Note on Soundness
Livshits et al. have introduced the notion of soundy analyses [18]. Soundy analyses use sensible under-approximations to cope with certain language features that would otherwise make an analysis impractically imprecise. Analyses in PhASAR are currently soundy. For instance, PhASAR's ICFG misses one control-flow edge in the presence of setjmp()/longjmp(). Functions that are loaded dynamically from shared object libraries using dlsym() cannot be handled either. PhASAR's data-flow solvers treat calls to dynamically loaded libraries and to libraries for which function definitions are missing as identity, unless the analysis developer specifies otherwise. A sound handling would be to set all variables involved in such calls to \(\top \), which, again, may lead to large imprecision.