Symbolic execution formally explained

Abstract. In this paper, we provide a formal explanation of symbolic execution in terms of a symbolic transition system and prove its correctness and completeness with respect to an operational semantics which models the execution on concrete values. We first introduce a formal model for a basic programming language with a statically fixed number of programming variables. This model is extended to a programming language with recursive procedures which are called by a call-by-value parameter mechanism. Finally, we present a more general formal framework for proving the soundness and completeness of the symbolic execution of a basic object-oriented language which features dynamically allocated variables.


Introduction
Symbolic execution [Kin76] plays a crucial role in modern testing techniques, debugging, and automated program analysis. In particular, it is used for generating test cases [AAGR14, BCD+18]. Intuitively, its success is mainly due to the fact that one symbolic execution abstracts away a possibly infinite set of concrete executions that all follow a similar execution path.
Although symbolic execution techniques have improved enormously in the last few years, not much effort has been spent on their formal justification. In fact, the symbolic execution community has concentrated most of its effort on effectiveness (improvement in speed-up) and significance (improvement in code coverage), and has paid little attention to correctness so far [BCD+18].
Further, there exists a plethora of different techniques for one of the major problems in symbolic execution, namely the presence of dynamically allocated program variables, e.g., describing arrays and (object-oriented) pointer structures ("heaps"). For example, in [XMSN05] a heap is modeled as a graph, with nodes drawn from a set of objects and updated lazily, whereas [BDP15] introduces a constraint language for the specification of invariant properties of heap structures. In [DLR06] the symbolic state is extended with a heap configuration used to maintain objects which are initialized only when they are first accessed during execution. In the presence of aliasing, the uncertainty on the possible values of a symbolic pointer is treated either by forking the symbolic state or by refining the generated path condition into several ones [TS14]. In [KC19] the expensive forking is avoided by using a segmented memory model. Powerful symbolic execution tools [CDE08, CGP+08, EGL09] handling arrays exploit various code pre-processing techniques, though the formal correctness of the theory behind these tools is acknowledged as a potential problem that might limit the validity of the internal engine, and is validated only experimentally by testing [PMZC17]. The KeY theorem prover [ABB+16] supports symbolic execution of Java programs which is defined in terms of the underlying dynamic logic and which uses an explicit representation of the heap. None of the above work presents an explicit formal account of the underlying model of symbolic execution and its correctness.
The main contribution of this paper is a formal explanation of symbolic execution in terms of a symbolic transition system and a general definition of its correctness and completeness with respect to an operational semantics which models the actual execution on concrete values. It extends the work presented in [dBB19] first of all by detailed proofs of correctness and completeness. It further describes two different approaches to symbolic execution, focusing on the generation of so-called path conditions. These are symbolic conditions on the initial states which ensure the concrete execution of the program along a given path, i.e., a particular "flow" of control described by the program.
In this paper, following [dBB19], we first formalize the standard approach to symbolic execution which consists of generating a path condition on-the-fly by maintaining during the symbolic execution a symbolic representation of the concrete program state, i.e., the assignment of values to program variables. This approach gives rise to what is usually called forward symbolic execution.
In this paper, we further introduce a new, more fundamental approach to the symbolic generation of path conditions. This approach is based on the application of a weakest precondition calculus on symbolic execution traces which are generated by a static unfolding of the program. We describe this new approach, which we claim is applicable to any programming language, in terms of a basic object-oriented language.
In [dBB19] we briefly sketched how to extend the on-the-fly generation of a path condition to object-oriented languages. The forward symbolic execution of object-oriented programs however is complicated by the symbolic description of the concrete program state in terms of the unbounded number of heap (or navigation) expressions of the programming language. This requires a complicated treatment of aliasing between these expressions in the symbolic description of heap assignments. We show that this new approach overcomes the above complications of the standard forward symbolic execution of object-oriented languages.
In [SAB10] a forward symbolic execution model is introduced as an extension of the operational semantics of a low-level, assembly-like language which adds the accumulated path condition to the configuration. The operational semantics itself is extended by the inclusion of symbolic values, that is, partially evaluated expressions, in the set of concrete values. In [SAB10] no formal justification of this hybrid approach is given; in fact, this hybrid approach does not allow for a formal statement of a (simulation) relation between the concrete and the symbolic semantics. The only other approach to a formal modeling and justification of symbolic execution that we are aware of is the work presented in [LRA17]. A major difference with our approach is that in [LRA17] symbolic execution is defined in terms of a general logic (called "Reachability Logic") for the description of transition systems which abstracts from the specific characteristics of the programming language. A symbolic execution then consists basically of a sequence of logical specifications of the consecutive transitions. On the other hand, a model of the logic defines a concrete transition system. Thus correctness basically follows from the semantics of the logic. In our approach we model both symbolic execution and the concrete semantics (of any language) independently as transition systems. However, in both cases the transitions are directly defined in terms of the program to be executed. This allows us to address the specific characteristics of the programming language (like dynamically allocated variables) while remaining general. In [LRA17], however, these specific characteristics (like arrays) need to be imported into the general framework by corresponding logical theories which require an additional justification.
Detailed plan of the paper. In Section 2 we introduce a formal model of symbolic execution for a basic programming language with a statically fixed number of programming variables. A configuration of the symbolic transition system consists of the program statement to be executed, a substitution, and a path condition. Correctness then states that for every reachable symbolic configuration and state which satisfies the path condition, there exists a corresponding concrete execution. Conversely, completeness states that for every concrete execution there exists a corresponding symbolic configuration such that the initial state of the concrete execution satisfies the path condition and its final state can be obtained as a composition of the initial state and the generated substitution. In Subsection 2.1 we describe an extension of the basic theory with arrays symbolically modeled as mathematical functions.
In Section 3, we extend the basic theory of symbolic execution to a programming language with recursive procedures which are called by a call-by-value parameter mechanism. This extension requires a formal treatment of local variables stored on the stack of procedure calls.
In Section 4 we introduce a different, and more fundamental, approach to symbolic execution and its application to a basic object-oriented language. We conclude this section with a brief discussion of how to symbolically execute this language in the on-the-fly generation of path conditions, using the symbolic interpretation of fields as arrays.
In the final Section 5 we conclude with a brief discussion of how multi-threading and concurrent objects can be treated, showing the generality of our theory of symbolic execution.

Basic symbolic execution
We assume a set $\mathit{Var}$ of program variables $x, y, u, \ldots$, and a set $\mathit{Ops}$ of operations $op, \ldots$. We abstract from typing information, but we assume $\mathit{Ops}$ includes standard Boolean operators. The set $\mathit{Expr}$ of programming expressions $e$ is defined by the following grammar:
\[ e ::= x \mid op(e_1, \ldots, e_n) \]
where $x \in \mathit{Var}$ and $op \in \mathit{Ops}$. Expressions $e$ consist of program variables $x$ and operators $op$ applied to expressions (as a special case, we include values $v$ as nullary operators).
Statements $S$ of the basic programming language are then defined by a grammar that consists of (side-effect free) assignments $x := e$ and the usual control structures of sequential composition, choice, and iteration. In the latter two constructs $b$ denotes a Boolean expression. We assume associativity of the sequential composition operator, and use the "empty" statement $\epsilon$, e.g., $x := x$, which acts as the neutral element with respect to sequential composition. It will be used to denote termination. As a consequence, every statement is equivalent to a statement of the form $A; S$, where $A$ is either an assignment, a choice or an iteration construct. Finally, we work only with programs that are well-typed, so operators are recursively applied to the correct number of correctly typed sub-expressions.
A substitution $\sigma$ is a function $\mathit{Var} \to \mathit{Expr}$ which assigns to each variable an expression. By $e\sigma$ we denote the application of the substitution $\sigma$ to the expression $e$, defined inductively by
\[ x\sigma = \sigma(x), \qquad op(e_1, \ldots, e_n)\sigma = op(e_1\sigma, \ldots, e_n\sigma). \]
A symbolic configuration is a triple $\langle S, \sigma, \phi \rangle$ where $S$ denotes the statement to be executed, $\sigma$ denotes the current substitution, and the Boolean condition $\phi$ denotes the path condition. Next we describe a transition system for the symbolic execution of our basic programming language defined above.
Symbolic assignment
We illustrate the symbolic semantics by the following simple example of the symbolic execution of a while statement.
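A symbolic transition system of this kind can be sketched in a few lines of Python. The sketch below is an illustration under our own assumptions, not the transition rules as formally stated: the tuple encoding of expressions and statements, the names `subst` and `step`, and the use of a dict with missing keys for the identity substitution are all hypothetical choices. An assignment updates only the substitution, while choice and iteration extend the path condition with the guard (or its negation) after applying the current substitution.

```python
# Expressions: a variable is a str, a value is an int or bool (a nullary
# operator), and an operation is a tuple (op, e1, ..., en).

def subst(e, sigma):
    """Apply sigma to e; variables missing from the dict map to themselves."""
    if isinstance(e, str):
        return sigma.get(e, e)
    if isinstance(e, tuple):
        return (e[0],) + tuple(subst(arg, sigma) for arg in e[1:])
    return e  # a value

# A statement is a list of atomic statements (sequential composition):
#   ("assign", x, e), ("if", b, S1, S2), ("while", b, S)

def step(config):
    """Return the list of symbolic successors of (S, sigma, phi)."""
    stmts, sigma, phi = config
    if not stmts:                      # empty statement: terminated
        return []
    head, rest = stmts[0], stmts[1:]
    if head[0] == "assign":            # update the substitution only
        _, x, e = head
        return [(rest, {**sigma, x: subst(e, sigma)}, phi)]
    _, b, *branches = head
    guard = subst(b, sigma)            # the guard in terms of initial values
    if head[0] == "if":
        s1, s2 = branches
        pos, neg = s1 + rest, s2 + rest
    elif head[0] == "while":
        (body,) = branches
        pos, neg = body + [head] + rest, rest
    else:
        raise ValueError(f"unknown statement {head[0]!r}")
    return [(pos, sigma, ("and", phi, guard)),
            (neg, sigma, ("and", phi, ("not", guard)))]

# A loop that decrements x while x > 0, started with the identity
# substitution and the path condition true.
program = [("while", (">", "x", 0), [("assign", "x", ("-", "x", 1))])]
enter, leave = step((program, {}, True))
(after_assign,) = step(enter)
_, leave_after_one = step(after_assign)
print(leave_after_one[1])  # substitution after one iteration
print(leave_after_one[2])  # path condition of the one-iteration path
```

Each path through the loop yields its own substitution and path condition; the path that iterates once and then exits requires the initial value of x to be positive and x - 1 not to be.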
We formalize and prove correctness with respect to a concrete semantics. A valuation $V$ is a function $\mathit{Var} \to \mathit{Val}$, where $\mathit{Val}$ is a set of values (including the Boolean values true and false). By $V(e)$ we denote the value of the expression $e$ with respect to the valuation $V$, defined inductively by
\[ V(op(e_1, \ldots, e_n)) = op(V(e_1), \ldots, V(e_n)) \]
where $op : \mathit{Val}^n \to \mathit{Val}$ denotes the interpretation of the operation $op$ as provided by the implicitly assumed underlying model. For technical convenience we assume that the interpretations of the operations are total functions, thus avoiding errors such as those generated by division by zero, e.g., we stipulate that $x \;\mathrm{div}\; 0 = 0$ (using the infix notation). A valuation obtained as a composition $V \circ \sigma$ of a valuation $V$ and a substitution $\sigma$ is defined as usual: $(V \circ \sigma)(x) = V(\sigma(x))$. In the sequel we omit the parentheses and write $V \circ \sigma(e)$ for the application of a valuation $V \circ \sigma$ to the expression $e$ (as defined above). We have the following basic substitution lemma which states that evaluating an expression $e$ in a composition $V \circ \sigma$, as defined above, gives the same result as evaluating in $V$ the expression $e\sigma$ which results from first applying the substitution, i.e., $V \circ \sigma(e) = V(e\sigma)$.
The concrete semantics of our basic programming language is defined in terms of transitions $\langle S, V \rangle \to \langle S', V' \rangle$. The definition of this transition system is standard, where the valuation $V[x := v]$ denotes the state update. Such a valuation is defined by $V[x := v](y) = V(y)$ if $x$ and $y$ are syntactically distinct variables, and $V[x := v](x) = v$ otherwise. Note that because assignments $x := e$ are assumed side-effect free, the valuation $V[x := V(e)]$ affects only the value of the variable $x$. In other words, we can define the semantics of side-effect free assignments in terms of such updates because of the absence of aliasing, i.e., absence of two distinct variables $x$ and $y$ which intuitively refer to the same memory location.
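The substitution lemma and the state update can be checked on a small instance. The following Python sketch assumes a tuple encoding of expressions and a hypothetical table `OPS` of total interpretations (with division by zero yielding 0, as stipulated above); the names `evaluate`, `subst`, `compose` and `update` are ours. It is a sanity check on one example, not a proof.

```python
import operator

# Total interpretations of the operations; as stipulated in the text,
# division by zero yields 0.
OPS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    ">": operator.gt, "not": operator.not_,
    "and": lambda a, b: a and b,
    "div": lambda a, b: 0 if b == 0 else a // b,
}

def evaluate(e, V):
    """V(e): a variable is a str, an operation a tuple (op, e1, ..., en)."""
    if isinstance(e, str):
        return V[e]
    if isinstance(e, tuple):
        return OPS[e[0]](*(evaluate(arg, V) for arg in e[1:]))
    return e  # a value, i.e. a nullary operator

def subst(e, sigma):
    """e sigma; variables missing from the dict map to themselves."""
    if isinstance(e, str):
        return sigma.get(e, e)
    if isinstance(e, tuple):
        return (e[0],) + tuple(subst(arg, sigma) for arg in e[1:])
    return e

def compose(V, sigma):
    """The valuation V o sigma, with (V o sigma)(x) = V(sigma(x))."""
    return {x: evaluate(sigma.get(x, x), V) for x in V}

def update(V, x, v):
    """The state update V[x := v]."""
    return {**V, x: v}

V = {"x": 7, "y": 0}
sigma = {"x": ("+", "x", 1), "y": ("div", "x", "y")}
e = ("*", "x", ("+", "y", 2))

# Substitution lemma: evaluating e in V o sigma equals evaluating e sigma in V.
lemma_holds = evaluate(e, compose(V, sigma)) == evaluate(subst(e, sigma), V)

# Assignment x := e: updating sigma symbolically agrees with updating
# the concrete state V o sigma.
symbolic = compose(V, {**sigma, "x": subst(e, sigma)})
concrete = update(compose(V, sigma), "x", evaluate(e, compose(V, sigma)))
print(lemma_holds, symbolic == concrete)
```

The second comparison is the property that an assignment step needs: the symbolic update of the substitution and the concrete update of the state describe the same valuation.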

Concrete choice
From the above substitution lemma we derive the following corollary.

Corollary 2.2 (Soundness assignment). For
Proof. We treat the main case:
Let $\mathit{id}$ be the identity substitution, i.e., $\mathit{id}(x) = x$ for every variable $x$. We have the following main correctness theorem.

Theorem 2.3 (Correctness).
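If $\langle S, \mathit{id}, \mathit{true} \rangle \to^* \langle S', \sigma, \phi \rangle$ ¹ and $V(\phi) = \mathit{true}$ then $\langle S, V \rangle \to^* \langle S', V \circ \sigma \rangle$.
Proof. Induction on the length of $\langle S, \mathit{id}, \mathit{true} \rangle \to^* \langle S', \sigma, \phi \rangle$ and a case analysis of the last execution step. We consider the following cases.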
First, we consider the case of an assignment as the last execution step:
Induction hypothesis (note that $V(\phi) = \mathit{true}$):
By the concrete semantics we have
By the above Corollary 2.2 it then suffices to observe that
Next we consider the selection of the then-branch of a choice construct:
true, so by the induction hypothesis we obtain the concrete computation
All other cases are treated similarly. □
Theorem 2.3 guarantees that all possible inputs satisfying a path condition lead to a concrete state whose variables conform to the substitution of the corresponding symbolic configuration. Correctness, however, is about coverage [LRA17], meaning that satisfiable symbolic execution paths can be simulated by concrete executions. The converse of correctness is completeness and is about precision [LRA17]: every concrete execution can be simulated by a symbolic one.

Theorem 2.4 (Completeness).
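For any concrete computation $\langle S, V_0 \rangle \to^* \langle S', V \rangle$ there exists a symbolic computation $\langle S, \mathit{id}, \mathit{true} \rangle \to^* \langle S', \sigma, \phi \rangle$ for some path condition $\phi$ and a substitution $\sigma$ such that $V_0(\phi) = \mathit{true}$ and $V = V_0 \circ \sigma$.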
Proof. As above, the proof of this theorem proceeds by induction on the length of the concrete computation and a case analysis of the last concrete execution step. We consider the following cases.
First, we consider the case of an assignment as the last execution step:
By the induction hypothesis there exists a symbolic computation $\langle S, \mathit{id}, \mathit{true} \rangle \to^* \langle x := e; S', \sigma, \phi \rangle$ for some path condition $\phi$ and a substitution $\sigma$ such that $V_0(\phi) = \mathit{true}$ and $V = V_0 \circ \sigma$. By the symbolic semantics we have that
By the above Corollary 2.2 again it then suffices to observe that
Next we consider the case when the Boolean guard of a choice construct evaluates to true:
where $V(b) = \mathit{true}$. By the induction hypothesis there exists a symbolic computation
for some path condition $\phi$ and a substitution $\sigma$ such that $V_0(\phi) = \mathit{true}$ and $V = V_0 \circ \sigma$. By the symbolic semantics we have that
We can then conclude this case by again applying the above substitution lemma, from which we derive
All other cases are treated similarly. □
The above correctness and completeness theorems establish a correspondence between reachable symbolic and concrete states. It is straightforward to generalize these theorems to computations represented by sequences of (symbolic or concrete) states.
¹ For any transition relation $\to$, its reflexive, transitive closure is denoted by $\to^*$.
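The correspondence stated by the two theorems can be exercised on small programs by running both semantics side by side. The Python sketch below is a bounded sanity check under our own assumptions (tuple-encoded expressions and statements, hypothetical helper names `run_concrete`, `terminated_paths` and `agrees`); it does not replace the inductive proofs. For a given initial valuation it looks for the terminated symbolic path whose path condition holds and compares the state obtained from its substitution with the result of the concrete run.

```python
import operator

OPS = {"-": operator.sub, ">": operator.gt, "not": operator.not_,
       "and": lambda a, b: a and b}

def evaluate(e, V):
    if isinstance(e, str):
        return V[e]
    if isinstance(e, tuple):
        return OPS[e[0]](*(evaluate(arg, V) for arg in e[1:]))
    return e

def subst(e, sigma):
    if isinstance(e, str):
        return sigma.get(e, e)
    if isinstance(e, tuple):
        return (e[0],) + tuple(subst(arg, sigma) for arg in e[1:])
    return e

def run_concrete(stmts, V):
    """Concrete execution to termination; returns the final valuation."""
    V = dict(V)
    while stmts:
        head, rest = stmts[0], stmts[1:]
        if head[0] == "assign":
            V[head[1]] = evaluate(head[2], V)
            stmts = rest
        elif head[0] == "if":
            stmts = (head[2] if evaluate(head[1], V) else head[3]) + rest
        else:  # "while"
            stmts = head[2] + [head] + rest if evaluate(head[1], V) else rest
    return V

def terminated_paths(stmts, depth):
    """Pairs (sigma, phi) of symbolic paths terminating within depth steps."""
    results, frontier = [], [(stmts, {}, True)]
    for _ in range(depth + 1):
        successors = []
        for s, sigma, phi in frontier:
            if not s:
                results.append((sigma, phi))
                continue
            head, rest = s[0], s[1:]
            if head[0] == "assign":
                new_sigma = {**sigma, head[1]: subst(head[2], sigma)}
                successors.append((rest, new_sigma, phi))
                continue
            guard = subst(head[1], sigma)
            if head[0] == "if":
                pos, neg = head[2] + rest, head[3] + rest
            else:  # "while"
                pos, neg = head[2] + [head] + rest, rest
            successors.append((pos, sigma, ("and", phi, guard)))
            successors.append((neg, sigma, ("and", phi, ("not", guard))))
        frontier = successors
    return results

def agrees(stmts, V0, depth):
    """True if exactly one terminated path matches V0 and predicts the final state."""
    matching = [(s, p) for s, p in terminated_paths(stmts, depth) if evaluate(p, V0)]
    if len(matching) != 1:
        return False
    sigma, _ = matching[0]
    predicted = {x: evaluate(sigma.get(x, x), V0) for x in V0}
    return predicted == run_concrete(stmts, V0)

program = [("while", (">", "x", 0), [("assign", "x", ("-", "x", 1))])]
print(all(agrees(program, {"x": n}, depth=8) for n in range(4)))
```

The depth bound matters: a concrete run that needs more steps than the bound has no terminated symbolic counterpart within the explored prefix, so `agrees` then reports a mismatch rather than a counterexample to completeness.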

Extension to arrays
In this subsection we briefly discuss the symbolic execution of our basic programming language extended with arrays. For notational convenience we restrict ourselves to one-dimensional arrays. Following [AdBO09] and [Gri81] we view such arrays semantically as (mathematical) functions, i.e., an array variable has a type $T \to T'$, where the basic types $T$ and $T'$ denote the type of its domain and co-domain, respectively. Thus the domain of an array can be unbounded (below we discuss the extension of our theory to bounded arrays). Given an array variable $a$ of type $T \to T'$, the expression $a[e]$ of type $T'$ denotes the result of applying the function associated with $a$ to the value of $e$, where the expression $e$ is of type $T$. We extend the basic language with expressions $a[e]$ and assignments $a[e] := e'$, following the approach initially proposed in [Gri81]. A substitution $\sigma$ representing a concrete state then assigns to all the program variables a corresponding expression. For any array variable $a$, $\sigma(a)$ thus denotes an array expression. Given this symbolic interpretation of arrays as mathematical functions it is straightforward to extend the symbolic execution of our basic programming language to arrays, and generalize the above correctness and completeness theorems.
There are various ways to symbolically execute bounded arrays (see for example [FLP17]). One possible way to extend our approach to bounded arrays (of type $\mathbb{N} \to T$, for some $T$, where $\mathbb{N}$ denotes the type of the natural numbers) consists of adding the expression $|a|$ which denotes the length of the array $a$. The symbolic execution of (initially) setting the bound of an array, described by the statement $|a| := e$, then updates the path condition with the information $|a| = e\sigma$, where $\sigma$ is the current substitution. We describe the absence of an array-out-of-bound error by a predicate $\delta(e)$, defined inductively on the structure of $e$. We indicate the occurrence of an array-out-of-bound error by a statement array-out-of-bound. This statement can then be further evaluated in the context of error-handling constructs. The symbolic execution of such constructs is, however, out of the scope of this paper.
It is then straightforward to update the above symbolic transition system to account for array-out-of-bound errors. For example, for the symbolic execution of an assignment $x := e$ we have the following two transitions:
As another example, for the symbolic execution of the choice construct we have four transitions, of which we present the following two:
For an array assignment $a[e] := e'$ we have the following symbolic transitions.
Note that, as defined above, $\delta(a[e\sigma])$ equals $0 \le e\sigma \le |a| \wedge \delta(e\sigma)$. We do not need to apply the current substitution $\sigma$ to $a$ in the expression $|a|$ because the bound of an array is not affected by any of its assignments.
We conclude this discussion with the observation that alternatively we can include, instead of the array variables, the expressions $a[e]$ themselves in the domain of a substitution representing the concrete state. This however gives rise to aliasing between an unbounded number of such expressions: to execute symbolically an assignment $a[e] := e'$ we need to update in the given substitution all possible aliases of $a[e]$, namely all those expressions $a[e'']$ such that the value of $e''$ after the assignment $a[e] := e'$ equals that of $e$. We therefore recommend the above approach.
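The recommended treatment, in which the substitution maps an array variable to an array expression, can be sketched as follows. The constructors `"store"` and `"select"` and the helper `assign_array` are hypothetical names chosen for this illustration; they stand for a function update and a function application. Concrete arrays are modeled as Python dicts, with the additional assumption that unset cells hold 0.

```python
def subst(e, sigma):
    """e sigma; variables (including array variables) missing from sigma map to themselves."""
    if isinstance(e, str):
        return sigma.get(e, e)
    if isinstance(e, tuple):
        return (e[0],) + tuple(subst(arg, sigma) for arg in e[1:])
    return e

def evaluate(e, V):
    """Evaluate e in V; array values are dicts standing in for functions."""
    if isinstance(e, str):
        return V[e]
    if isinstance(e, tuple):
        op, *args = e
        vals = [evaluate(arg, V) for arg in args]
        if op == "select":             # a[i]: apply the function to the index
            arr, idx = vals
            return arr.get(idx, 0)     # assumption: unset cells hold 0
        if op == "store":              # the function a updated at index i
            arr, idx, val = vals
            return {**arr, idx: val}
        if op == "+":
            return vals[0] + vals[1]
        raise ValueError(f"unknown operation {op!r}")
    return e

def assign_array(sigma, a, e, e2):
    """Symbolic step for a[e] := e2: only the entry of a in sigma changes."""
    updated = ("store", subst(a, sigma), subst(e, sigma), subst(e2, sigma))
    return {**sigma, a: updated}

sigma = {}
sigma = assign_array(sigma, "a", "i", 1)      # a[i] := 1
sigma = assign_array(sigma, "a", "j", 2)      # a[j] := 2
read = subst(("select", "a", "i"), sigma)     # symbolic value of a[i]

print(evaluate(read, {"a": {}, "i": 0, "j": 0}))  # i and j alias
print(evaluate(read, {"a": {}, "i": 0, "j": 1}))  # i and j differ
```

No case split on whether i and j coincide is needed during symbolic execution: the question is deferred to the evaluation of the nested update expression, which is why this representation avoids the alias bookkeeping described above.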

Recursion
In this section we extend our basic programming language with procedures. We assume a finite set $\mathit{Var}$ of program variables $x, y, u, \ldots$ to be partitioned into global variables $\mathit{GVar}$ and local variables $\mathit{LVar}$, without name clashes between them. Global variables are visible within the entire program while local variables are used as formal parameters of the procedure declarations, and their scope lies within the procedure body itself. We denote by $\bar{x}$ a tuple of variables.
The set $\mathit{Expr}$ of programming expressions $e$ is defined as in the previous section, except that (Boolean) operators now involve both local and global variables. A program consists of a set of procedure declarations of the form $P(\bar{u}) :: S$ and a main statement $S$. Every procedure name $P$ has a unique declaration. For simplicity we assume here that $\bar{u}$ consists of all local variables $\mathit{LVar}$. Statements of the basic programming language with procedures are defined by extending the grammar of the previous section with procedure calls. Besides (side-effect free) assignments $x := e$ to global and local variables (here $x \in \mathit{Var}$), sequential composition, choice, and iteration, we have procedure calls $P(\bar{e})$, assuming a call-by-value parameter passing mechanism. Again, we consider only programs that are well-typed, meaning, among other things, that the length of $\bar{e}$ in the call $P(\bar{e})$ is the same as that of $\bar{u}$ in the declaration $P(\bar{u}) :: S$. For technical convenience we do not consider the introduction of local variables by block statements here, which could easily be modelled using procedures.
A symbolic configuration is of the form $\langle \Gamma, \sigma, \phi \rangle$, where
• $\Gamma$ denotes a stack of closures of the form $(\tau, S)$, where $\tau : \mathit{LVar} \to \mathit{Expr}$ is a local substitution (assigning expressions to local variables, including the formal parameters). The top of a stack is the leftmost closure, if it exists. In the sequel, we denote by $(\tau, S) \cdot \Gamma$ the result of pushing the closure $(\tau, S)$ onto the stack $\Gamma$,
• $\sigma : \mathit{GVar} \to \mathit{Expr}$ is the current global substitution assigning expressions to global variables,
• $\phi$ is a Boolean expression denoting the path condition.
A configuration $\langle (\mathit{id}, S), \mathit{id}, \phi \rangle$ is called initial if no local variable occurs in $S$, which in this case denotes the main statement.

F S de Boer and M Bonsangue
We denote by $\tau \cup \sigma : \mathit{Var} \to \mathit{Expr}$ the union of a local substitution $\tau$ and a global substitution $\sigma$ (defined in terms of their representations as sets of argument/value pairs). This is well-defined and total because of the absence of name clashes between the local and global variables of a program. Further, $e(\tau \cup \sigma)$ represents the expression $e$ in which all occurrences of local and global variables have been substituted as dictated by their respective substitutions. We have the following symbolic transitions.
Symbolic assignment global variable Let x be a global variable.
where $\theta = \tau \cup \sigma$. Note that expressions may contain both global and local variables, which is why we need $\theta$ in the updated global substitution $\sigma[x := e\theta]$. Also, the topmost local substitution $\tau$ in the stack is not affected by a global assignment because it is side-effect free.
Symbolic assignment local variable Let u be a local variable.
where $\theta = \tau \cup \sigma$. As in the case of an assignment to a global variable, expressions may contain any variables, which is why we need $\theta$ in the updated local substitution $\tau[u := e\theta]$. Only the topmost substitution in the stack is affected by a local assignment.
Symbolic procedure call Given a procedure declaration $P(\bar{u}) :: S$, we have
A procedure call thus pushes a new closure onto the stack. This closure consists of the body of the procedure and a new local substitution which assigns the actual parameters to the formal parameters, implementing a call-by-value parameter passing mechanism. This mechanism thus avoids name clashes between the formal parameters of different procedures because each procedure call is symbolically executed with respect to its own local substitution. Termination of a procedure call is indicated by the empty statement $\epsilon$, so that the return of a procedure call can be described simply by "popping" the top closure of the stack. Execution can then continue with the closure $(\tau, S)$ that generated the call. This is formalized by the following transition.

Symbolic procedure return
Recall that $\epsilon$ denotes the empty statement, so when there is nothing more to be currently executed, a pop operation on the stack of closures ensures that control goes back to the procedure caller, restoring the local substitution but continuing with the global one.
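The stack discipline of assignments, calls and returns can be sketched as follows. This Python fragment is an illustration under our own assumptions: closures are pairs of a dict and a statement list, `procs` maps a procedure name to its formal parameters and body, and `step` covers only assignments, calls and returns (choice and iteration are omitted). All names are hypothetical.

```python
def subst(e, theta):
    """e theta; variables missing from the dict map to themselves."""
    if isinstance(e, str):
        return theta.get(e, e)
    if isinstance(e, tuple):
        return (e[0],) + tuple(subst(arg, theta) for arg in e[1:])
    return e

def step(config, procs, global_vars):
    """One symbolic step on (stack, sigma, phi); the top of the stack is stack[0]."""
    stack, sigma, phi = config
    if not stack:
        return None                          # nothing left to execute
    (tau, stmts), below = stack[0], stack[1:]
    if not stmts:                            # empty statement: pop the closure
        return (below, sigma, phi)
    head, rest = stmts[0], stmts[1:]
    theta = {**tau, **sigma}                 # union of local and global substitution
    if head[0] == "assign":
        _, x, e = head
        if x in global_vars:                 # global assignment: sigma changes
            return ([(tau, rest)] + below, {**sigma, x: subst(e, theta)}, phi)
        new_tau = {**tau, x: subst(e, theta)}
        return ([(new_tau, rest)] + below, sigma, phi)
    if head[0] == "call":                    # push a closure for the body
        _, name, args = head
        params, body = procs[name]
        callee_tau = {u: subst(a, theta) for u, a in zip(params, args)}
        return ([(callee_tau, body), (tau, rest)] + below, sigma, phi)
    raise ValueError(f"unsupported statement {head[0]!r}")

# inc(u) :: u := u + 1; x := u     with global variable x
procs = {"inc": (["u"], [("assign", "u", ("+", "u", 1)), ("assign", "x", "u")])}
config = ([({}, [("call", "inc", [("+", "x", 2)])])], {}, True)
for _ in range(4):                           # call, two assignments, return
    config = step(config, procs, {"x"})
stack, sigma, phi = config
print(sigma)
```

After the return, the global substitution mentions only the global variable x, in line with the observation that local variables never leak into the substitutions of reachable configurations when the main statement contains none.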

Symbolic choice
Because the main statement does not contain local variables, local variables never appear in the global and local substitutions of any reachable configuration.
Proposition 3.1 For any computation $\langle (\mathit{id}, S), \mathit{id}, \mathit{true} \rangle \to^n \langle (\tau, S') \cdot \Gamma, \sigma, \phi \rangle$ of $n > 0$ steps, where $S$ denotes the main statement, we have that both $\tau(u)$, for any local variable $u$ in its domain, and $\sigma(x)$, for any global variable $x$, do not contain local variables.
Proof. We proceed by course-of-values induction on the length of the computation. For the case when $S$ begins with a global assignment $x := e$ it is enough to notice that $\theta = \mathit{id} \cup \mathit{id}$, and that $e$ does not contain local variables because they do not occur in the main statement $S$. Assignments to local variables cannot occur in $S$, so the only remaining interesting base case of our induction is when $S$ begins with a procedure call $P(\bar{e})$. In this case the new local substitution $\tau$ assigns the expressions $\bar{e}(\mathit{id} \cup \mathit{id}) = \bar{e}$ to the formal parameters (i.e., local variables) $\bar{u}$. But no expression in $\bar{e}$ contains local variables, so they will not occur in the local substitution in the closure at the new top of the stack.
All other base cases are trivially true. So next we consider the case of a global assignment after a computation of length $n > 0$:
where $\theta = \tau \cup \sigma$. By the induction hypothesis, both $\tau(u)$, for every local variable $u$, and $\sigma(x)$, for every global variable $x$, do not contain local variables. So $e\theta$ cannot contain local variables either. A similar reasoning can be applied to the cases of procedure call and local assignment.
For the case of procedure return we use the stronger hypothesis of our course-of-values induction: If $\langle (\mathit{id}, S), \mathit{id}, \mathit{true} \rangle \to^n \langle (\tau, \epsilon) \cdot \Gamma, \sigma, \phi \rangle \to \langle \Gamma, \sigma, \phi \rangle$ then $\sigma(x)$ does not contain local variables by the induction hypothesis. If $\Gamma$ is the empty stack then there is nothing more to prove, and otherwise $\Gamma = (\tau', S') \cdot \Gamma'$, which means that there must exist an intermediate configuration $\langle (\tau', S'') \cdot \Gamma', \sigma', \phi' \rangle$ such that
Since path conditions are constructed by local and global substitutions applied to Boolean expressions, we have the following immediate corollary.
As in the previous section, we have a substitution lemma which states that evaluating an expression $e$ in the composition of a valuation $V$ after a substitution $\theta$ gives the same value as evaluating in $V$ the expression $e\theta$.

Concrete choice
From the above substitution lemma we derive the following corollary.
Proof. We treat the case of global variables; the one for local variables is analogous:
We have the following correctness theorem of the symbolic execution of recursive programs: for every reachable configuration of a symbolic execution there exists an analogous concrete configuration reachable from every valuation satisfying the last path condition.

Theorem 3.5 (Correctness). For a main statement S with no occurrences of local variables and
Here $V \circ \sigma(x) = V(\sigma(x))$ for any global variable $x$, $V \circ \tau(u) = V(\tau(u))$ for any local substitution $\tau$ and local variable $u$, and $V \circ \Gamma$ denotes the concrete stack resulting from replacing every local environment $\tau$ in $\Gamma$ by $V \circ \tau$.
Proof As in Theorem 2.3, we proceed by induction on the length of the symbolic computation and a case analysis of the last execution step.
First, we consider the case of a global assignment as the last execution step:

is true. By induction hypothesis there exists a concrete computation
Let $V' = V \circ \sigma$ and $G'$ its restriction to global variables. By the concrete semantics of global assignment we have
By Corollary 3.4 it then suffices to observe that $V \circ (\sigma[x := e\theta]) = G'[x := V'(e)]$. The case of a local assignment is similar, using the local variable part of Corollary 3.4.
Given the procedure declaration $P(\bar{u}) :: S'$, next we consider the case of a procedure call as the last step of a symbolic execution
\[ \langle (\mathit{id}, S), \mathit{id}, \mathit{true} \rangle \to^* \langle (\tau, P(\bar{e}); S'') \cdot \Gamma, \sigma, \phi \rangle \to \langle (\tau', S') \cdot (\tau, S'') \cdot \Gamma, \sigma, \phi \rangle \]
where $\tau'(\bar{u}) = \bar{e}(\tau \cup \sigma)$. Let $V = G \cup L$ be such that $V(\phi)$ is true. By the induction hypothesis there exists a concrete computation
By the concrete semantics of procedure call we have
concluding the proof for this case.
The remaining interesting case of procedure return as the last step of a symbolic computation is easier:
Let $V = G \cup L$ be such that $V(\phi)$ is true. By the induction hypothesis there exists a concrete computation
By the concrete semantics of procedure return we have
The choice construct and the iterations can be treated in a similar way as in Theorem 2.3. □
As in the basic case, we have a similar completeness result for recursive procedures expressed in terms of reachable concrete configurations: for every reachable concrete configuration starting from a global valuation $G$ there exists an analogous reachable symbolic configuration with a path condition satisfied by $G$.
The case of global assignment is analogous. Next we treat another case, the one of a procedure call as the last execution step:
where $P(\bar{u}) :: S'$ and $L'(\bar{u}) = V(\bar{e})$, with $V = L \cup G$. By the induction hypothesis there exists a symbolic computation $\langle (\mathit{id}, S), \mathit{id}, \mathit{true} \rangle \to^* \langle (\tau, P(\bar{e}); S'') \cdot \Gamma, \sigma, \phi \rangle$ for some path condition $\phi$ and substitutions $\tau$ and $\sigma$ such that $V(\phi)$ is true, the concrete stack equals $V \circ \Gamma$, $G = V \circ \sigma$ and $L = V \circ \tau$. By the symbolic semantics of procedure call we have that
\[ \langle (\tau, P(\bar{e}); S'') \cdot \Gamma, \sigma, \phi \rangle \to \langle (\tau', S') \cdot (\tau, S'') \cdot \Gamma, \sigma, \phi \rangle \]
where $\tau'(\bar{u}) = \bar{e}\theta$, with $\theta = \tau \cup \sigma$. By the above we can conclude the proof with the following series of equalities
The case of choice and iteration constructs can be essentially treated in the same way as the completeness theorem of the previous section. All other cases are treated similarly. □

Object orientation
We next extend the language of the previous section with object-oriented features. We first introduce a distinction between variables $x, y, \ldots$, e.g., the formal parameters of methods, including the keyword this, and fields $f, f', \ldots$ (of the classes of a program). In contrast to the previous sections, because of dynamically allocated objects, we need to assume here an infinite number of variables. We abstract again from the typing information of the variables and fields. We have the following syntax of side-effect free programming expressions $e$:
\[ e ::= \mathsf{nil} \mid x \mid e.f \mid op(e_1, \ldots, e_n) \]
where nil stands for "undefined" (used to initialize the fields of newly created objects), $x$ is a variable, and in the expression $e.f$ we implicitly assume that $f$ is a field of the class of the object denoted by $e$. For notational convenience, in the sequel $h$ denotes a variable $x$ or an expression $e.f$. An expression $h$ is called a heap variable. As in the previous section $op$ denotes a built-in operation (which includes for example equality '='). We assume a ternary operation which describes conditional expressions written as if $b$ then $e_1$ else $e_2$ fi, where $b$ denotes a Boolean expression. Without loss of generality we assume that expressions which denote objects can only be used for dereferencing, for checking equality (denoted by '='), or as argument of a conditional expression. More specifically, we do not have, for example, built-in operations for constructing sets or sequences of objects.
Statements $S$ of the object-oriented programming language are then defined by the grammar:
A statement $h := \mathsf{new}\ C$ creates an instance of class $C$ (we model a call $x := \mathsf{new}\ C(\bar{e})$ of a constructor method by the object creation statement $x := \mathsf{new}\ C$ followed by a method call $x.C(\bar{e})$). A method call $h := e_0.m(\bar{e})$ specifies the called object $e_0$. The execution of a method terminates with a statement return $e$ which returns the value of $e$. Each class definition consists of a set of method definitions.
A method definition specifies its formal parameters and a statement (its "body"). Finally, a program consists of some class definitions and a main statement which only involves (global) variables.
The approach of the previous sections is naturally extended by interpreting fields as arrays, i.e., a heap expression $e.f$ is then interpreted as $f[e]$. This is briefly discussed in Section 4.1. Here we proceed with a more fundamental approach to symbolic execution which is based on symbolic execution traces and a weakest precondition calculus. In this new approach the symbolic execution of a program is described in terms of a transition relation between configurations of the form $\langle S, \rho \rangle$, where $\rho$ denotes a symbolic execution trace consisting of Boolean expressions and assignments. We will define the path condition of a symbolic trace by means of a weakest precondition calculus. Since our language does not include error-handling constructs, it suffices that the resulting path condition ensures the absence of so-called nil-pointer errors, which are generated by expressions $e.f$ in case the expression $e$ denotes the nil pointer. Note that such path conditions can still be used to test nil-pointer errors occurring at a specific control point. The symbolic execution of constructs for handling nil-pointer errors can be modeled following the basic approach described in Subsection 2.1 for array-out-of-bound errors, but is out of the scope of this paper, as are other object-oriented features like dynamic method dispatch. We expect, however, that most such additional features can be integrated by a symbolic representation of their standard operational semantics (similar to the symbolic representation of the standard operational semantics of the choice and iteration constructs).
For our object-oriented language we have the following transition rules for generating such traces by statically unfolding a statement.

Symbolic assignment
• $\langle h := e; S, \rho \rangle \to \langle S, \rho \cdot h := e \rangle$

Symbolic object creation
In the transition for an object creation statement $h := \mathsf{new}\ C$, a fresh variable $x$ is introduced which does not appear in $\langle h := \mathsf{new}\ C; S, \rho \rangle$. The introduction of a fresh variable in this transition makes it possible to disentangle aliasing and object creation in the definition of a path condition (as defined below).

In the transition for a method call $h := e_0.m(\bar{e})$, the condition $e_0 \neq \mathsf{nil}$ records that the callee is not nil. By $\bar{y} := \bar{e}$ we denote a sequence of assignments of the expressions $\bar{e}$ to the variables $\bar{y}$ (note that the order does not matter, because the variables $\bar{y}$ do not appear in $\bar{e}$). The variable $r$ is used to denote the return value (see below). Note that a method call is thus described symbolically by inlining the method body and replacing the formal parameters by fresh variables, which avoids name clashes between different method invocations. In the semantics of (recursive) procedures described in the previous section, such name clashes are avoided by the use of an explicit stack of closures, so that each call is executed in its own local environment (which assigns values to its local variables).

Symbolic method return
• $\langle \mathsf{return}\ e; S, \rho \rangle \to \langle S, \rho \cdot r := e \rangle$, where $r$ is the distinguished variable used to denote the return value.

The following symbolic transitions of the choice and iteration statements simply record the Boolean condition or its negation.

Symbolic choice
Note that the above symbolic transition system abstracts from nil-pointer errors, e.g., for the symbolic execution of the Boolean conditions in the choice and iteration statements we do not consider the third possibility that the evaluation of the condition generates a nil-pointer error. We emphasize here again that for the purpose of generating path conditions this is not needed because our language does not include error-handling constructs and thus a path condition should only ensure absence of such errors.
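The unfolding of a statement into symbolic traces can be illustrated by a minimal Python sketch. It covers only assignment, return, choice, and iteration; method calls and object creation are left out to keep it short. The tuple encoding of statements, the function name `traces`, and the `fuel` bound on loop unfoldings are assumptions made for this illustration. The treatment of choice and iteration follows the usual reading in which the taken branch records the condition and the other branch records its negation.

```python
def traces(stmts, trace=(), fuel=1):
    """Yield the symbolic traces of the statement sequence `stmts`.

    `trace` is the trace built so far; `fuel` bounds the number of loop
    unfoldings along a single path so that the enumeration terminates.
    """
    if not stmts:
        yield trace
        return
    head, rest = stmts[0], stmts[1:]
    kind = head[0]
    if kind == "assign":      # <h := e; S, rho> -> <S, rho . h := e>
        _, h, e = head
        yield from traces(rest, trace + (f"{h} := {e}",), fuel)
    elif kind == "return":    # <return e; S, rho> -> <S, rho . r := e>
        _, e = head
        yield from traces(rest, trace + (f"r := {e}",), fuel)
    elif kind == "if":        # record the condition or its negation
        _, b, then_branch, else_branch = head
        yield from traces(then_branch + rest, trace + (b,), fuel)
        yield from traces(else_branch + rest, trace + (f"¬({b})",), fuel)
    elif kind == "while":     # leave the loop, or unfold the body once more
        _, b, body = head
        yield from traces(rest, trace + (f"¬({b})",), fuel)
        if fuel > 0:
            yield from traces(body + (head,) + rest, trace + (b,), fuel - 1)
    else:
        raise ValueError(f"unknown statement: {head!r}")


# Hypothetical example program: i := 0; while i < n { i := i + 1 }; return i
program = (
    ("assign", "i", "0"),
    ("while", "i < n", (("assign", "i", "i + 1"),)),
    ("return", "i"),
)

for t in traces(program, fuel=1):
    print(" · ".join(t))
```

With `fuel=1` the sketch yields two traces: one that skips the loop and one that unfolds its body once. Note that the traces contain no control structure, only Boolean conditions and assignments.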
To define the path condition and the substitution of a symbolic execution trace, we first need some basic substitutions. In particular, the substitution $[x := \mathsf{new}\ C]$ is defined by cases on the structure of expressions; the last clause also covers the case of conditional expressions.
Further, following Subsection 2.1, we define the absence of nil-pointer errors by a predicate $\delta(e)$.

Definition 4.1 (Absence of nil-pointer errors).
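We define $\delta(e)$ inductively by

$$\begin{aligned}
\delta(\mathsf{nil}) &= \mathit{true} \\
\delta(x) &= \mathit{true} \\
\delta(e.f) &= e \neq \mathsf{nil} \wedge \delta(e) \\
\delta(op(e_1, \ldots, e_n)) &= \delta(e_1) \wedge \cdots \wedge \delta(e_n)
\end{aligned}$$

We can now define the path condition of an execution trace by moving forward (from left to right) through it. Recall that until now, we always appended the last "action" to the right of the execution trace.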
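Definition 4.1 can be made concrete with a small Python sketch. Expressions are encoded as nested tuples, and $\delta(e)$ is returned as the list of subexpressions that must differ from nil; the empty list stands for true. The encoding and the helper names (`var`, `field`, `op`, `show`, `delta`, `show_delta`) are assumptions made for this illustration, as is the reading that $\delta$ is trivially true for nil and for plain variables.

```python
NIL = ("nil",)


def var(name):
    return ("var", name)


def field(e, f):
    return ("field", e, f)


def op(name, *args):
    return ("op", name, args)


def show(e):
    """Render an expression in the notation of the text."""
    kind = e[0]
    if kind == "nil":
        return "nil"
    if kind == "var":
        return e[1]
    if kind == "field":
        return f"{show(e[1])}.{e[2]}"
    _, name, args = e
    return f"{name}({', '.join(show(a) for a in args)})"


def delta(e):
    """Return the expressions t for which 't ≠ nil' is a conjunct of δ(e)."""
    kind = e[0]
    if kind in ("nil", "var"):      # assumed: no obligation for nil and variables
        return []
    if kind == "field":             # δ(e.f) = e ≠ nil ∧ δ(e)
        return [e[1]] + delta(e[1])
    _, _name, args = e              # δ(op(e1..en)) = δ(e1) ∧ ... ∧ δ(en)
    return [t for a in args for t in delta(a)]


def show_delta(e):
    conjuncts = [f"{show(t)} ≠ nil" for t in delta(e)]
    return " ∧ ".join(conjuncts) if conjuncts else "true"


# δ(x.f.g = y.h)
example = op("=", field(field(var("x"), "f"), "g"), field(var("y"), "h"))
print(show_delta(example))
```

Every dereference in the expression contributes exactly one non-nil obligation, which is all that $\delta$ collects.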

Definition 4.2 (Path condition).
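We define the path condition $path(\rho)$ of a symbolic execution trace $\rho$ inductively (here $\epsilon$ denotes the empty path). In the clause for a Boolean condition we have $path(b \cdot \rho) = b \wedge path(\rho)$, and in the third clause $\delta(h, e)$ abbreviates the conjunction $\delta(h) \wedge \delta(e)$.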
Note that we do not include the predicate δ(b) in the path condition in case of a Boolean expression, because, as we will see below, if a Boolean expression evaluates to true then clearly its evaluation does not generate a nil-pointer error. In general, the above definition of a path condition thus ensures absence of nil-pointer errors.
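The backward flavour of this computation can be sketched in Python for a deliberately restricted class of traces: Boolean conditions and assignments to plain variables only. Under this restriction an ordinary syntactic substitution suffices. Assignments to fields and object creation need the dedicated substitutions of this section and are not covered here. The clause used for an assignment, $\delta(e) \wedge path(\rho)[x := e]$, is the usual weakest-precondition reading and is an assumption of this sketch, as are the tuple encoding and all helper names.

```python
NIL = ("nil",)


def var(name):
    return ("var", name)


def field(e, f):
    return ("field", e, f)


def op(name, *args):
    return ("op", name, args)


def show(e):
    kind = e[0]
    if kind == "nil":
        return "nil"
    if kind == "var":
        return e[1]
    if kind == "field":
        return f"{show(e[1])}.{e[2]}"
    _, name, args = e
    if len(args) == 2:                       # binary operations are shown infix
        return f"{show(args[0])} {name} {show(args[1])}"
    return f"{name}({', '.join(show(a) for a in args)})"


def delta(e):
    """Conjuncts of δ(e): one 't ≠ nil' per dereferenced subexpression t."""
    kind = e[0]
    if kind in ("nil", "var"):
        return []
    if kind == "field":
        return [op("≠", e[1], NIL)] + delta(e[1])
    return [c for a in e[2] for c in delta(a)]


def subst(e, x, r):
    """Replace the plain variable x by r in e (no aliasing can arise)."""
    kind = e[0]
    if kind == "nil":
        return e
    if kind == "var":
        return r if e[1] == x else e
    if kind == "field":
        return field(subst(e[1], x, r), e[2])
    return op(e[1], *(subst(a, x, r) for a in e[2]))


def path(trace):
    """Conjuncts of the path condition of a restricted trace."""
    if not trace:                            # empty trace: true
        return []
    head, rest = trace[0], trace[1:]
    if head[0] == "cond":                    # path(b · ρ) = b ∧ path(ρ)
        return [head[1]] + path(rest)
    _, x, e = head                           # ("assign", x, e), x a plain variable
    return delta(e) + [subst(c, x, e) for c in path(rest)]


# Trace: x := y.f · (x.g = nil)
trace = [
    ("assign", "x", field(var("y"), "f")),
    ("cond", op("=", field(var("x"), "g"), NIL)),
]
print(" ∧ ".join(show(c) for c in path(trace)))
```

The condition on `x.g` is pushed back through the assignment and becomes a condition on `y.f.g` in the initial state, while the assignment itself contributes the obligation that `y` is not nil. No obligation is generated for the Boolean condition, in line with the remark above.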
Next we discuss the concrete semantics. Given a program, for each of its classes we assume an unbounded set of object identities (which are disjoint for different classes). We denote such identities in the sequel by $o, o'$, etc. The concrete semantics can then be defined by a transition relation between configurations $\langle S, V \rangle$, where the semantics of method calls is defined as above in terms of body replacement (replacing the local variables by fresh variables). The details of this semantics are standard (see for example [AdBO09]) and therefore omitted. Here it suffices to observe that in general the standard concrete semantics of a program in any (imperative) programming language can also be defined in terms of the concrete semantics of its symbolic execution traces. Below we illustrate this general approach by means of our object-oriented language.
First, we define the value $V(e)$ of an expression $e$ in the configuration $V$. By '$V(e) = \bot$' we denote that the evaluation of $e$ gives rise to a nil-pointer error. Let $V = (G, H)$.
Each built-in operation $op$ is again evaluated by its given semantic interpretation, which is strict in the sense that it yields $\bot$ if one of its arguments is $\bot$.
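A minimal Python sketch of such an evaluation function is given below. It assumes that $G$ maps variables to values and $H$ maps pairs of an object identity and a field name to values. These representations, the sentinel `BOTTOM` for $\bot$, the string `"nil"` for the nil value, and the small operation table `OPS` are assumptions made for this illustration.

```python
NIL_VALUE = "nil"        # the value nil
BOTTOM = object()        # ⊥: the evaluation raised a nil-pointer error

# Hypothetical interpretations of two built-in operations.
OPS = {
    "=": lambda a, b: a == b,
    "+": lambda a, b: a + b,
}


def evaluate(e, G, H):
    """Evaluate the tuple-encoded expression e in the state V = (G, H)."""
    kind = e[0]
    if kind == "nil":
        return NIL_VALUE
    if kind == "var":
        return G[e[1]]
    if kind == "field":
        target = evaluate(e[1], G, H)
        if target is BOTTOM or target == NIL_VALUE:
            return BOTTOM                    # dereferencing nil (or an error)
        return H[(target, e[2])]
    _, name, args = e
    values = [evaluate(a, G, H) for a in args]
    if any(v is BOTTOM for v in values):
        return BOTTOM                        # built-in operations are strict
    return OPS[name](*values)


# State: x refers to object o1 whose field f holds 5; y is nil.
G = {"x": "o1", "y": NIL_VALUE}
H = {("o1", "f"): 5}

x_f = ("field", ("var", "x"), "f")
y_f = ("field", ("var", "y"), "f")

print(evaluate(x_f, G, H))                                   # value of x.f
print(evaluate(("op", "=", (y_f, ("nil",))), G, H) is BOTTOM)  # error propagates
```

Strictness is what lets a Boolean expression that evaluates to true serve as its own guarantee that no nil-pointer error occurred during its evaluation.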
The following lemma provides a semantic justification of the predicate $\delta(e)$.

Lemma 4.3 For any expression $e$ we have that $V(\delta(e)) = \mathit{true}$ if and only if $V(e) \neq \bot$.
Proof The proof involves a straightforward induction on the structure of $e$. □

Note that for any Boolean expression $b$, $V(b) = \mathit{true}$ thus implies $V(\delta(b)) = \mathit{true}$. In words, if a Boolean expression evaluates to true then clearly its evaluation does not generate a nil-pointer error. This explains the above definition of a path condition in case of Boolean expressions.
We are now sufficiently equipped to define the concrete semantics of symbolic execution traces as a transition relation between configurations $\langle V, \rho \rangle$.

Concrete object creation
In order to formally relate the semantics of a path condition of a symbolic execution trace and its concrete semantics as defined above, we need the following substitution lemma (see also [dB99]).
• $V(y) = G[x := o](y) = G(y)$. From the consistency of $V$ we conclude that $G(y) \neq \mathsf{nil}$ implies $H(G(y)) \neq \bot$.
• V (if b then e 1 else e 2 fi) equals either V (e 1 ) or V (e 2 ). By assumption both e 1 and e 2 are different from x , and so by induction both V (e 1 ) and V (e 2 ) are either nil or an old object.
where e is different from x . By induction V (e) is either nil or an old object, and so by the consistency of V and the So H (V (e).f ) H (V (e).f )) denotes an old object.
We can now proceed with the following case analysis. The last clause also covers the case of conditional expressions. □

The following soundness and completeness theorem abstracts from programs and is stated directly in terms of symbolic execution traces (which we only assume to be well-typed). For notational convenience, let $\langle V, \rho \rangle \to^*$ denote the existence of a concrete configuration $V'$ such that $\langle V, \rho \rangle \to^* \langle V', \epsilon \rangle$. The theorem then states that $\langle V, \rho \rangle \to^*$ if and only if $V(path(\rho)) = \mathit{true}$.

In the proof, we are left with the following case: $\rho = b \cdot \rho'$. By definition of the concrete semantics, $\langle V, \rho \rangle \to^*$ iff $V(b) = \mathit{true}$ and $\langle V, \rho' \rangle \to^*$. By the induction hypothesis we have that $\langle V, \rho' \rangle \to^*$ iff $V(path(\rho')) = \mathit{true}$. It then suffices to observe that by definition $path(\rho) = b \wedge path(\rho')$. □

Note that the above theorem is thus applicable to any symbolic trace generated by the symbolic transition system, that is, to any symbolic execution $\langle S, \epsilon \rangle \to^* \langle S', \rho \rangle$. In contrast to the approach of the previous sections, however, we do not have a direct symbolic representation by a substitution of the state $V'$ for which there exists a computation $\langle V, \rho \rangle \to^* \langle V', \epsilon \rangle$. This state $V'$ is only implicitly given by the symbolic trace $\rho$ itself. In fact, the development of our new approach is primarily motivated by avoiding the computation of such a substitution in the construction of the path condition.

Fields as arrays
In the above approach to the generation of a path condition no substitution representing a concrete state is needed. Instead, a weakest precondition calculus is used to calculate the path condition as the weakest precondition of the given symbolic trace (seen as a basic straight-line program). As such the path condition is calculated in a backward manner whereas the trace is generated by a basic forward symbolic execution of the given program.
We briefly describe here how to generate the path condition in one pass by a standard forward symbolic execution of the given program, using a substitution to represent the concrete state symbolically. As already stated above, we can do so simply by modeling fields symbolically as arrays of type $O \to T$, where $O$ is the type of all objects, and by processing field updates symbolically as array updates (as described in Subsection 2.1). More specifically, any expression $e$ is processed symbolically as the expression $A(e)$ defined inductively, e.g., $A(e.f) = f[A(e)]$. A heap update $h := e$ is then processed symbolically as the array assignment $A(h) := A(e)$. Thus we have reduced the object-oriented language to the language with recursive procedures (as introduced in Section 3) extended with arrays. The main question that remains is how to update the substitution representing a concrete state in case of an assignment $h := \mathsf{new}\ C$. We can do so by simply processing such an assignment symbolically in terms of a statement $h := x;\ x.f_1 := \mathsf{nil};\ \ldots;\ x.f_n := \mathsf{nil}$, where $x$ is a fresh variable and $f_1, \ldots, f_n$ are all the fields declared in class $C$. We can again remove occurrences of these fresh variables in the generated path condition by applying the substitutions $[x := \mathsf{new}\ C]$, as defined above but now reformulated in terms of arrays.
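The translation of fields into arrays can be sketched in a few lines of Python. The function `array_form` plays the role of $A$, and `new_object_array_form` spells out the statement sequence used for object creation. The tuple encoding of expressions, the string output, and all function names are assumptions made for this illustration.

```python
def array_form(e):
    """Translate a tuple-encoded expression e into its array form A(e)."""
    kind = e[0]
    if kind == "nil":
        return "nil"
    if kind == "var":
        return e[1]
    if kind == "field":                      # A(e.f) = f[A(e)]
        return f"{e[2]}[{array_form(e[1])}]"
    _, name, args = e                        # built-in operations are kept
    return f"{name}({', '.join(array_form(a) for a in args)})"


def assign_array_form(h, e):
    """A heap update h := e becomes the array assignment A(h) := A(e)."""
    return f"{array_form(h)} := {array_form(e)}"


def new_object_array_form(h, fresh, fields):
    """Process h := new C as h := x; x.f1 := nil; ...; x.fn := nil."""
    x = ("var", fresh)
    statements = [assign_array_form(h, x)]
    for f in fields:
        statements.append(assign_array_form(("field", x, f), ("nil",)))
    return statements


x_f_g = ("field", ("field", ("var", "x"), "f"), "g")
y_next = ("field", ("var", "y"), "next")

print(array_form(x_f_g))
print(assign_array_form(y_next, ("var", "z")))
# Hypothetical class with fields val and next; x0 is the fresh variable.
print(new_object_array_form(y_next, "x0", ["val", "next"]))
```

After this translation every field is a single array indexed by objects, so aliasing between heap expressions reduces to equality of array indices.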

Conclusion
Despite the popularity and success of symbolic execution techniques, to the best of our knowledge there is no general theory of symbolic execution which covers in a uniform manner mainstream programming features like arrays and (object-oriented) pointer structures, as well as local scoping as it arises in the passing of parameters in recursive procedure calls. In fact, most existing tools for symbolic execution lack an explicit formal specification and justification.
In this paper we provided a detailed discussion of the on-the-fly (or forward) symbolic generation of path conditions, using a symbolic representation of the concrete program state. We discussed the main (object-oriented) programming features, including arrays. In fact, we showed how to model arrays symbolically as mathematical functions and object structures as arrays.
We further introduced a new, more fundamental approach which computes the path condition by means of a weakest precondition calculus [BK94], and as such avoids the use of a symbolic representation of the concrete program state. This approach (introduced in Section 4) is based on symbolic execution traces which are obtained by statically unfolding the control flow of a program and accumulating the basic instructions, e.g., assignments and Boolean tests. As such these traces abstract from the specific control structures of the programming language. In the second phase, the application of a weakest precondition calculus for these basic instructions allows a path condition to be generated from a symbolic execution trace.
It should be noted that the notion of a path condition as a Boolean condition on the initial state which ensures the concrete execution of the program along a given path is only applicable to deterministic languages. However, our approach based on symbolic execution traces is applicable also to non-deterministic languages, e.g., multi-threaded shared variable programs. For example, in [dBBJ + 20] information about the non-deterministic choices, e.g., scheduling information, is naturally integrated in such traces for multi-threaded shared variable programs. In [dBBJ + 20] this information is further used for partial order reduction of the exploration of the symbolic execution traces.
In general, symbolic execution traces can be subject to different kinds of analysis, e.g., the computation of the path condition, the symbolic representation of the concrete program states, and partial-order reduction. Their abstraction from control structures also makes it possible to simulate the actual execution of a program by executing instead the symbolic representation of the final concrete program state encoded in such a trace. This also makes it possible to combine deductive verification and testing: for a path condition $\varphi$ and a substitution $\sigma$ representing the final concrete program state of a symbolic execution trace, instead of executing the program for a given initial state which satisfies $\varphi$ and testing whether the final state satisfies a condition $\psi$, we can also verify the implication $\varphi \to (\psi\sigma)$.
A further illustration of the generality of our approach is its application to concurrent objects as, for example, in the Abstract Behavioral Specification (ABS) language [JHS + 12] which describes systems of objects that interact via asynchronous method calls. Such a call spawns a corresponding process associated with the called object. Return values are communicated via futures [dBCJ07]. Each object cooperatively schedules its processes one at a time. The processes of an object can only access their local variables and the instance variables (fields) of the object. Symbolically, the semantics can be described analogously to the one in Section 4. To model the communication of the return values by futures we can introduce for each process a distinguished local variable.
The major potential practical implication of computing path conditions as the weakest precondition of a symbolic trace is that it supports a clear separation of concerns between the generation of symbolic traces by unfolding the program and the computation of the path condition by means of a weakest precondition calculus. The use of a backward calculation of path conditions by a weakest precondition calculus further avoids the (symbolic) state-space explosion due to various forms of aliasing. To investigate the practical implications of our work, we will develop prototype implementations to compare the performance between the two different formal models of symbolic execution discussed in this paper. These prototype implementations will further be used to compare performance with other tools, and investigate optimizations.
Another interesting research direction is the extension of our theory to concolic execution, which mixes symbolic and concrete executions [GKS05], and to symbolic backward execution [CFS09].