1 Introduction

In [11, 16, 17, 25], we introduced an approach for automatic termination analysis of C that also handles programs whose termination relies on the relation between allocated memory addresses and the data stored at such addresses. This approach is implemented in our tool AProVE [14]. Instead of analyzing C directly, AProVE compiles the program to LLVM code using Clang [9]. Then it constructs a (finite) symbolic execution graph (SEG) such that every program run corresponds to a path through the SEG. AProVE proves memory safety during the construction of the SEG to ensure absence of undefined behavior (which would also allow non-termination). Afterwards, the SEG is transformed into an integer transition system (ITS) such that all paths through the SEG (and hence, the C program) are terminating if the ITS is terminating. To analyze termination of the ITS, AProVE applies standard techniques and calls the tools T2 [7] and LoAT [12, 13] to detect non-termination of ITSs. However, like other termination tools for C, up to now AProVE supported dynamic data structures only in a very restricted way.

(Listing omitted: the example C program creating and traversing a list.)

In this paper, we introduce a novel technique to analyze C programs on lists. In the example program (a C sketch is shown below), nondet_uint returns a random unsigned integer. The for loop creates a list of n random numbers if \(\texttt {n}> 0\). The while loop traverses this list via pointer arithmetic: starting with tail, it computes the address of the next field of the current element by adding the offset of the next field within the list struct to the address of the current element, and then dereferences the computed address (i.e., it reads the content of the next field). The offset is obtained via offsetof, defined in the C library stddef.h. Since the list is acyclic and the next pointer of its last element is the null pointer, list traversal always terminates. Of course, the while loop could also traverse the list via ptr = ptr->next, but in C, memory accesses can be combined with pointer arithmetic. This example contains both the access via curr->next (when initializing the list) and pointer arithmetic (when traversing the list).
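
The original listing is not reproduced here, so the following is a minimal C sketch of such a program, based only on the description above. The struct layout, the field name val, and modeling nondet_uint with rand() are assumptions made for this sketch; the names nondet_uint, n, tail, curr, ptr, and the traversal via offsetof are taken from the text.

#include <stddef.h>
#include <stdlib.h>

/* nondet_uint is an external function in the benchmarks; we model it with rand() here */
static unsigned int nondet_uint(void) { return (unsigned int)rand(); }

struct list { unsigned int val; struct list *next; };

int main(void) {
    unsigned int n = nondet_uint();
    struct list *tail = NULL;

    /* the for loop: create a list of n random numbers (if n > 0) */
    for (unsigned int k = 0; k < n; k++) {
        struct list *curr = malloc(sizeof(struct list));
        if (curr == NULL) return 1;
        curr->val  = nondet_uint();     /* access via curr->... when initializing the list */
        curr->next = tail;
        tail = curr;
    }

    /* the while loop: traverse the list via pointer arithmetic, i.e., add the
       offset of the next field to the current element's address and dereference it */
    struct list *ptr = tail;
    while (ptr != NULL)
        ptr = *(struct list **)((char *)ptr + offsetof(struct list, next));

    return 0;
}

On a typical 64-bit platform, offsetof(struct list, next) is 8, which matches the offset of 8 bytes used in the LLVM code discussed in the following sections.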

We present a new general technique to infer list invariants via symbolic execution, which express all properties that are crucial for memory safety and termination. In our example, the list invariant contains the information that dereferencing the next pointer in the while loop is safe and that one finally reaches the null pointer. In general, our novel list invariants allow us to abstract from detailed information about lists (e.g., about their intermediate elements) such that abstract states with “similar” lists can be merged and generalized during the symbolic execution in order to obtain finite SEGs. At the same time, list invariants express enough information about the lists (e.g., their length, their start address, etc.) such that memory safety and termination can still be proved.

We define the abstract states used for symbolic execution in Sect. 2. In Sect. 3, after recapitulating the construction of SEGs, we adapt our techniques for merging and generalizing states from [25] to infer list invariants. Moreover, we adapt those rules for symbolic execution that are affected by introducing list invariants. Section 4 discusses the generation of ITSs and the soundness of our approach. Section 5 gives an overview of related work. Moreover, we evaluate the implementation of our approach in the tool AProVE using benchmark sets from SV-COMP [3] and the Termination Competition [15]. All proofs can be found in [18].

Limitations. To ease the presentation, in this paper we treat integer types as unbounded. Moreover, we assume that a program consists of a single non-recursive function and that values may be stored at any address. Our approach can also deal with bitvectors, data alignments, and programs with arbitrarily many (possibly recursive) functions, see [11, 16, 25] for details. However, so far only lists without sharing can be handled by our new technique. Extending it to more general recursive data structures is one of the main challenges for future work.

2 Abstract States for Symbolic Execution

The LLVM code for the for loop is given below. It is equivalent to the code produced by Clang without optimizations on a 64-bit computer. We explain it in detail in Sect. 3. To ease readability, we omitted instructions and keywords that are irrelevant for our presentation, renamed variables, and wrote list instead of struct.list. Moreover, we gave the C instructions before the corresponding LLVM code. The code consists of several basic blocks, including cmpF and bodyF (corresponding to the loop comparison and body).

(Listing omitted: LLVM code for the for loop.)

We now recapitulate the abstract states of [25] used for symbolic execution and extend them by a component \( LI \) for list invariants, i.e., they have the form \(((\texttt{b}, i), LV , AL , PT , LI , KB )\). The first component is a program position (\(\texttt{b}\), i), indicating that instruction \(i\) of block \(\texttt{b}\) is executed next. \( Pos \subseteq ( Blks \,\times \,\mathbb {N})\) is the set of all program positions, and \( Blks \) is the set of all basic blocks.

The second component is a partial injective function \( LV :\mathcal {V}_{\mathcal {P}}\rightharpoonup \mathcal {V}_{ sym }\), which maps local program variables \(\mathcal {V}_{\mathcal {P}}\) of the program \(\mathcal {P}\) to an infinite set \(\mathcal {V}_{ sym }\) of symbolic variables with \(\mathcal {V}_{ sym }\cap \mathcal {V}_{\mathcal {P}}= \varnothing \). We identify \( LV \) with the set of equations \(\{ \texttt {x}\,=\, LV (\texttt {x}) \mid \texttt {x} \in \mathop {domain}( LV )\}\) and we often extend \( LV \) to a function from \(\mathcal {V}_{\mathcal {P}}\uplus \mathbb {N}\) to \(\mathcal {V}_{ sym }\uplus \mathbb {N}\) by defining \( LV (n) = n\) for all \(n \in \mathbb {N}\).

The third component of each state is a set \( AL \) of (bytewise) allocations \(\llbracket {}v_1,\,v_2\rrbracket \) with \(v_1, v_2 \in \mathcal {V}_{ sym }\), which indicate that \(v_1 \le v_2\) and that all addresses between \(v_1\) and \(v_2\) have been allocated. We require any two entries \(\llbracket {}v_1,\,v_2\rrbracket \) and \(\llbracket {}w_1,\,w_2\rrbracket \) from \( AL \) with \(v_1 \ne w_1\) or \(v_2 \ne w_2\) to be disjoint.

The fourth and fifth components \( PT \) and \( LI \) model the memory contents. \( PT \) contains “points-to” entries of the form \(v_1 \hookrightarrow _{\texttt{ty}} v_2\) where \(v_1,v_2 \in \mathcal {V}_{ sym }\) and \(\texttt{ty}\) is an LLVM type, meaning that the address \(v_1\) of type \(\texttt{ty}\) points to \(v_2\). In contrast, the set \( LI \) of list invariants (which is new compared to [25]) does not describe pointwise memory contents but contains invariants where \(n\in \mathbb {N}_{>0}\), \(v_{ ad },v_\ell ,v_i,\hat{v}_i \in \mathcal {V}_{ sym }\), \( off _i \in \mathbb {N}\) for all \(1 \le i \le n\), \(\texttt{ty}\) and \(\texttt{ty}_i\) are LLVM types for all \(1 \le i \le n\), and there is exactly one “recursive field” \(1 \le j \le n\) such that \(\texttt {ty}_j = \texttt {ty*}\). Such an invariant represents a struct ty with n fields that corresponds to a recursively defined list of length \(v_\ell \). Here, \(v_{ ad }\) points to the first list element, the i-th field starts at address \(v_{ ad } + off _i\) (i.e., with offset \( off _i\)) and has type \(\texttt{ty}_i\), and the values of the i-th fields of the first and last list element are \(v_i\) and \(\hat{v}_i\), respectively. For example, the following list invariant (1) represents all lists of length \(x_{\ell }\) and type list whose elements store a 32-bit integer in their first field and the pointer to the next element in their second field with offset 8. The first list element starts at address \(x_\texttt {mem}\), the second starts at address \(x_\texttt {next}\), and the last element contains the null pointer. Moreover, the first element stores the integer value \(x_\texttt {nd}\) and the last list element stores the integer \(\hat{x}_\texttt {nd}\).

(Display omitted: list invariant (1).)

For example, this invariant represents the list with the allocation \(\llbracket {}x_{\texttt {mem}},\,x_{\texttt {mem}}+15\rrbracket \), where the first four bytes store the integer 5 and the last eight bytes store the pointer \(x_{\texttt {next}}\), and the allocation \(\llbracket {}x_{\texttt {next}},\,x_{\texttt {next}}+15\rrbracket \), where the first four bytes store the integer 2 and the last eight bytes store the null pointer (i.e., the address 0). Here, we have \(x_{\ell } = 2\). Section 3.2.2 will show that the expressiveness of our list invariants is indeed needed to prove termination of programs that traverse a list.
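
As an illustration (not taken from the paper), the following C snippet builds exactly this two-element instance. It assumes a typical 64-bit layout where sizeof(struct list) is 16 and the next field has offset 8; the field name val is invented for the sketch.

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct list { int val; struct list *next; };   /* i32 field at offset 0, list* field at offset 8 */

int main(void) {
    /* the offsets and the allocation size of 16 bytes assume a typical 64-bit layout */
    assert(offsetof(struct list, next) == 8 && sizeof(struct list) == 16);

    struct list *x_next = malloc(sizeof(struct list));   /* allocation [x_next, x_next + 15] */
    x_next->val  = 2;                                    /* first four bytes: the integer 2 */
    x_next->next = NULL;                                 /* last eight bytes: the null pointer */

    struct list *x_mem = malloc(sizeof(struct list));    /* allocation [x_mem, x_mem + 15] */
    x_mem->val  = 5;                                     /* first four bytes: the integer 5 */
    x_mem->next = x_next;                                /* last eight bytes: the pointer x_next */

    /* this concrete two-element list is an instance of invariant (1) with length 2 */
    free(x_mem);
    free(x_next);
    return 0;
}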

The last component of a state is a knowledge base \( KB \) of quantifier-free first-order formulas that express integer arithmetic properties of \(\mathcal {V}_{ sym }\). We identify sets of first-order formulas \(\{\varphi _1, \ldots , \varphi _m\}\) with their conjunction \(\varphi _1 \wedge \ldots \wedge \varphi _m\).

A special state \( ERR \) is reached if we cannot prove absence of undefined behavior (e.g., if memory safety might be violated by dereferencing the null pointer).

As an example, the following abstract state (2) represents concrete states at the beginning of the block cmpF, where the program variable curr is assigned the symbolic variable \(x_{\texttt {mem}}\), the allocation \(\llbracket {}x_{\texttt {k\_ad}},\,x_{\texttt {k\_ad}}^ end \rrbracket \) consisting of 4 bytes stores the value \(x_{\texttt {kinc}}\), and \(x_{\texttt {mem}}\) points to the first element of a list of length \(x_\ell \) (equal to \(x_{\texttt {kinc}}\)) that satisfies the list invariant (1). (This state will later be obtained during the symbolic execution, see State O in Fig. 3 in Sect. 3.1.)

(Display omitted: abstract state (2).)

A state \(s= (p, LV , AL , PT , LI , KB )\) is represented by a formula \(\langle {s}\rangle \) which contains \( KB \) and encodes \( AL \), \( PT \), and \( LI \) in first-order logic. This allows us to use standard SMT solving for all reasoning during the construction of the SEG. Moreover, \(\langle {s}\rangle \) is also used for the generation of the ITS afterwards. The encoding of \( AL \) and \( PT \) is as in [25], see [18]: \(\langle {s}\rangle \) contains formulas which express that allocated addresses are positive, that allocations represent disjoint memory areas, that equal addresses point to equal values, and that addresses are different if they point to different values. For each element of \( LI \), we add the following new formulas to \(\langle {s}\rangle \) which express that the list length \(v_\ell \) is \(\ge 1\) and the address \(v_{ ad }\) of the first element is not null. If \(v_\ell = 1\), then the values \(v_i\) and \(\hat{v}_i\) of the fields in the first and the last element are equal. If \(v_\ell \ge 2\), then the next pointer \(v_j\) in the first element must not be null. Finally, if there is a field whose values \(v_k\) and \(\hat{v}_k\) differ in the first and the last element, then the length \(v_\ell \) must be \(\ge 2\).

(Display omitted: the formulas added to \(\langle {s}\rangle \) for each list invariant.)
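
Since this display is omitted, the following is a sketch of these formulas in the notation of this section (the exact formulation in the paper may differ): \(v_\ell \ge 1\), \(\;v_{ ad } \ne 0\), \(\;v_\ell = 1 \Rightarrow \textstyle \bigwedge _{i=1}^{n} v_i = \hat{v}_i\), \(\;v_\ell \ge 2 \Rightarrow v_j \ne 0\), and \(\;\textstyle \bigwedge _{i=1}^{n} \bigl (v_i \ne \hat{v}_i \Rightarrow v_\ell \ge 2\bigr )\).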

In concrete states c, all values of variables and memory contents are determined uniquely. To ease the formalization, we assume that all integers are unsigned and refer to [16] for the general case. So for all \(v \, \in \, \mathcal {V}_{ sym }(c)\) (i.e., all \(v \, \in \, \mathcal {V}_{ sym }\) occurring in c) we have \(\models \langle {c}\rangle \Rightarrow v = n\) for some \(n \in \mathbb {N}\). Moreover, here \( PT \) only contains information about allocated addresses and \( LI = \varnothing \) since the abstract knowledge in list invariants is unnecessary if all memory contents are known.

Fig. 1. SEG for the First Iteration of the for Loop (figure omitted)

For instance, all concrete states \(((\texttt {cmpF},0), LV , AL , PT ,\varnothing , KB )\) represented by the state (2) contain \(\ell \) allocations of 16 bytes for some \(\ell \ge 1\), where in the first four bytes a 32-bit integer is stored and in the last eight bytes the address of the next allocation (or 0, in case of the last allocation) is stored.

See [18] for a formal definition to determine which concrete states are represented by a state s. To this end, as in [25] we define a separation logic formula \(\langle {s}\rangle _{ SL }\) which also encodes the knowledge contained in the memory components of states. To extend this formula to list invariants, we use a fragment similar to quantitative separation logic [4], extending conventional separation logic by list predicates. For any state s, we have \(\models \langle {s}\rangle _{ SL } \Rightarrow \langle {s}\rangle \), i.e., \(\langle {s}\rangle \) is a weakened version of \(\langle {s}\rangle _{ SL }\) that we use for symbolic execution and the termination proof.

3 Symbolic Execution with List Invariants

We first recapitulate the construction of SEGs. Then, Sect. 3.1 extends the technique for merging and generalization of states from [25] to infer list invariants. Finally, we adapt the rules for symbolic execution to list invariants in Sect. 3.2.

Our symbolic execution starts with a state \(A\) at the first instruction of the first block (called entry in our example). Figure 1 shows the first iteration of the for loop. Dotted arrows indicate that we omitted some symbolic execution steps. For every state, we perform symbolic execution by applying the corresponding inference rule as in [25] to compute its successor state(s) and repeat this until all paths end in return states. We call an SEG with this property complete.

As an example, we recapitulate the inference rule for the load instruction in the case where a value is loaded from allocated and initialized memory. It loads the value of type ty that is stored at the address ad to the program variable x. Let \( size (\texttt{ty})\) denote the size of \(\texttt{ty}\) in bytes for any LLVM type \(\texttt{ty}\). If we can prove that there is an allocation \(\llbracket {}v_1,\,v_2\rrbracket \) containing all addresses \( LV (\texttt {ad}), \ldots , LV (\texttt {ad})+ size (\texttt {ty})-1\) and there exists an entry \((w_1 \hookrightarrow _{\texttt {ty}} w_2) \in PT \) such that \(w_1\) is equal to the address \( LV (\texttt {ad})\) loaded from, then we transform the state s at position \(p = (\texttt{b},i)\) to a state \(s'\) at position \(p^+ = (\texttt{b},i+1)\). In \(s'\), a fresh symbolic variable w is assigned to x and \(w = w_2\) is added to \( KB \). We write \( LV [\texttt {x} := w]\) for the function where \( LV [\texttt {x} := w](\texttt {x}) = w\) and \( LV [\texttt {x} := w](\texttt {y}) = LV (\texttt {y})\) for all \(\texttt {y} \ne \texttt {x}\).

(Display omitted: the load rule.)
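
Since the rule display is omitted, here is a hedged reconstruction from the description above, written as an inference rule (premises above the bar, conclusion below); the actual presentation in the paper may differ. For an instruction that loads a value of type ty from the address ad into the program variable x at position \((\texttt {b},i)\), and for a fresh \(w \in \mathcal {V}_{ sym }\):

\( \dfrac{\begin{array}{c} \llbracket {}v_1,\,v_2\rrbracket \in AL \;\;\text {with}\;\; \models \langle {s}\rangle \Rightarrow \bigl (v_1 \le LV (\texttt {ad}) \,\wedge \, LV (\texttt {ad}) + size (\texttt {ty}) - 1 \le v_2\bigr ) \\ (w_1 \hookrightarrow _{\texttt {ty}} w_2) \in PT \;\;\text {with}\;\; \models \langle {s}\rangle \Rightarrow w_1 = LV (\texttt {ad}) \end{array}}{\bigl ((\texttt {b},i),\, LV ,\, AL ,\, PT ,\, LI ,\, KB \bigr ) \;\rightarrow \; \bigl ((\texttt {b},i+1),\, LV [\texttt {x} := w],\, AL ,\, PT ,\, LI ,\, KB \cup \{w = w_2\}\bigr )} \)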

In our example, the entry block comprises the first three lines of the C program and the initialization of the pointer to the loop variable \(\texttt {k}\): First, a non-deterministic unsigned integer is assigned to n, i.e., \((\texttt {n} = v_\texttt {n}) \in LV ^{B}\), where \(v_\texttt {n}\) is not restricted. Moreover, memory for the pointers tail_ptr and k_ad is allocated and they point to tail = NULL and k = 0, respectively (\(\texttt {tail\_ptr} = v_{\texttt {tp}}\) and \(\texttt {k\_ad} = v_{\texttt {k\_ad}}\) with \((v_{\texttt {tp}} \hookrightarrow _{\texttt {list*}} 0), (v_{\texttt {k\_ad}} \hookrightarrow _{\texttt {i32}} 0) \in PT ^B\)). For simplicity, in Fig. 1 we use concrete values directly instead of introducing fresh variables for them. Since we assume a 64-bit architecture, tail_ptr’s allocation contains 8 bytes. For the integer value of k, only 4 bytes are allocated. Alignments and pointer sizes depend on the memory layout and are given in the LLVM program.

State \(C\) results from \(B\) by evaluating the load instruction at \((\texttt {cmpF}{}, 0)\), see the above load rule. Thus, there is an evaluation edge from \(B\) to \(C\).

The next instruction is an integer comparison whose Boolean return value depends on whether the unsigned value of k is less than the one of n. If we cannot decide the validity of a comparison, we refine the state into two successor states. Thus, the states \(D\) and \(E\) (with \((v_\texttt {n} > 0) \in KB ^D\) and \((v_\texttt {n} \le 0) \in KB ^E\)) are reached by refinement edges from State \(C\). Evaluating \(D\) yields \(\texttt {kltn} = 1\) in \(F\). Therefore, the branch instruction leads to the block bodyF in State \(G\). State \(E\) is evaluated to a state with \(\texttt {kltn} = 0\). This path branches to the block initPtr and terminates quickly as tail_ptr points to an empty list.

The instruction at \((\texttt {bodyF}{}, 0)\) allocates 16 bytes of memory starting at \(v_{\texttt {mem}}\) in State \(H\). The next instruction casts the pointer to the allocation from i8* to list* and assigns it to curr. Now the allocated area can be treated as a list element. Then nondet_uint() is invoked to assign a 32-bit integer to nondet. The getelementptr instruction computes the address of the integer field of the list element by indexing this field (the second i32 0) based on the start address (curr). The first index (i32 0) specifies that a field of *curr itself is computed and not of another list stored after *curr. Since the address of the integer value of the list element coincides with the start address of the list element, this instruction assigns \(v_\texttt {mem}\) to curr_val. Afterwards, the value of nondet is stored at curr_val (\(v_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} v_\texttt {nd}\)), the value 0 stored at \(v_\texttt {tp}\) is loaded to tail, and a second getelementptr instruction computes the address of the recursive field of the current list element (\(v_{\texttt {cn}} = v_{\texttt {mem}} + 8\)) and assigns it to curr_next, leading to state \(J\). In the path to \(K\), the values of tail and curr are stored at curr_next and tail_ptr, respectively (\(v_{\texttt {cn}} \hookrightarrow _{\texttt {list*}} 0\), \(v_{\texttt {tp}} \hookrightarrow _{\texttt {list*}} v_{\texttt {mem}}\)). Finally, the incremented value of k is assigned to kinc and stored at k_ad (\(v_{\texttt {k\_ad}} \hookrightarrow _{\texttt {i32}} 1\)).

Fig. 2. Second Iteration of the for Loop (figure omitted)

To ensure a finite graph construction, when a program position is reached for the second time, we try to merge the states at this position to a generalized state. However, this is only meaningful if the domains of the \( LV \) functions of the two states coincide (i.e., the states consider the same program variables). Therefore, after the branch from the loop body back to cmpF (see State \(L\) in Fig. 2), we evaluate the loop a second time and reach \(M\). Here, a second list element with value \(w_\texttt {nd}\) and a next pointer \(w_\texttt {cn}\) pointing to \(v_\texttt {mem}\) has been stored at a new allocation \(\llbracket {}w_\texttt {mem}, w_\texttt {mem}^ end \rrbracket \). Now, curr points to the new element and k has been incremented again, so k_ad points to 2.

(Display omitted.)

3.1 Inferring List Invariants and Generalization of States

As mentioned, our goal is to merge \(L\) and \(M\) to a more general state \(O\) that represents all states which are represented by \(L\) or \(M\). The challenging part during generalization is to find loop invariants automatically that always hold at this position and provide sufficient information to prove termination of the loop. For \(O\), we can neither use the information that curr points to a struct whose next field contains the null pointer (as in \(L\)), nor that its next field points to another struct whose next field contains the null pointer (as in \(M\)).

With the approach of [25], when merging states like \(L\) and \(M\) where a list has different lengths, the merged state would only contain those list elements that are allocated in both states (often this is only the first element). Then elements which are the null pointer in one but not in the other state are lost. Hence, proving memory safety (and thus, also termination) fails when the list is traversed afterwards, since now there might be next pointers to non-allocated memory.

We solve this problem by introducing list invariants. In our example, we will infer an invariant stating that curr points to a list of length \(x_\ell \ge 1\). This invariant also implies that all struct fields are allocated and that there is no sharing.

To this end, we adapt the merging heuristic from [25]. To merge two states s and \(s'\) at the same program position with \(\mathop {domain}( LV ^s) = \mathop {domain}( LV ^{s'})\), we introduce a fresh symbolic variable \(x_\texttt{var}\) for each program variable \(\texttt{var}\) and use instantiations \(\mu _s\) and \(\mu _{s'}\) which map \(x_\texttt{var}\) to the corresponding symbolic variables of \(s\) and \({s'}\). For the merged state \({\overline{s}}\), we set \( LV ^{\overline{s}}(\texttt{var}) = x_\texttt{var}\). Moreover, we identify corresponding variables that only occur in the memory components and extend \(\mu _s\) and \(\mu _{s'}\) accordingly. In a second step, we check which constraints from the memory components and the knowledge base hold in both states in order to find invariants that we can add to the memory components and the knowledge base of \({\overline{s}}\). For example, if \(\llbracket {}\mu _s(x),\,\mu _s(x^{end})\rrbracket \in AL ^s\) and \(\llbracket {}\mu _{s'}(x),\,\mu _{s'}(x^{end})\rrbracket \in AL ^{s'}\) for \(x, x^{end} \in \mathcal {V}_{ sym }\), then \(\llbracket {}x,\,x^{end}\rrbracket \) is added to \( AL ^{\overline{s}}\). To extend this heuristic to lists, we have to regard several memory entries together. If there is an \(\texttt{ad} \in \mathcal {V}_{\mathcal {P}}\) such that \(\mu _s(x_\texttt{ad}) = v_1^ start \) and \(\mu _{s'}(x_\texttt{ad}) = w_1^ start \) both point to lists of type \(\texttt{ty}\) but of different lengths \(\ell _s\ne \ell _{s'}\) with \(\ell _s, \ell _{s'}\ge 1\), then we create a list invariant.

For a state s we say that \(v_1^ start \) points to a list of type \(\texttt{ty}\) with n fields and length \(\ell _s\) with allocations \(\llbracket {}v^ start _k,\,v^ end _k\rrbracket \) and values \(v_{k,i}\) (for \(1 \le k \le \ell _s\) and \(1 \le i \le n\)) if the following conditions (a)–(d) hold:

(a) \(\texttt{ty}\) is an LLVM struct type with subtypes \(\texttt{ty}_i\) and field offsets \( off _i \in \mathbb {N}\) for all \(1 \le i \le n\) such that there exists exactly one \(1 \le j \le n\) with \(\texttt{ty}_j = \mathtt {ty*}\).

(b) There exist pairwise different \(\llbracket {}v^ start _k,\,v^ end _k\rrbracket \in AL ^s\) for all \(1 \le k \le \ell _s\) and \(\models \, \langle {s}\rangle \Rightarrow v^ end _k = v^ start _k + size (\texttt{ty})-1\).

(c) For all \(1 \le k \le \ell _s\) and \(1 \le i \le n\) there exist \(v^ start _{k,i}, v_{k,i} \in \mathcal {V}_{ sym }\) with \(\models \, \langle {s}\rangle \Rightarrow v^ start _{k,i} = v^ start _k + off _i\) and \((v^ start _{k,i} \hookrightarrow _{\texttt{ty}_i} v_{k,i}) \in PT ^s\).

(d) For all \(1 \le k < \ell _s\) we have \(\models \, \langle {s}\rangle \Rightarrow v_{k,j} = v^ start _{k+1}\).

Condition (a) states that ty is a list type with n fields, where the pointer to the next element is in the j-th field. In (b) we ensure that each list element has a unique allocation of the correct size where \(v_1^ start \) is the start address of the first allocation. Condition (c) requires that for the k-th element, the initial address plus the i-th offset points to a value \(v_{k,i}\) of type \(\texttt{ty}_i\). Finally, (d) states that the recursive field of each element indeed points to the initial address of the next element.

Then, for fresh \(x_\ell ,x_i,\hat{x}_i \in \mathcal {V}_{ sym }\), we add the following list invariant to \( LI ^{\overline{s}}\).

(Display omitted: list invariant (3).)

To ensure that the allocations expressed by the list invariant are disjoint from all allocations in \( AL ^{\overline{s}}\), we do not use the list allocations \(\llbracket {}v^ start _k,\,v^ end _k\rrbracket \) to infer generalized allocations in \( AL ^{\overline{s}}\). Similarly, to create \( PT ^{\overline{s}}\), we only use entries \(v \hookrightarrow _{\texttt{ty}} w\) from \( PT ^s\) and \( PT ^{s'}\) where v is disjoint from the list addresses, i.e., where \(\models \langle {s}\rangle \Rightarrow v < v_k^ start \vee v > v_k^ end \) holds for all \(1 \le k \le \ell _s\) and analogously for \(s'\). Moreover, we add formulas to \( KB ^{\overline{s}}\) stating that (A) the length \(x_\ell \) of the list is at least the smaller length of the merged lists, (B) \(x_\ell \) is equal to all variables x which result from merging variables v and w that are equal to the lengths \(\ell _s\) and \(\ell _{s'}\) in \(s\) and \({s'}\), and (C) the symbolic variable \(x_i\) for the value of the i-th field of the first list element is equal to all variables x with \(\mu _s(x) = v_{1,i}\) and \(\mu _{s'}(x) = w_{1,i}\) where \(v_{1,i}\) and \(w_{1,i}\) are the values of the i-th field of the first list element in s and \(s'\) (and analogously for the values \(\hat{x}_i\) of the last list element):

(A) \(\min (\ell _s,\ell _{s'}) \le x_\ell \)

(B) \(\bigwedge _{x \in \mu ^{-1}_s(v) \cap \mu ^{-1}_{s'}(w)} x_\ell = x\) for all \(v,w \in \mathcal {V}_{ sym }\) with \(\models \, \langle {s}\rangle \Rightarrow v = \ell _s\) and \(\models \, \langle {{s'}}\rangle \Rightarrow w = \ell _{s'}\)

(C) \(\bigwedge _{x \in \mu ^{-1}_s(v_{1,i}) \cap \mu ^{-1}_{s'}(w_{1,i})} x_i = x\) and \(\bigwedge _{x \in \mu ^{-1}_s(v_{\ell _{s},i}) \cap \mu ^{-1}_{s'}(w_{\ell _{{s'}},i})} \hat{x}_i = x\) for all \(1 \le i \le n\)

Fig. 3. Merging of States (figure omitted)

To identify the variables in the list invariant (3) of \({\overline{s}}\) with the corresponding values in \(s\) and \({s'}\), the instantiations \(\mu _s\) and \(\mu _{s'}\) are extended such that \(\mu _s(x_\ell ) = \ell _s\), \(\mu _{s'}(x_\ell ) = \ell _{s'}\), \(\mu _s(x_i) = v_{1,i}\), \(\mu _{s'}(x_i) = w_{1,i}\), \(\mu _s(\hat{x}_i) = v_{\ell _{s},i}\), and \(\mu _{s'}(\hat{x}_i) = w_{\ell _{{s'}},i}\) for all \(1 \le i \le n\). Similarly, if there already exist list invariants in \(s\) and \({s'}\), for each pair of corresponding variables a new variable is introduced and mapped to its origin by \(\mu _s\) and \(\mu _{s'}\). This adaption of the merging heuristic only concerns the result of merging but not the rules when to merge two states. Thus, the same reasoning as in [25] can be used to prove soundness and termination of merging.

In our example, \(L\) and \(M\) contain lists of length \(\ell _L= 1\) and \(\ell _M= 2\). To ease the presentation, we re-use variables that are known to be equal instead of introducing fresh variables. If \(x_\texttt {mem}\) is the variable for the program variable curr, we have \(\mu _L(x_\texttt{mem}) = v_\texttt {mem}\) and \(\mu _M(x_\texttt{mem}) = w_\texttt {mem}\). Indeed, \(v_\texttt {mem}\) resp. \(w_\texttt {mem}\) points to a list with values \(v_{k,i}\) resp. \(w_{k,i}\) as defined in (a)–(d): For the type list with \(n=2\), \(\texttt{ty}_1 = \texttt{i32}\), \(\texttt{ty}_2 = \mathtt {list*}\), \( off _1 = 0\), \( off _2 = 8\), and \(j = 2\) (see (a)), we have \(\llbracket {}v_\texttt{mem},\,{v_{\texttt{mem}}^ end }\rrbracket \in AL ^L\) and \(\llbracket {}v_\texttt{mem},\,{v_{\texttt{mem}}^ end }\rrbracket \), \( \llbracket {}w_\texttt{mem},\,{w_{\texttt{mem}}^ end }\rrbracket \in AL ^M\), all consisting of \( size (\texttt{list}) = 16\) bytes, see (b). We have \((v_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} v_{\texttt {nd}}), (v_{\texttt {cn}} \hookrightarrow _{\texttt {list*}} 0) \in PT ^L\) with \((v_{\texttt {cn}} = v_{\texttt {mem}} + 8) \in KB ^L\) and \((v_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} v_{\texttt {nd}}), (v_{\texttt {cn}} \hookrightarrow _{\texttt {list*}} 0), (w_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} w_{\texttt {nd}}), (w_{\texttt {cn}} \hookrightarrow _{\texttt {list*}} v_{\texttt {mem}}) \in PT ^M\) with \((v_{\texttt {cn}} = v_{\texttt {mem}} + 8), (w_{\texttt {cn}} = w_{\texttt {mem}} + 8) \in KB ^M\) (see (c)), so the first list element in \(M\) points to the second one (see (d)). Therefore, when merging \(L\) and \(M\) to a new state \(O\) (see Fig. 3), the lists are merged to a list invariant of variable length \(x_\ell \) and we add the formulas (A) \(1 \le x_\ell \) and (B) \(x_\ell = x_\texttt {kinc}\) to \( KB ^O\). By (C), the \(\texttt{i32}\) value of the first element is identified with \(x_\texttt {nd}\), since \(\mu _L(x_\texttt {nd})\) is equal to the first value of the first list element in \(L\) and \(\mu _M(x_\texttt {nd})\) is equal to the first value of the first list element in \(M\). Similarly, the values of the last list elements are identified with 0, as in \(L\) and \(M\).

After merging s and \(s'\) to a generalized state \(\overline{s}\), we continue symbolic execution from \(\overline{s}\). The next time we reach the same program position, we might have to merge the corresponding states again. As described in [25], we use a heuristic for constructing the SEG which ensures that after a finite number of iterations, a state is reached that only represents concrete states that are also represented by an already existing (more general) state in the SEG. Then symbolic execution can continue from this more general state instead. So with this heuristic, the construction always ends in a complete SEG or an SEG containing the state \( ERR \).

We formalized the concept of “generalization” by a symbolic execution rule in [25]. Here, the state \(\overline{s}\) is a generalization of s if the conditions \((g1)-(g6)\) hold.

Condition (g1) prevents cycles consisting only of refinement and generalization edges in the graph. Condition (g2) states that the instantiation \(\mu :\mathcal {V}_{ sym }(\overline{s}) \rightarrow \mathcal {V}_{ sym }(s) \cup \mathbb {Z}\) maps symbolic variables from the more general state \(\overline{s}\) to their counterparts from the more specific state s such that they correspond to the same program variable. Conditions (g3)–(g6) ensure that all knowledge present in \(\overline{ KB }\), \(\overline{ AL }\), \(\overline{ PT }\), and \(\overline{ LI }\) still holds in s with the applied instantiation.

(Display omitted: the generalization rule with conditions (g1)–(g6).)

Condition (g6) is new compared to [25] and takes list invariants into account. So for every list invariant \(\overline{ l }\) of \(\overline{s}\) there is either a corresponding list invariant \( l \) in s such that lists represented by \( l \) in s are also represented by \(\overline{ l }\) in \(\overline{s}\), or there is a concrete list in s that is represented by \(\overline{ l }\) in \(\overline{s}\). The last condition of the latter case ensures that disjointness between the memory domains of \(\overline{ PT }\) and \(\overline{ LI }\) is preserved. See [18] for the soundness proof of the extended generalization rule, i.e., that every concrete state represented by s is also represented by \(\overline{s}\).

Our merging technique always yields generalizations according to this rule, i.e., the edges from \(L\) and \(M\) to \(O\) in Fig. 3 are generalization edges. Here, one chooses \(\mu _L\) and \(\mu _M\) such that \(\mu _L(x_\texttt{mem}) = v_\texttt{mem}\), \(\mu _L(x_\ell ) = 1\), \(\mu _L(x_\texttt{nd}) = v_\texttt{nd}\), \(\mu _L(\hat{x}_\texttt {nd})= v_\texttt{nd}\), \(\mu _L(x_\texttt{next}) = 0\), \(\mu _M(x_\texttt{mem}) = w_\texttt{mem}\), \(\mu _M(x_\ell ) = 2\), \(\mu _M(x_\texttt{nd}) = w_\texttt{nd}\), \(\mu _M(\hat{x}_\texttt {nd}) = v_\texttt{nd}\), and \(\mu _M(x_\texttt{next}) = v_\texttt{mem}\). In both cases, all conditions of the second case of (g6) with \(\ell _L= 1\) and \(\ell _M= 2\) are satisfied. With \(\mu _L(x_\texttt{kinc}) = 1\) resp. \(\mu _M(x_\texttt{kinc}) = 2\), we also have \(\models \langle {L}\rangle \Rightarrow \mu _L(x_\ell ) = \mu _L(x_\texttt{kinc})\) resp. \(\models \langle {M}\rangle \Rightarrow \mu _M(x_\ell ) = \mu _M(x_\texttt{kinc})\).

3.2 Adapting List Invariants

To handle and modify list invariants, three of our symbolic execution rules have to be changed. Section 3.2.1 presents a variant of the store rule where the list invariant is extended by an element. In Sect. 3.2.2, we adapt the load rule to load values from the first list element and we present a variant of the getelementptr rule for list traversal. Soundness of our new rules is proved in [18]. For all other instructions, the symbolic execution rules from [25] remain unchanged.

Fig. 4. Extending a List Invariant (figure omitted)

3.2.1 List Extension

After merging \(L\) and \(M\), symbolic execution continues from the more general state \(O\) in Fig. 3. Here, the values of k and kinc and the length of the list are not concrete but any positive (resp. non-negative) value with \(x_\ell = x_\texttt {kinc} = x_\texttt {k}+1\). The symbolic execution of \(O\) is similar to the steps from \(B\) to \(J\) in Sect. 3 (see Fig. 1). First, the value \(x_\texttt{kinc}\) stored at k_ad is loaded to k. To distinguish whether k < n still holds, the next state is refined. From the refined state with k < n, we enter the loop body again. A new block \(\llbracket {}y_{\texttt{mem}},\,y_{\texttt{mem}}^ end \rrbracket \) of 16 bytes is allocated and \(y_{\texttt{mem}}\) is assigned to mem and curr. Then, a new unknown value \(y_\texttt {nd}\) is assigned to nondet. The address of the i32 value of the current element (equal to \(y_{\texttt {mem}}\)) is computed by the first getelementptr instruction of the loop and the value \(y_\texttt{nd}\) of nondet is stored at it. The second getelementptr instruction computes the address \(y_{\texttt {cn}}\) of the recursive field and results in State \(P\) in Fig. 4, where \(y_{\texttt {cn}} = y_{\texttt {mem}}+8\) is added to \( KB ^P\). Now, store sets the address of the next field to the head of the list created in the previous iteration. Since this instruction extends the list by an element, instead of adding \(y_{\texttt {cn}} \hookrightarrow _{\texttt {list*}} x_{\texttt {mem}}\) to \( PT ^Q\), we extend the list invariant: The length is set to \(y_\ell \) and identified with \(x_\ell +1\) in \( KB ^Q\). The pointer \(x_{\texttt {mem}}\) to the first element is replaced by \(y_{\texttt {mem}}\), while the first recursive field in the list gets the value \(x_{\texttt {mem}}\). Since \((y_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} y_{\texttt {nd}}) \in PT ^P\), \(y_\texttt{nd}\) is the value of the first i32 integer in the list. We remove all entries from \( PT ^Q\) that are already contained in the new list invariant, e.g., \(y_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} y_{\texttt {nd}}\).

To formalize this adaptation of list invariants, we introduce a modified rule for store in addition to the one in [25]. It handles the case where there is a concrete list at some address \(v^{ start }\), pa points to the m-th field of this list’s first element, one wants to store a value \( t \) at the address pa, and one already has a list invariant l for the “tail” of the list in the j-th field (if \(m \ne j\)) resp. for the list at the address \( t \) (if \(m = j\)). In all other cases, the ordinary store rule is applied.

(Display omitted.)

More precisely, let the list invariant l describe a list of length \(v_\ell \) at the address \(v_ ad \). Then l is replaced by a new list invariant \(l'\) which describes the list at the address \(v^{ start }\) after storing t at the address pa. Irrespective of whether \(m \ne j\) or \(m = j\), the resulting list at \(v^{ start }\) has the list at \(v_ ad \) as its “tail” and thus, its length \(v_\ell '\) is \(v_\ell +1\). We prevent sharing of different elements by removing the allocation \(\llbracket {}v^ start ,\,v^ end \rrbracket \) of the list and all points-to information of pointers in \(\llbracket {}v^ start ,\,v^ end \rrbracket \).

(Display omitted: the modified store rule for list extension.)

3.2.2 List Traversal

After the current element \(y_\texttt {mem}\) is stored at \(x_\texttt {tp}\) and the value \(x_\texttt {kinc}\) of \(\texttt{k}\) is incremented to \(y_\texttt {kinc}\) and stored at \(x_\texttt {k\_ad}\), we reach a state \(R\) at position \((\texttt {cmpF}, 0)\) by the branch instruction. However, our already existing state \(O\) is more general than \(R\), i.e., we can draw a generalization edge from \(R\) to \(O\) using the generalization rule with the instantiation \(\mu _R\) where \(\mu _R(x_\texttt{mem}) = y_\texttt{mem}\), \(\mu _R(x_\texttt{nd}) = y_\texttt{nd}\), \(\mu _R(x_\texttt{cn}) = y_\texttt{cn}\), \(\mu _R(x_\texttt{k}) = x_\texttt{kinc}\), \(\mu _R(x_\texttt{kinc}) = y_\texttt{kinc}\), \(\mu _R(x_\ell ) = y_\ell \), \(\mu _R(\hat{x}_\texttt{nd}) = \hat{x}_\texttt{nd}\), and \(\mu _R(x_\texttt{next}) = x_\texttt{mem}\). Thus, the cycle of the first loop closes here.

(Displays omitted: LLVM code for the while loop and the corresponding states.)

As mentioned, in the path from \(O\) to \(R\) there is a state at position \((\texttt {cmpF}, 1)\) which is refined (similar to State \(C\)). If k < n holds, we reach \(R\). The other path, where k < n does not hold, leads out of the for loop to the block initPtr, followed by the while loop (see State \(S\) and the corresponding LLVM code). The value \(x_\texttt{mem}\) at address \(\mathtt {tail\_ptr}\) is loaded to tail’ and stored at a new pointer variable \(\texttt{ptr}\). State \(T\) is reached after the first iteration of the while loop body. Here, block cmpW loads the value \(x_\texttt{mem}\) stored at ptr to str. Since it is not the null pointer, we enter bodyW, which corresponds to the body of the while loop. First, \(x_\texttt{mem}\) is cast to an i8 pointer. Then getelementptr computes a pointer \(x_\texttt{np}\) to the next element by adding 8 bytes to \(x_\texttt{mem}\). After another cast back to a \(\texttt {list*}\) pointer, we load the content of the new pointer to next. To this end, we need the following new variant of the load rule to load values that are described by a list invariant.

Fig. 5. Traversing a List Invariant (figure omitted)

(Display omitted: the new load rule for list invariants.)

With this new load rule, the content of the new pointer is identified as \(x_\texttt{next}\). It is loaded to next and stored at \(x_\texttt{ptr}\). Then we return to the block cmpW (State \(T\)). Merging \(T\) with its predecessor at the same program position is not possible yet since the domains of the respective \( LV \) functions do not coincide. Now, \(x_\texttt{next}\) is loaded to str and compared to the null pointer. Since we do not have information about \(x_\texttt{next}\), \(T\)’s successor state is refined to a state with \(x_\texttt{next} = 0\) (which starts a path out of the loop to a return state), and to a state with \(x_\texttt{next} \ge 1\), which reaches \(U\) after a few evaluation steps, see Fig. 5. Now, getelementptr computes the pointer \(x_\texttt{np}' = x_\texttt{next} + 8\) to the third element of the list, which is assigned to next_ptr. \(\langle {U}\rangle \) contains \(x_\ell \ge 2\) since the first and the last pointer value are known to be different (\(x_\texttt{next} \ne 0\)). This information is crucial for creating a new list invariant starting at \(x_\texttt{next}\), which is used in the next iteration of the loop. Therefore, if our list invariant did not contain variables for the first and the last pointer, we could not prove termination of the program. In such a case where the pointer to the third element of a list invariant is computed and the length of the list is at least two, we traverse the list invariant to retain the correspondence between the computed pointer \(x'_\texttt {np}\) and the new list invariant. In the resulting state \(V\), we represent the first list element by an allocation \(\llbracket {}x_\texttt {mem},\,x_\texttt {mem}^ end \rrbracket \) and preserve all knowledge about this element that was encoded in the list invariant (\(x_\texttt {mem}^ end = x_{\texttt {mem}} + 15\), \(x_{\texttt {mem}} \hookrightarrow _{\texttt {i32}} x_{\texttt {nd}}\), \(x_{\texttt {np}} \hookrightarrow _{\texttt {list*}} x_{\texttt {next}}\)). Moreover, we adapt the list invariant such that it now represents the list at \(x_\texttt{next}\) (i.e., without its first element) starting with the value \(x'_\texttt {nd}\). We also relate the length of the new list invariant to the length of the former one (\(x'_{\ell } = x_{\ell } - 1\)).

Thus, in addition to the rule for getelementptr in [25], we now introduce rules for list traversal via getelementptr. The rule below handles the case where the address calculation is based on the type i8 and the getelementptr instruction adds the number of bytes given by the term t to the address pa. Here, the offsets in our list invariants are needed to compute the address of the accessed field. We also have similar rules for list traversal via field access (i.e., where the next element is accessed using curr’->next as in the for loop) and for the case where we cannot prove that the length \(v_{\ell }\) of the list is at least 2, see [18].

(Display omitted: the getelementptr rule for list traversal.)

We continue the symbolic execution of State \(V\) in our example and finally obtain a complete SEG with a path from a state W at the position \((\texttt {cmpW}, 0)\) to the next state \(W'\) at this position, and a generalization edge back from \(W'\) to W using an instantiation \(\mu _{W'}\). Both W and \(W'\) contain a list invariant similar to T where instead of the length \(x_\ell \) in T, we have the symbolic variables \(z_\ell \) and \(z_\ell '\) in W and \(W'\), where \(\mu _{W'}(z_\ell ) = z_\ell '\) (see [18] for more details).

(Display omitted.)

4 Proving Termination

To prove termination of a program \(\mathcal {P}\), as in [25] the cycles of the SEG are translated to an integer transition system whose termination implies termination of \(\mathcal {P}\). The edges of the SEG are transformed into ITS transitions whose application conditions consist of the state formulas \(\langle {s}\rangle \) and equations to identify corresponding symbolic variables of the different states. For evaluation and refinement edges, the symbolic variables do not change. For generalization edges, we use the instantiation \(\mu \) to identify corresponding symbolic variables. In our example, the ITS has cyclic transitions of the following form:

(Display omitted: the cyclic ITS transitions.)
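
Since this display is omitted, the following sketch indicates the shape of the two cyclic transitions discussed below, writing \(\ell _O\) and \(\ell _W\) for the ITS locations corresponding to the states \(O\) and \(W\), and abbreviating the variable tuples (the actual transitions contain further variables and the full state formulas as conditions):

\(\ell _O(x_\texttt {k}, x_\texttt {n}, \ldots ) \rightarrow \ell _O(x_\texttt {kinc}, x_\texttt {n}, \ldots ) \;\; [\,x_\texttt {kinc} = x_\texttt {k} + 1 \,\wedge \, x_\texttt {n} > x_\texttt {k}\,]\)
\(\ell _W(z_\ell , \ldots ) \rightarrow \ell _W(z'_\ell , \ldots ) \;\; [\,z'_\ell = z_\ell - 1 \,\wedge \, z_\ell \ge 1\,]\)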

The first cycle resulting from the generalization edge from \(R\) to \(O\) terminates since k is increased until it reaches n. The generalization edge yields a condition identifying \(x_\texttt{kinc}\) in \(R\) with \(x_\texttt{k}\) in \(O\), since \(\mu _R(x_\texttt{k}) = x_\texttt{kinc}\). With the conditions \(x_\texttt {kinc} = x_\texttt {k} + 1\) and \(x_\texttt {n} > x_\texttt {k}\) (from \( KB ^O\)), the resulting transitions of the ITS are terminating. The second cycle from the generalization edge from \(W'\) to \(W\) terminates since the length of the list starting with curr’ decreases. Although there is no program variable for the length, due to our list invariants the states contain variables for this length, which are also passed to the ITS. Thus, the ITS contains the variable \(z_{\ell }\) (where \(z_{\ell }\) in \(W\) is identified with \(z'_{\ell }\) in \(W'\) due to \(\mu _{W'}(z_{\ell }) = z'_{\ell }\)). Since the condition \(z'_{\ell } = z_{\ell } - 1\) is obtained on the path from \(W\) to \(W'\) and \(z_{\ell } \ge 1\) is part of \(\langle {W}\rangle \) due to the list invariant with length \(z_{\ell }\) in \( LI ^W\), the resulting transitions of the ITS clearly terminate. Analogous to [25, Cor. 11 and Thm. 13], we obtain the following theorem. To prove that a complete SEG represents all program paths, in [25] we used the LLVM semantics defined by the Vellvm project [26]. One now also has to prove soundness of those symbolic execution rules which were modified due to the new concept of list invariants (i.e., generalization, list extension, and list traversal), see [18].

Theorem 1 (Memory Safety and Termination)

Let \(\mathcal {P}\) be a program with a complete SEG \(\mathcal {G}\). Since a complete SEG does not contain \( ERR \), \(\mathcal {P}\) is memory safe for all concrete states represented by the states in \(\mathcal {G}\). If the ITS corresponding to \(\mathcal {G}\) is terminating, then \(\mathcal {P}\) is also terminating for all states represented by \(\mathcal {G}\).

5 Conclusion, Related Work, and Evaluation

We presented a new approach for automated proofs of memory safety and termination of C/LLVM programs on lists. It first constructs a symbolic execution graph (SEG) which overapproximates all program runs. Afterwards, an integer transition system (ITS) is generated from this graph, whose termination is proved using standard techniques. The main idea of our new approach is the extension of the states in the SEG by suitable list invariants. We developed techniques to infer and modify list invariants automatically during the symbolic execution.

During the construction of the SEG, the list invariants abstract from a concrete number of memory allocations to a list of allocations of variable length while preserving knowledge about some of the contents (the values of the fields of the first and the last element) and the list shape (the start address of the first element, the list length, and the content of the last recursive pointer which allows us to distinguish between cyclic and acyclic lists). They also contain information on the memory arrangement of the list fields which is needed for programs that access fields via pointer arithmetic. The symbolic variables for the list length and the first and last values of list elements are preserved when generating an ITS from the SEG. Thus, they can be used in the termination proof of the ITS (e.g., the variables for list length can occur in ranking functions).

In [5, 6, 22] we developed a technique for termination analysis of Java, based on a program transformation to integer term rewrite systems instead of ITSs. This approach does not require specific list invariants as recursive data structures on the heap are abstracted to terms. However, these terms are unsuitable for C, since they cannot express memory allocations and the connection to their contents.

Separation logic predicates have also been used for termination analysis of list programs in earlier work, but those list predicates only consider the list length and the recursive field, not other fields or offsets. The tools Cyclist [24] and HipTNT+ [19] are integrated in separation logic systems which also allow the definition of heap predicates. However, they require annotations and hints indicating which parameters of the list predicates are needed as a termination measure. The tool 2LS [20] also provides basic support for dynamic data structures. However, none of these approaches is suitable if termination depends on the contents or the shape of data structures combined with pointer arithmetic. In [10], programs can be annotated with arithmetic and structural properties to reason about termination. In contrast, our approach does not need hints or annotations, but finds termination arguments fully automatically.

We implemented our approach in AProVE [25]. While C programs with lists are very common, existing tools can hardly prove their termination. Therefore, the current benchmark collections for termination analysis contain almost no list programs. In 2017, a benchmark set of 18 typical C programs on lists was added to the Termination category of the Competition on Software Verification (SV-COMP) [3], where 9 of them are terminating. Two of these 9 programs do not need list invariants, because they just create a list without operating on it afterwards. The remaining seven terminating programs create a list and then traverse it, search for a value, or append lists and compute the length afterwards. Only a few tools in SV-COMP produced correct termination proofs for programs from this set: HipTNT+ and 2LS failed for all of them. CPAchecker [2] and PeSCo [23] proved termination and non-termination for one of these programs in 2020. UAutomizer [8] proved termination for two and non-termination for seven programs. The termination proofs of CPAchecker, PeSCo, and UAutomizer only concern the programs that just create a list. Our new version of AProVE is the only termination prover that succeeds if termination depends on the shape or contents of a list after its creation. Note that for non-termination, a proof is a single non-terminating program path, so here list invariants are less helpful.

For the Termination Competition [15] 2022, we submitted 18 terminating C programs on lists (different from the ones at SV-COMP), where two of them just create a list. Three traverse it afterwards (by a loop or recursion), and ten search for a value, where for nine of them, the list contents are also relevant for termination. Three programs perform common operations like inserting or deleting an element. UAutomizer proves termination for a program that just creates a list but not for programs operating on the list afterwards. With our approach, AProVE succeeds on 17 of the 18 programs. Overall, AProVE and UAutomizer were the two most powerful tools for termination of C in SV-COMP 2022 and the Termination Competition 2022, with UAutomizer winning the former and AProVE winning the latter competition. To download AProVE, run it via its web interface, and for details on our experiments, see https://aprove-developers.github.io/recursive_structs.

 

             SV-C T.     SV-C Non-T.   TermCmp T.
AProVE       7 (of 9)    5 (of 9)      17 (of 18)
UAutomizer   2 (of 9)    7 (of 9)      1 (of 18)