A Decision Tree Lifted Domain for Analyzing Program Families with Numerical Features

Lifted (family-based) static analysis by abstract interpretation is capable of analyzing all variants of a program family simultaneously, in a single run without generating any of the variants explicitly. The elements of the underlying lifted analysis domain are tuples, which maintain one property per variant. Still, explicit property enumeration in tuples, one by one for all variants, immediately yields combinatorial explosion. This is particularly apparent in the case of program families that, apart from Boolean features, contain also numerical features with large domains, thus giving rise to astronomical configuration spaces. The key for an efficient lifted analysis is a proper handling of variability-specific constructs of the language (e.g., feature-based runtime tests and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\texttt {\#if}$$\end{document}#if directives). In this work, we introduce a new symbolic representation of the lifted abstract domain that can efficiently analyze program families with numerical features. This makes sharing between property elements corresponding to different variants explicitly possible. The elements of the new lifted domain are constraint-based decision trees, where decision nodes are labeled with linear constraints defined over numerical features and the leaf nodes belong to an existing single-program analysis domain. To illustrate the potential of this representation, we have implemented an experimental lifted static analyzer, called SPLNum \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^2$$\end{document}2 Analyzer, for inferring invariants of C programs. An empirical evaluation on BusyBox and on benchmarks from SV-COMP yields promising preliminary results indicating that our decision trees-based approach is effective and outperforms the baseline tuple-based approach.


Introduction
Many software systems today are configurable [6]: they use features (or configurable options) to control the presence and absence of functionality. Different family members, called variants, are derived by switching features on and off, while the reuse of common code is maximized, leading to productivity gains, shorter time to market, greater market coverage, etc. Program families (e.g., software product lines) are commonly seen in the development of commercial embedded software, such as cars, phones, avionics, medicine, robotics, etc. Configurable options (features) are used to either support different application scenarios for embedded components, to provide portability across different hardware platforms and configurations, or to produce variations of products for different market segments or customers. We consider here program families implemented using #if directives from the C preprocessor CPP [20]. They use #if-s to specify in which conditions parts of code should be included or excluded from a variant. Classical program families use only Boolean features that have two values: on and off. However, Boolean features are insufficient for real-world program families, as there exist features that have a range of numbers as possible values. These features are called numerical features [25]. For instance, Linux kernel, BusyBox, Apache web server, Java Garbage Collector represent some real-world program families with numerical features. Analyzing such program families is very challenging, due to the fact that from only a few features, a huge number of variants can be derived.
In this paper, we are concerned with the verification of program families with Boolean and numerical features using abstract interpretation-based static analysis. Abstract interpretation [7,24] is a general theory for approximating the semantics of programs. It provides sound (all confirmative answers are correct) and efficient (with a good trade-off between precision and cost) static analyses of run-time properties of real programs. It has been used as the foundation for various successful industrial-scale static analyzers, such as ASTREE [8]. Still, the static analysis of program families is harder than the static analysis of single programs, because the number of possible variants can be very large (often huge) in practice. The simplest brute-force approach that uses a preprocessor to generate all variants of a family, and then applies an existing off-the-shelf single-program analyzer to each individual variant, one-by-one, is very inefficient [3,27]. Therefore, we use so-called lifted (family-based) static analyses [3,22,27], which analyze all variants of the family simultaneously without generating any of the variants explicitly. They take as input the common code base, which encodes all variants of a program family, and produce precise analysis results corresponding to all variants. They use a lifted analysis domain, which represents an n-fold product of an existing single-program analysis domain used for expressing program properties (where n is the number of valid configurations). That is, the lifted analysis domain maintains one property element per valid variant in tuples. The problem is that this explicit property enumeration in tuples becomes computationally intractable with larger program families because the number of variants (i.e., configurations) grows exponentially with the number of features. This problem has been successfully addressed for program families that contain only Boolean features [1,2,11], by using sharing through binary decision diagrams (BDDs). However, the fundamental limitation of existing lifted analysis techniques is that they are not able to handle numerical features.
To overcome this limitation, we present a new, refined lifted abstract domain for effectively analyzing program families with numerical features by means of abstract interpretation. The elements of the lifted abstract domain are constraintbased decision trees, where the decision nodes are labelled with linear constraints over numerical features, whereas the leaf nodes belong to a single-program analysis domain. The decision trees recursively partition the space of configurations (i.e., the space of possible combinations of feature values), whereas the program properties at the leaves provide analysis information corresponding to each partition, i.e. to the variants (configurations) that satisfy the constraints along the path to the given leaf node. The partitioning is dynamic, which means that partitions are split by feature-based tests (at #if directives), and joined when merging the corresponding control flows again. In terms of decision trees, this means that new decision nodes are added by feature-based tests and removed when merging control flows. In fact, the partitioning of the set of configurations is semantics-based, which means that linear constraints over numerical features that occur in decision nodes are automatically inferred by the analysis and do not necessarily occur syntactically in the code base.
Our lifted abstract domain is parametric in the choice of numerical property domain [7,24] that underlies the linear constraints over numerical features labelling decision nodes, and the choice of the single-program analysis domain for leaf nodes. In fact, in our implementation, we also use numerical property domains for leaf nodes, which encode linear constraints over program variables. We rely on the well-known numerical domains, such as intervals [7], octagons [23], polyhedra [10], from the APRON library [19] to obtain a concrete decision tree-based implementation of the lifted abstract domain. This way, we have implemented a forward reachability analysis of C program families with numerical (and Boolean) features for the automatic inference of invariants. Our tool, called SPLNum 2 Analyzer 4 , computes a set of possible invariants, which represent linear constraints over program variables. We can use the implemented lifted static analyzer to check invariance properties of C program families, such as assertions, buffer overflows, null pointer references, division by zero, etc [8].
In summary, we make several contributions: (1) We propose a new, parameterized lifted analysis domain based on decision trees for analyzing program families with numerical features; (2) We implement a prototype lifted static analyzer, SPLNum 2 Analyzer, that performs a forward analysis of #if-enriched C programs, where numerical property domains from the APRON library are used as parameters in the lifted analysis domain; (3) We evaluate our approach for automatic inference of invariants by comparing performances of lifted analyzers based on tuples and decision trees.

Motivating Example
To illustrate the potential of a decision tree-based lifted domain, we consider a motivating example using the code base of the following program family SIMPLE: The set F of features is {B, SIZE}, where B is a Boolean feature and SIZE is a numerical feature whose domain is [1,4] The code of SIMPLE contains two #if directives, which change the value assigned to y, depending on how features from F are set at compile-time. For each configuration from K, a different variant (single program) can be generated by appropriately resolving #if-s. For example, the variant corresponding to configuration B ∧ (SIZE = 1) will have B and SIZE set to true and 1, so that the assignment y := y+1 and skip in program locations 4 and 5 , respectively, will be included in this variant. The variant for configuration ¬B ∧ (SIZE = 4) will have features B and SIZE set to false and 4, so the assignments y := y-1 and y := 0 in program locations 4 and 5 , respectively, will be included in this variant. There are |K| = 8 variants that can be derived from the family SIMPLE.
Assume that we want to perform lifted polyhedra analysis of SIMPLE using the Polyhedra numerical domain [10]. The standard lifted analysis domain used in the literature [3,22] is defined as cartesian product of |K| copies of the basic analysis domain (e.g. polyhedra). Hence, elements of the lifted domain are tuples containing one component for each valid configuration from K, where each component represents a polyhedra linear constraint over program variables (x and y in this case). The lifted analysis result in location 7 of SIMPLE is an 8-sized tuple shown in Fig. 1. Note that the first component of the tuple in Fig. 1 corresponds to configuration B ∧ (SIZE = 1), the second to B ∧ (SIZE = 2), the third to B ∧ (SIZE = 3), and so on. We can see in Fig. 1 that the polyhedra analysis discovers very precise results for the variable y: (y = 10) for configurations B ∧ (SIZE = 1), B ∧ (SIZE = 2), and B ∧ (SIZE = 3); (y = −10) for configuration B ∧ (SIZE = 4); and (y = 0) for all other configurations. This is due to the fact that the polyhedra domain is fully relational and is able to track all relations between program variables x and y. Using this result in location 7 , we can successfully conclude that the assertion is valid for configurations B∧(SIZE = 1), B∧(SIZE = 2), and B ∧ (SIZE = 3), whereas the assertion fails for all other configurations.
If we perform lifted polyhedra analysis based on the decision tree domain proposed in this work, then the corresponding decision tree inferred in the final program location 7 of SIMPLE is depicted in Fig. 2. Notice that the inner nodes of the decision tree in Fig. 2 are labeled with Interval linear constraints over features (SIZE and B), while the leaves are labeled with the Polyhedra linear constraints over program variables x and y. Hence, we use two different numerical abstract domains in our decision trees: Interval domain [7] for expressing properties in decision nodes, and Polyhedra domain [10] for expressing properties  in leaf nodes. The edges of decision trees are labeled with the truth value of the decision on the parent node; we use solid edges for true (i.e. the constraint in the parent node is satisfied) and dashed edges for false (i.e. the negation of the constraint in the parent node is satisfied). As decision nodes partition the space of valid configurations K, we implicitly assume the correctness of linear constraints that take into account domains of numerical features. For example, the node with constraint [1,4] of SIZE. We can see that decision trees offer more possibilities for sharing and interaction between analysis properties corresponding to different configurations, they provide symbolic and compact representation of lifted analysis elements. For example, Fig. 2 presents polyhedra properties of two program variables x and y, which are partitioned with respect to features B and SIZE. When (B ∧ (SIZE ≤ 3)) is true the shared property is (y = 10, x = 0), whereas when (B ∧ ¬(SIZE ≤ 3)) is true the shared property is (y = −10, x = 0). When ¬B is true, the property is independent from the value of SIZE, hence a node with a constraint over SIZE is not needed. Therefore, all such cases are identical and so they share the same leaf node (y = 0, x = 0). In effect, the decision tree-based representation uses only three leafs, whereas the tuple-based representation uses eight properties. This ability for sharing is the key motivation behind the decision trees-based representation.

A Language for Program Families
Let F = {A 1 , . . . , A k } be a finite and totaly ordered set of numerical features available in a program family. For each feature A ∈ F, dom(A) ⊆ Z denotes the set of possible values that can be assigned to A. Note that any Boolean feature can be represented as a numerical feature B ∈ F with dom(B) = {0, 1}, such that 0 means that feature B is disabled while 1 means that B is enabled. A valid combination of feature's values represents a configuration k, which specifies one variant of a program family. It is given as a valuation function k : which is a mapping that assigns a value from dom(A) to each feature A, i.e. k(A) ∈ dom(A) for any A ∈ F. We assume that only a subset K of all possible configurations are valid. An alternative representation of configurations is based upon propositional formulae. Each configuration k ∈ K can be represented by a formula: . We often abbreviate (B = 1) with B and (B = 0) with ¬B, for a Boolean feature B ∈ F. The set of valid configurations K can be also represented as a formula: ∨ k∈K k. We define feature expressions, denoted FeatExp(F), as the set of propositional logic formulas over constraints of F generated by the grammar: Example 1. For the SIMPLE program family from Section 2, the set of features is F = {B, SIZE} where dom(SIZE) = [1,4], and the set of configurations is We consider a simple sequential non-deterministic programming language, which will be used to exemplify our work. The program variables Var are statically allocated and the only data type is the set Z of mathematical integers. To encode multiple variants, a new compile-time conditional statement is included. The new statement "#if (θ) s #endif" contains a feature expression θ ∈ FeatExp(F) as a presence condition, such that only if θ is satisfied by a configuration k ∈ K the statement s will be included in the variant corresponding to k. The syntax is: where n ranges over integers, [n, n ] over integer intervals, x over program variables Var, and ⊕ over binary arithmetic operators. Integer intervals [n, n ] denote a random choice of an integer in the interval. The set of all statements s is denoted by Stm; the set of all expressions e is denoted by Exp.
A program family is evaluated in two stages. First, the C preprocessor CPP takes a program family s and a configuration k ∈ K as inputs, and produces a variant (without #if-s) corresponding to k as the output. Second, the obtained variant is evaluated using the standard single-program semantics. The first stage is specified by the projection function P k , which is an identity for all basic statements and recursively pre-processes all sub-statements of compound statements. Hence, P k (skip) = skip and P k (s;s ) = P k (s);P k (s ). The interesting Fig. 3a, Fig. 3b, Fig. 3c, and Fig. 3d, respectively, are derived from the SIMPLE family defined in Section 2.

Lifted Analysis based on Tuples
Lifted analyses are designed by lifting existing single-program analyses to work on program families, rather than on individual programs. Lifted Domain. The lifted analysis domain is defined as A K ,˙ ,˙ ,˙ ,⊥,˙ , where A K is shorthand for the |K|-fold product k∈K A, that is, there is one separate copy of A for each configuration of K. For example, consider the tuple in Fig. 1.
Lifted Abstract Operations. Given a tuple (lifted domain element) a ∈ A K , the projection π k selects the k th component of a. All abstract lifted operations are defined by lifting the abstract operations of the domain A configuration-wise.
Lifted Transfer Functions. We now define lifted transfer functions for tests, forward assignments (ASSIGN), and #if-s (IFDEF). There are two types of tests: expression-based tests, denoted FILTER, that occur in while-s and ifs, and feature-based tests, denoted FEAT-FILTER, that occur in #if-s. Each lifted transfer function takes as input a tuple from A K representing the invariant before evaluating the statement (resp., expression) to handle, and returns a tuple representing the invariant after evaluating the given statement (resp., expression).
is the lifted transfer function for statement s. FILTER and ASSIGN are defined by applying FILTER A and ASSIGN A independently on each component of the input tuple a. FEAT-FILTER keeps those components k of the input tuple a that satisfy θ, otherwise it replaces the other components with ⊥ A . IFDEF captures the effect of analyzing the statement s in the components k of a that satisfy θ, otherwise it is an identity for the other components.
Lifted Analysis. Lifted abstract operators and transfer functions of the lifted analysis domain A K are combined together to analyze program families. Initially, we build a tuple a in where all components are set to A for the first program location, and tuples where all components are set to ⊥ A for all other locations. The analysis properties are propagated forward from the first program location towards the final location taking assignments, #if-s, and tests into account with join and widening around while-s. The soundness of the lifted analysis based on A K follows immediately from the soundness of all abstract operators and transfer functions of A (proved in [22]).

Numerical Lifted Analysis
The single-program analysis domain A can be instantiated by some of the well-known numerical property domains [24], such as Intervals I, I [7], Octagons O, O [26], and Polyhedra P, P [10]. The elements of I are intervals of the form: ±x ≥ β, where x ∈ Var, β ∈ Z; the elements of O are conjunctions of octagonal constraints of the form ±x 1 ± x 2 ≥ β, where x 1 , x 2 ∈ Var, β ∈ Z; while the elements of P are conjunctions of polyhedral constraints of the form

Lifted Analysis based on Decision Trees
We now introduce a new decision tree lifted domain. Its elements are disjunctions of leaf nodes that belong to an existing single-program domain A defined over program variables Var. The leaf nodes are separated by linear constraints over numerical features, organized in the decision nodes. Hence, we encapsulate the set of configurations K into the decision nodes of a decision tree where each topdown path represents one or several configurations that satisfy the constraints encountered along the given path. We store in each leaf node the property generated from the variants representing the corresponding configurations.
Abstract domain for decision nodes. We define the family of abstract domains for linear constraints C D , which are parameterized by any of the numerical property domains D (intervals I, octagons O, polyhedra P). We use to denote the set of polyhedral constraints. We have The set C D of linear constraints over features F is constructed by the underlying numerical property domain D, D using the Galois connection where P(C D ) is the power set of C D . The abstraction function α C D : P(C D ) → D maps a set of interval (resp., octagon, polyhedral) constraints to an interval (resp., an octagon, polyhedral) that represents a conjunction of constraints; the concretization function γ C D : D → P(C D ) maps an interval (resp., an octagon, a polyhedron) that represents a conjunction of constraints to a set of interval (resp., octagonal, polyhedral) constraints. We have The domain of decision nodes is C D . We assume F = {A 1 , . . . , A k } be a finite and totally ordered set of features, such that the ordering is We impose a total order < C D on C D to be the lexicographic order on the coefficients α 1 , . . . , α k and constant α k+1 of the linear constraints, such that: The negation of linear constraints is formed as: For example, the negation of A − 3 ≥ 0 is the constraint −A + 2 ≥ 0 (i.e., A ≤ 2). To ensure canonical representation of decision trees, a linear constraint c and its negation ¬c cannot both appear as nodes in a decision tree. For example, we only keep the largest constraint with respect to < C D between c and ¬c. For this reason, we define the equivalence relation ≡ C D as c ≡ C D ¬c. We define C D , < C D to denote C D / ≡ , < C D , such that elements of C D are constraints obtained by quotienting by the equivalence ≡ C D .
Abstract domain for constraint-based decision trees. A constraint-based decision tree t ∈ T(C D , A) over the sets C D of linear constraints defined over F and the leaf abstract domain A defined over Var is either a leaf node a with a ∈ A, or [[c : tl, tr]], where c ∈ C D (denoted by t.c) is the smallest constraint with respect to < C D appearing in the tree t, tl (denoted by t.l) is the left subtree of t representing its true branch, and tr (denoted by t.r) is the right subtree of t representing its false branch. The path along a decision tree establishes the set of configurations (those that satisfy the encountered constraints), and the leaf nodes represent the analysis properties for the corresponding configurations. Example 2. The following two constraint-based decision trees t 1 and t 2 have decision nodes labelled with Interval linear constraints over the numeric feature SIZE with domain {1, 2, 3, 4}, whereas leaf nodes are Interval properties: Abstract Operations. The concretization function γ T of a decision tree t ∈ T(C D , A) returns γ A (a) for k ∈ K, where k satisfies the set C ∈ P(C D ) of constraints accumulated along the top-down path to the leaf node a ∈ A. More formally, γ T (t) = γ T [K](t). The function γ T accumulates into a set C ∈ P(C D ) constraints along the paths up to a leaf node, which is initially equal to the set of implicit constraints over F, K = ∨ k∈K k, taking into account domains of features: . Therefore, we can check k |= C using the abstract operation D of the numerical domain D.
Other binary operations of T(C D , A) are based on Algorithm 1 for tree unification, which finds a common refinement (labelling) of two trees t 1 and t 2 by calling function UNIFICATION(t 1 , t 2 , K). It possibly adds new constraints as decision nodes (Lines 5-7, Lines 11-13), or removes constraints that are redundant (Lines 3,4,9,10,15,16). The function UNIFICATION accumulates into the set C ∈ P(C D ) (initialized to K, which represents implicit constraints satisfied by both t 1 and t 2 ), constraints encountered along the paths of the decision tree. This set C is used by the function isRedundant(c, C), which checks whether the linear constraint c ∈ C D is redundant with respect to C by testing α C D (C) D α C D ({c}). Note that the tree unification does not lose any information. Note that UNIFICATION adds a decision node for SIZE ≥ 2 to the right subtree of t 1 , whereas it adds a decision node for SIZE ≥ 4 to t 2 and removes the redundant constraint SIZE ≥ 2 from the resulting left subtree of t 2 .
All binary operations are performed leaf-wise on the unified decision trees. Given two unified decision trees t 1 and t 2 , their ordering and join are defined as:    t1.c, C) then return UNIFICATION(t1.l, t2, C  Transfer functions. The transfer functions for forward assignments (ASSIGN T ) and expression-based tests (FILTER T ) modify only leaf nodes of a constraintbased decision tree. In contrast, transfer functions for variability-specific constructs, such as feature-based tests (FEAT-FILTER T ) and #if-s (IFDEF T ) add, modify, or delete decision nodes of a decision tree. This is due to the fact that the analysis information about program variables is located in leaf nodes, while the information about feature variables is located in decision nodes.
Transfer function ASSIGN T for handling an assignment x:=e in the input tree t is described by Algorithm 2. Note that x ∈ Var, and e ∈ Exp may contain only program variables. We apply ASSIGN A to each leaf node a of t, which substitutes expression e for variable x in a. Similarly, transfer function FILTER T for handling expression-based tests e ∈ Exp is implemented by applying FILTER A leaf-wise.
Transfer function FEAT-FILTER T for feature-based tests θ is described by Algorithm 3. It reasons by induction on the structure of θ (we assume negation is applied to atomic propositions). When θ is an atomic constraint over numerical features (Lines 2,3), we use FILTER D to approximate θ, thus producing a set of constraints J, which are then added to the tree t, possibly discarding all paths of t that do not satisfy θ. This is done by calling function RESTRICT(t, K, J), which Algorithm 3: FEAT-FILTER T (t, θ) adds linear constraints from J to t in ascending order with respect to < C D as shown in Algorithm 4. Note that θ may not be representable exactly in C D (e.g., in the case of non-linear constraints over F), so FILTER D may produce a set of constraints approximating it. When θ is a conjunction (resp., disjunction) of two feature expressions (Lines 4,5) (resp., (Lines 6,7)), the resulting decision trees are merged by operation meet T (resp., join T ). Function RESTRICT(t, C, J), described in Algorithm 4, takes as input a decision tree t, a set C of linear constraints accumulated along paths up to a node, and a set J of linear constraints in canonical form that need to be added to t. For each constraint j ∈ J, there exists a boolean b j that shows whether the tree should be constrained with respect to j or with respect to ¬j. When J is not empty, the linear constraints from J are added to t in ascending order with respect to < C D . At each iteration, the smallest linear constraint j is extracted from J (Line 9), and is handled appropriately based on whether j is smaller (Line 11-15), or greater or equal (Line [17][18][19][20][21] to the constraint at the node of t we currently consider. Finally, transfer function IFDEF T is defined as: where [[s]] T (t) denotes the transfer function in T(C D , A) for statement s. After applying transfer functions, the obtained decision trees may contain some redundancy that can be exploited to further compress them. Function COMPRESS T (t, C), described by Algorithm 5, is applied to decision trees t in order to compress (reduce) their representation. We use five different optimizations. First, if constraints on a path to some leaf are unsatisfiable, we eliminate that leaf node (Lines 9,10). Second, if a decision node contains two same subtrees, then we keep only one subtree and we also eliminate the decision node (Lines [11][12][13]. Third, if a decision node contains a left leaf and a right subtree, such that its left leaf is the same with the left leaf of its right subtree and the constraint in the decision node is less or equal to the constraint in the root of its right subtree, then we can eliminate the decision node and its left leaf (Lines 14,15). A similar rule exists when a decision node has a left subtree and a right leaf (Lines 16,17).

Theorem 1. γ T [[s]] T (t in ) = γ [[s]](a in ) .
Proof. The proof is by induction on the structure of s. We consider the most interesting cases: #if (θ) s #endif. Transfer functions for #if are identical in both lifted domains. We only need to show that FEAT-FILTER(a, θ) and FEAT-FILTER T (t, θ) are identical. This is shown by induction on θ [13].
Example 5. Let us consider the code base of a program family P given in Fig. 4. It contains only one numerical feature SIZE with domain N. The decision tree inferred at the final location 4 is depicted in Fig. 5. It uses the Interval domain for both decision and leaf nodes. Note that the constraint (SIZE < 3) does not explicitly appear in the code base, but we obtain it in the decision tree representation. This shows that partitioning of the configuration space K induced by decision trees is semantics-based rather than syntactic-based.  Example 6. Let us consider the code base of a program family P given in Fig. 6. It contains one numerical feature A with domain [1,4] and a non-linear feature expression A * A < 9. At program location 2 , FEAT-FILTER T ( x = 0 , A * A < 9) returns an over-approximating tree . In effect, we obtain an over-approximating result at the final program location 3 as shown in Fig. 7.
The precise result at the program location 3 , which can be obtained in case we have numerical domains that can handle non-linear constraints, is given in Fig. 8. We observe that when ¬(A ≤ 2), we obtain an over-approximating analysis result (−1 ≤ x ≤ 1 instead of x = −1) due to the over-approximation of the non-linear feature expression in the numerical domains we use.

Evaluation
Implementation We have developed a prototype lifted static analyzer, called SPLNum 2 Analyzer, that uses lifted abstract domains of tuples A K and decision trees T(C D , A). The abstract domains A for encoding properties of tuple components and leaf nodes as well as the abstract domain D for encoding linear constraints over numerical features are based on intervals, octagons, and polyhedra domains. Their abstract operations and transfer functions are provided by the APRON library [19]. Our proof-of-concept implementation is written in OCaml and consists of around 6K lines of code. The current front-end of the tool accepts programs written in a (subset of) C with #if directives, but without struct and union types. It currently provides only a limited support for arrays, pointers, and recursion. The only basic data type is mathematical integers. SPLNum 2 Analyzer automatically infers numerical invariants in all program locations corresponding to all variants in the given family. We use delayed widening and narrowing [7,24] to improve the precision of while-s. , termination-crafted (crafted); as well as from the real-world BusyBox project (https://busybox.net). In the case of SV-COMP, we have first selected some numerical programs with integers, and then we have manually added variability (features and #if directives) in each of them. In the case of BusyBox, we have first selected some programs with numerical features, and then we have simplified those programs so that our tool can handle them. For example, any reference to a pointer or a library function is replaced with [−∞, +∞]. Table 1 presents characteristics of the benchmarks. We Performance Results Table 1 shows the results of analyzing our benchmark files by using different versions of our lifted static analyses based on decision trees and on tuples. For each version of decision tree-based lifted analysis, there are two columns. In the first column, Time, we report the running time in seconds to analyze the given benchmark using the corresponding version of lifted analysis based on decision trees. In the second column, Impr., we report the speed up factor for each version of lifted analysis based on decision trees relative to the corresponding baseline lifted analysis based on tuples (A T (I) vs. A Π (I), A T (O) vs. A Π (O), and A T (P ) vs. A Π (P )). The performance results confirm that sharing is indeed effective and especially so for large values of |K|. On our benchmarks, it translates to speed ups (i.e., (A T (−) vs. A Π (−)) that range from 1.1 to 4.6 times when |K| < 100, and from 3.7 to 32 times when |K| > 100.

Experimental setup and Benchmarks
Computational tractability The tuple-based lifted analysis A Π (−) may become very slow or even infeasible for very large configuration spaces |K|. We have tested the limits of A Π (P ) and A T (−). We took a method, test k n (), which contains n numerical features A 1 , . . . , A n , such that each numerical feature [i = 2] , Fig. 9: A Π (P ) results at 4 of test 3 2 (). Subject to the chosen configuration, the variable i in location 4 can have a value in the range from value 2 when A 1 and A 2 are assigned to 0, to value 0 when A 2 ≥ 1. The analysis results in location 4 of test 3 2 () obtained using A Π (P ) and A T (P ) are shown in Fig. 9 and Fig. 10, respectively. A Π (P ) uses tuples with 9 interval properties (components), while A T (P ) uses 3 interval properties (leafs). We have generated methods test k n () by gradually increasing variability. In general, the size of tuples used by A Π (P ) is k n , whereas the number of leaf nodes in decision trees used by A T (P ) in the final program location is n + 1. The performance results of analyzing test k n , for different values of n and k, using A Π (P ) and A T (P ) are shown in Table 2. In the columns Impr., we report the speed-up of A T (P ) with respect to A Π (P ). We observe that A T (P ) yields decision trees that provide quite compact and symbolic representation of lifted analysis results. Since the configurations with equivalent analysis results are nicely encoded using linear constraints in decision nodes, the performance of A T (P ) does not depend on k, but only depends on n. On the other hand, the performance of A Π (P ) heavily depends on k. Thus, within a timeout limit of 300 seconds, the analysis A Π (P ) fails to terminate for test 3 11 , test 5 8 , and test 7 6 . In summary, we can conclude that decision trees A T (P ) can not only greatly speed up lifted analyses, but also turn previously infeasible analyses into feasible.

Related Work
Decision-tree abstract domains have been successfully used in the field of abstract interpretation recently [18,9,4,26]. Decision trees have been applied for the disjunctive refinement of Interval domain [18]. That is, each element of the new domain is a propositional formula over interval linear constraints. Segmented decision tree abstract domains has also been defined [9,4] to enable path dependent static analysis. Their elements contain decision nodes that are determined either by values of program variables [9] or by the branch (if) conditions [4], whereas the leaf nodes are numerical properties. Urban and Mine [26] use decision tree-based abstract domains to prove program termination. Decision nodes are labelled with linear constraints that split the memory space and leaf nodes contain affine ranking functions for proving program termination.
Recently, two main styles of static analysis have been a topic of considerable research in the SPL community: a dataflow analysis from the monotone framework developed by Kildall [21] that is algorithmically defined on syntactic CFGs, and an abstract interpretation-based static analysis developed by Cousot and Cousot [7] that is more general and semantically defined. Brabrand et. al. [3] lift a dataflow analysis from the monotone framework, resulting in a tuple-based lifted dataflow analysis. Another efficient implementation of the lifted dataflow analysis from the monotone framework is based on using variational data structures [27]. Midtgaard et. al. [22] have proposed a formal methodology for systematic derivation of tuplebased lifted static analyses in the abstract interpretation framework. A more efficient lifted static analysis by abstract interpretation obtained by improving representation via BDD domains is given in [11]. Another approach to speed up lifted analyses is by using so-called variability abstractions [14,15], which are used to derive abstract lifted analyses. They tame the combinatorial explosion of the number of configurations and reduce it to something more tractable by manipulating the configuration space. The work [5] presents a model checking technique to analyze probabilistic program families.

Conclusion
In this work we employ decision trees and widely-known numerical abstract domains for automatic inference of invariants in all locations of C program families that contain numerical features. In future, we would like to extend the lifted abstract domain to also support non-linear constraints [17]. An interesting direction for future work would be to explore possibilities of applying variability abstractions [14] as yet another way to speed up lifted analyses. We can also define a backward lifted analysis in combination with a preliminary forward lifted analysis to infer the necessary preconditions in order a given assertion to be satisfied or violated. The obtained preconditions in the form of linear constraints can be analyzed using model counting techniques to quantify how likely is an input or a variant to satisfy them [16,12].