Automata-Based Model Counting for String Constraints
Abstract
Most common vulnerabilities in Web applications are due to string manipulation errors in input validation and sanitization code. String constraint solvers are essential components of program analysis techniques for detecting and repairing vulnerabilities that are due to string manipulation errors. For quantitative and probabilistic program analyses, checking the satisfiability of a constraint is not sufficient, and it is necessary to count the number of solutions. In this paper, we present a constraint solver that, given a string constraint, (1) constructs an automaton that accepts all solutions that satisfy the constraint, and (2) generates a function that, given a length bound, gives the total number of solutions within that bound. Our approach relies on the observation that, using an automata-based constraint representation, model counting reduces to path counting, which can be solved precisely. We demonstrate the effectiveness of our approach on a large set of string constraints extracted from real-world web applications.
Keywords
Regular Expression, Model Counting, Symbolic Execution, String Constraint, Relational Constraint
1 Introduction
Since many computer security vulnerabilities are due to errors in string-manipulating code, string analysis has become an active research area in the last decade [3, 9, 12, 17, 31, 36, 38, 39]. Symbolic execution is a well-known automated bug detection technique which has been applied to vulnerability detection [28]. In order to apply symbolic execution to the analysis of string-manipulating programs, it is necessary to check the satisfiability of string constraints [6]. Several string constraint solvers have been proposed in recent years to address this problem [1, 18, 19, 21, 23, 24, 32, 40].
There are two recent research directions that aim to extend symbolic execution beyond assertion checking. One is quantitative information flow, where the goal is to determine how much secret information is leaked from a given program [10, 26, 27, 29]; the other is probabilistic symbolic execution, where the goal is to compute the probability of the success and failure paths in order to establish the reliability of the given program [7, 13]. Interestingly, both of these approaches require the same basic extension to constraint solving: a model-counting constraint solver that not only determines whether a constraint is satisfiable, but also computes the number of satisfying instances.
In this paper, we present an automata-based model-counting technique for string constraints that consists of two main steps: (1) Given a string constraint and a variable, we construct an automaton that accepts all the string values for that variable for which the string constraint is satisfiable. (2) Given an automaton, we generate a function that takes a length bound as input and returns the total number of strings accepted by the automaton whose length is less than or equal to the given bound.
Our constraint language can handle regular language membership queries, word equations that involve concatenation and replacement, and arithmetic constraints on string lengths. For a class of constraints that we call pseudo-relational, our approach gives the precise model count. For constraints that are not in this class, our approach computes an upper bound. We implemented a tool called Automata-Based model Counter for string constraints (ABC) using the approach we present in this paper. Our experiments demonstrate that \(\textsc {ABC}\) is effective and efficient when applied to thousands of string constraints extracted from real-world web applications.
Related Work: Our inspiration for this work was the recently proposed model-counting string constraint solver SMC [25]. Similar to SMC, we also utilize generating functions in model counting. However, due to some significant differences in how we utilize generating functions, our approach is strictly more precise than the approach used in SMC. For example, SMC cannot determine the precise model count for a regular expression constraint such as \(x \in (a \mid b)^* \mid ab\), whereas our approach is precise for all regular expressions. More importantly, SMC cannot propagate string values across logical connectives, which reduces its precision. For example, for a simple constraint such as \((x \in a \mid b) \ \vee \ (x \in a \mid b \mid c \mid d)\), SMC will generate a model-count range which consists of an upper bound of 6 and a lower bound of 2, whereas our approach will generate the exact count, which is 4. Moreover, SMC always generates a lower bound of 0 for conjunctions that involve the same variable. So, the range generated for \((x \in a \mid b) \ \wedge \ (x \in a \mid b \mid c \mid d)\) would be 0 to 2, whereas our approach generates the exact count, which is 2. The set of constraints we handle is also larger than the set that SMC can handle. In particular, we can handle constraints with replace operations, which are common in server-side input sanitization code.
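The two exact counts quoted in this comparison can be checked by brute force. The following is a minimal sketch that treats each finite regular language as an explicit Python set; it only illustrates why disjunction and conjunction yield 4 and 2, and it is not how ABC or SMC represent constraints internally. The names `lang_ab` and `lang_abcd` are introduced here for illustration.

```python
# Finite languages of the two regular expressions in the example.
lang_ab = {"a", "b"}                 # x in a|b
lang_abcd = {"a", "b", "c", "d"}     # x in a|b|c|d

# A disjunction of constraints on the same variable corresponds to set union,
# a conjunction corresponds to set intersection.
disjunction_count = len(lang_ab | lang_abcd)
conjunction_count = len(lang_ab & lang_abcd)

print(disjunction_count, conjunction_count)
```

Because the solution sets are combined before counting, no range is needed: the union and intersection are counted directly.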
There has been a significant amount of work on string constraint solving in recent years [1, 15, 18, 19, 21, 23, 24, 28, 32, 40]. Some of these constraint solvers bound the string length [21, 23, 28], whereas our approach handles strings of arbitrary length. None of these string constraint solvers provide model-counting functionality. Our model-counting constraint solver, ABC, builds on the automata-based string analysis tool Stranger [36, 38, 39], which was determined to be the best in terms of precision and efficiency in a recent empirical study for evaluating string constraint solvers for symbolic execution of Java programs [20]. In addition to checking satisfiability, \(\textsc {ABC}\) also generates an automaton that accepts all possible solutions and provides model-counting capability. To the best of our knowledge, \(\textsc {ABC}\) is the only tool that supports all of these. In addition to enabling quantitative and probabilistic analysis by model counting, our constraint solver also enables automated program repair synthesis by generating a characterization of all solutions [2, 37].
2 Automata Construction for String Constraints
In this section, we discuss how to construct automata for string constraints. Given a constraint and a variable, our goal is to construct an automaton that accepts exactly those strings that, when assigned as the value of the variable in the given constraint, result in a satisfiable constraint.
2.1 String Constraint Language

\(\textsc {contains}(v, s) \Leftrightarrow \exists s_1, s_2 \in \varSigma ^* : v = s_1 s s_2\)

\(\textsc {begins}(v, s) \Leftrightarrow \exists s_1 \in \varSigma ^* : v = s s_1\)

\(\textsc {ends}(v, s) \Leftrightarrow \exists s_1 \in \varSigma ^* : v = s_1 s\)

\(n = \textsc {indexof}(v, s) \Leftrightarrow (\textsc {contains}(v, s) \ \wedge \ (\exists s_1, s_2 \in \varSigma ^*: \textsc {len}(s_1) = n \ \wedge \ v = s_1 s s_2) \ \wedge \ (\forall i < n : \lnot (\exists s_1, s_2 \in \varSigma ^*: \textsc {len}(s_1) = i \ \wedge \ v = s_1 s s_2))) \ \vee \ (\lnot \textsc {contains}(v, s) \ \wedge \ n = -1)\)

\(v = \textsc {replace}(v', s_1, s_2) \Leftrightarrow (\exists s_3, s_4, s_5 \in \varSigma ^* : v' = s_3 s_1 s_4 \ \wedge \ v = s_3 s_2 s_5 \ \wedge \ s_5 = \textsc {replace}(s_4, s_1, s_2) \ \wedge \ (\forall s_6, s_7 \in \varSigma ^* : v'=s_6 s_1 s_7 \Rightarrow \textsc {len}(s_6) \ge \textsc {len}(s_3))) \ \vee \ (\lnot \textsc {contains}(v', s_1) \ \wedge \ v = v')\)
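The semantics above can be read as executable specifications on concrete strings. The sketch below is an illustration under two assumptions: that Python's built-in string methods match these definitions, and that the match string passed to `replace_spec` is non-empty. The function names are chosen here for the example and are not taken from any tool.

```python
def contains(v: str, s: str) -> bool:
    # v = s1 s s2 for some s1, s2
    return s in v

def begins(v: str, s: str) -> bool:
    # v = s s1 for some s1
    return v.startswith(s)

def ends(v: str, s: str) -> bool:
    # v = s1 s for some s1
    return v.endswith(s)

def indexof(v: str, s: str) -> int:
    # Smallest offset at which s occurs in v, or -1 if s does not occur.
    return v.find(s)

def replace_spec(v: str, s1: str, s2: str) -> str:
    # Recursive reading of the definition: rewrite the leftmost occurrence
    # of s1, then continue on the remainder of the string.
    if not s1:
        raise ValueError("this sketch assumes a non-empty match string")
    i = v.find(s1)
    if i < 0:
        return v                      # no occurrence: v is unchanged
    return v[:i] + s2 + replace_spec(v[i + len(s1):], s1, s2)

print(indexof("hello", "l"), indexof("hello", "z"))
print(replace_spec("aXbXc", "X", "--"))
```

On these inputs the recursive definition agrees with Python's `str.replace`, which also rewrites non-overlapping occurrences from left to right.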
Given a constraint F, let \(V_F\) denote the set of variables that appear in F. Let F[s / v] denote the constraint that is obtained from F by replacing all appearances of \(v \in V_F\) with the string constant s. We define the truth set of the formula F for variable v as \([\![F,v ]\!]= \{ s \mid F[s/v] \ \text{is satisfiable}\}\).
We identify three classes of constraints: (1) Single-variable constraints are constructed using at most one string variable (i.e., \(V_F = \{ v\}\) or \(V_F = \emptyset \)); they do not contain constraints of type (4), (6), and (11), and have a single variable on the left-hand side of constraints of type (3). (2) Pseudo-relational constraints are a set of constraints that we define in the next section, for which the truth sets are regular (i.e., each \([\![F,v ]\!]\) is a regular set). (3) Relational constraints are the constraints that are not pseudo-relational constraints (truth sets of relational constraints can be non-regular).
2.2 Mapping Constraints to Automata
A Deterministic Finite Automaton (DFA) A is a 5-tuple \((Q, \varSigma , \delta , q_0, F)\), where \(Q = \{1,2,\ldots ,n\}\) is the set of n states, \(\varSigma \) is the input alphabet, \(\delta \subseteq Q \times Q \times \varSigma \) is the state transition relation set, \(q_0 \in Q\) is the initial state, and \(F \subseteq Q\) is the set of final, or accepting, states.
Given an automaton A, let \(\mathcal{L}(A)\) denote the set of strings accepted by A. Given a constraint F and a variable v, our goal is to construct an automaton A, such that \(\mathcal{L}(A) = [\![F,v ]\!]\).
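As a concrete reading of the 5-tuple, the following minimal sketch stores a DFA explicitly and checks membership in \(\mathcal{L}(A)\). The class name `DFA` and the dictionary encoding of \(\delta\) are assumptions made for this example; ABC itself uses MONA's symbolic representation rather than an explicit table.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # Q
    alphabet: set      # Sigma
    delta: dict        # maps (state, symbol) to the next state
    initial: int       # q0
    accepting: set     # F

    def accepts(self, word: str) -> bool:
        q = self.initial
        for symbol in word:
            q = self.delta.get((q, symbol))
            if q is None:          # missing transition: reject
                return False
        return q in self.accepting

# DFA for contains(v, "ab") over the alphabet {a, b}, i.e. Sigma* ab Sigma*.
contains_ab = DFA(
    states={1, 2, 3},
    alphabet={"a", "b"},
    delta={(1, "a"): 2, (1, "b"): 1,
           (2, "a"): 2, (2, "b"): 3,
           (3, "a"): 3, (3, "b"): 3},
    initial=1,
    accepting={3},
)

print(contains_ab.accepts("aab"), contains_ab.accepts("ba"))
```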
Automata Construction for Single-Variable Constraints: Let us define an automata constructor function \(\mathcal{A}\) such that, given a formula F and a variable v, \(\mathcal{A}(F,v)\) is an automaton where \(\mathcal{L}(\mathcal{A}(F,v)) = [\![F,v ]\!]\). In this section we discuss how to implement the automata constructor function \(\mathcal{A}\).

case \(V_F = \emptyset \) (i.e., there are no variables in F): Evaluate the constraint F. If \(F \equiv \mathbf{true} \) then \(\mathcal{A}(F,v) =\mathcal{A}(\varSigma ^*)\), otherwise \(\mathcal{A}(F,v) =\mathcal{A}(\emptyset )\).

case \(F \equiv \lnot F_1\): \(\mathcal{A}(F,v)\) is constructed using \(\mathcal{A}(F_1,v)\) and it is an automaton that accepts the complement language \(\varSigma ^* - \mathcal{L}(\mathcal{A}(F_1,v))\).

case \(F \equiv F_1 \ \wedge \ F_2\) or \(F \equiv F_1 \ \vee \ F_2\): \(\mathcal{A}(F,v)\) is constructed from \(\mathcal{A}(F_1,v)\) and \(\mathcal{A}(F_2,v)\) using automata product, and it accepts the language \(\mathcal{L}(\mathcal{A}(F_1,v)) \cap \mathcal{L}(\mathcal{A}(F_2,v))\) or \(\mathcal{L}(\mathcal{A}(F_1,v)) \cup \mathcal{L}(\mathcal{A}(F_2,v))\), respectively.

case \(F \equiv v \in R\): \(\mathcal{A}(F,v)\) is constructed using a regular-expression-to-automaton conversion algorithm and accepts all strings that match the regular expression R.

case \(F \equiv v = s\): \(\mathcal{A}(F,v) = \mathcal{A}(s)\).

case \(F \equiv \textsc {len}(v) = n\): \(\mathcal{A}(F,v) = \mathcal{A}(\varSigma ^n)\).

case \(F \equiv \textsc {len}(v) < n\): \(\mathcal{A}(F,v)\) is an automaton that accepts the language \( \{ \varepsilon \} \cup \varSigma ^1 \cup \varSigma ^2 \cup \ldots \cup \varSigma ^{n-1}\).

case \(F \equiv \textsc {len}(v) > n\): \(\mathcal{A}(F,v)\) is constructed using \(\mathcal{A}(\varSigma ^{n+1})\) and \(\mathcal{A}(\varSigma ^*)\) and then constructing an automaton that accepts the concatenation of those languages, i.e., \(\varSigma ^{n+1} \varSigma ^{*}\).

case \(F \equiv \textsc {contains}(v,s)\): \(\mathcal{A}(F,v)\) is an automaton that is constructed using \(\mathcal{A}(\varSigma ^*)\) and \(\mathcal{A}(s)\) and it accepts the language \(\varSigma ^* s \varSigma ^*\).

case \(F \equiv \textsc {begins}(v,s)\): \(\mathcal{A}(F,v)\) is constructed using \(\mathcal{A}(\varSigma ^*)\) and \(\mathcal{A}(s)\), and it accepts the language \(s \varSigma ^*\).

case \(F \equiv \textsc {ends}(v,s)\): \(\mathcal{A}(F,v)\) is constructed using \(\mathcal{A}(\varSigma ^*)\) and \(\mathcal{A}(s)\), and it accepts the language \(\varSigma ^* s\).

case \(F \equiv n = \textsc {indexof}(v,s)\): Let \(L_i\) denote the language \(\varSigma ^i s \varSigma ^*\). Automata that accept the languages \(L_i\) can be constructed using \(\mathcal{A}(\varSigma ^i)\), \(\mathcal{A}(s)\), and \(\mathcal{A}(\varSigma ^*)\). Then \(\mathcal{A}(F,v)\) is the automaton that accepts the language \(\varSigma ^n s \varSigma ^* - (L_0 \cup L_1 \cup \ldots \cup L_{n-1})\), i.e., the strings in which s occurs at offset n but at no smaller offset. It can be constructed using \(\mathcal{A}(\varSigma ^n)\), \(\mathcal{A}(s)\), \(\mathcal{A}(\varSigma ^*)\), and the automata that accept \(L_i\).
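The negation, conjunction, and disjunction cases rest on complement and product of DFAs. The sketch below shows the textbook constructions on explicit, complete DFAs over \(\{a, b\}\); the dictionary encoding and the helper names (`make_dfa`, `product_dfa`, `complement`) are assumptions made for illustration, and MONA's symbolic implementation differs.

```python
from itertools import product

def make_dfa(alphabet, delta, initial, accepting):
    # delta maps (state, symbol) to a state and is assumed to be total.
    states = {initial} | {q for (q, _) in delta} | set(delta.values())
    return {"alphabet": alphabet, "states": states, "delta": delta,
            "initial": initial, "accepting": set(accepting)}

def accepts(dfa, word):
    q = dfa["initial"]
    for symbol in word:
        q = dfa["delta"][(q, symbol)]
    return q in dfa["accepting"]

def complement(dfa):
    # For a complete DFA, swapping accepting and non-accepting states
    # yields Sigma* minus the accepted language.
    out = dict(dfa)
    out["accepting"] = dfa["states"] - dfa["accepting"]
    return out

def product_dfa(a, b, combine):
    # Run both automata in lockstep; `combine` decides acceptance
    # (logical "and" gives intersection, logical "or" gives union).
    alphabet = a["alphabet"]
    delta = {((p, q), c): (a["delta"][(p, c)], b["delta"][(q, c)])
             for p in a["states"] for q in b["states"] for c in alphabet}
    accepting = {(p, q) for p in a["states"] for q in b["states"]
                 if combine(p in a["accepting"], q in b["accepting"])}
    return make_dfa(alphabet, delta, (a["initial"], b["initial"]), accepting)

SIGMA = "ab"
# begins(v, "a"): language a Sigma*
begins_a = make_dfa(SIGMA, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1,
                            (1, "b"): 1, (2, "a"): 2, (2, "b"): 2}, 0, {1})
# ends(v, "b"): language Sigma* b
ends_b = make_dfa(SIGMA, {(0, "a"): 0, (0, "b"): 1,
                          (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = product_dfa(begins_a, ends_b, lambda x, y: x and y)
either = product_dfa(begins_a, ends_b, lambda x, y: x or y)
not_begins_a = complement(begins_a)

# All words over {a, b} of length 0..4, used for a brute-force comparison.
words = ["".join(w) for n in range(5) for w in product(SIGMA, repeat=n)]
print(sum(accepts(both, w) for w in words))
```

The product keeps every pair of states, including unreachable ones; a practical implementation would explore only reachable pairs and minimize the result.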
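We assume that the constraint F is converted to disjunctive normal form (DNF), where \(F \equiv \mathop {\vee }\nolimits _{i=1}^n F_i\), \(F_i \equiv \mathop {\wedge }\nolimits _{j=1}^m C_{ij}\), and each \(C_{ij}\) is either a basic constraint or negation of a basic constraint. The constraint F is pseudo-relational if each \(F_i\) is pseudo-relational. A conjunction \(F \equiv C_1 \ \wedge \ C_2 \ \wedge \ \ldots \ \wedge \ C_n\) is pseudo-relational if it satisfies the following conditions: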
1. Each variable \(v \in V_F\) appears in each \(C_i\) at most once.
2. There is only one variable, \(v \in V_F\), that appears in more than one constraint \(C_i\) where \(v \in V_{C_i} \ \wedge \ |V_{C_i}|>1\), and in each \(C_i\) that v appears in, v is on the left-hand side of the constraint. We call v the projection variable.
3. For all variables \(v' \in V_F\) other than the projection variable, there is a single constraint \(C_i\) where \(v' \in V_{C_i} \ \wedge \ |V_{C_i}|>1\) and the projection variable v appears in \(C_i\), i.e., \(v \in V_{C_i}\).
4. For all constraints \(C_i\) where \(|V_{C_i}|>1\), \(C_i\) is not negated in the formula F.
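One possible reading of conditions 1 to 4 as a syntactic check on a conjunction is sketched below. The `Constraint` record, the function name `find_projection_variable`, and the decisions to treat a purely single-variable conjunction as trivially acceptable and to read "a single constraint" as "at most one" are assumptions of this sketch, not part of the definition above.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence, Tuple

class Constraint(NamedTuple):
    lhs: Optional[str]             # variable on the left-hand side, if any
    occurrences: Tuple[str, ...]   # every variable occurrence, lhs included
    negated: bool = False

def find_projection_variable(conjunction: Sequence[Constraint]):
    """Return (is_pseudo_relational, projection_variable_or_None)."""
    # Condition 1: no variable occurs twice inside one constraint.
    if any(len(set(c.occurrences)) != len(c.occurrences) for c in conjunction):
        return (False, None)
    multi = [c for c in conjunction if len(c.occurrences) > 1]
    if not multi:
        return (True, None)        # assumption: single-variable case passes
    # Condition 4: multi-variable constraints must not be negated.
    if any(c.negated for c in multi):
        return (False, None)
    # Conditions 2 and 3: exactly one variable may be shared among the
    # multi-variable constraints, and it must be their left-hand side.
    counts = Counter(v for c in multi for v in c.occurrences)
    shared = [v for v, k in counts.items() if k > 1]
    if len(multi) == 1:
        candidate = multi[0].lhs
    elif len(shared) == 1:
        candidate = shared[0]
    else:
        return (False, None)
    ok = candidate is not None and all(
        c.lhs == candidate and candidate in c.occurrences for c in multi)
    return (True, candidate) if ok else (False, None)

# x = y . z  and  len(x) > len(w)  and  y in R  and  not (z = "a")
ok_example = [Constraint("x", ("x", "y", "z")), Constraint("x", ("x", "w")),
              Constraint(None, ("y",)), Constraint(None, ("z",), negated=True)]
# x = y . z  and  y = x . w  (two variables shared, cyclic dependency)
cyclic_example = [Constraint("x", ("x", "y", "z")),
                  Constraint("y", ("y", "x", "w"))]

print(find_projection_variable(ok_example))
print(find_projection_variable(cyclic_example))
```

In the first example x is the projection variable; the second example violates condition 2 because both x and y appear in two multi-variable constraints.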
Many string constraints extracted from programs via symbolic execution are pseudo-relational constraints, or can be converted to pseudo-relational constraints. The projection variable represents either the variable that holds the value of the user’s input to the program (for example, user input to a web application that needs to be validated), or the value of the string expression at a program sink. A program sink is a program point (such as a security-sensitive function) for which it is necessary to compute the set of values that reach that program point in order to check for vulnerabilities.
In order to construct the automaton \(\mathcal{A}(F,v)\) we first construct the automata \(\mathcal{A}(F_i, v)\) for each \(F_i\) where \(\mathcal{A}(F_i, v)\) accepts the language \([\![F_i,v ]\!]\). Then we combine the \(\mathcal{A}(F_i, v)\) automata using automata product such that \(\mathcal{A}(F,v)\) accepts the language \([\![F_1,v ]\!]\cup [\![F_2,v ]\!]\cup \ldots \cup [\![F_n,v ]\!]\).
Since we discussed how to handle disjunction, from now on we focus on constraints of the form \(F \equiv C_1 \ \wedge \ C_2 \ \wedge \ \ldots \ \wedge \ C_n\) where each \(C_i\) is either a basic constraint or negation of a basic constraint. For each \(C_i\), let \(V_{C_i}\) denote the set of variables that appear in \(C_i\). If \(V_{C_i}\) is a singleton set, then we refer to the variable in it as \(v_{C_i}\).
First, for each single-variable constraint \(C_i\) that is not negated, we construct an automaton that accepts the truth set of the constraint \(C_i\), \([\![C_i, v_{C_i} ]\!]\), using the techniques we discussed above for single-variable constraints. If \(C_i\) is negated, then we construct the automaton that accepts the complement language \(\varSigma ^* - [\![C_i, v_{C_i} ]\!]\) (note that only single-variable constraints can be negated in pseudo-relational constraints). Let us call these automata \(\mathcal{A}(C_i, v_{C_i})\) (some of which may correspond to negated constraints).
Then, for any variable \(v' \in V_F\) that is not the projection variable, we construct an automaton \(\mathcal{A}(F, v')\) which accepts the intersection of the languages \(\mathcal{L}(\mathcal{A}(C_i, v'))\) for all single-variable constraints that \(v'\) appears in, i.e., \(\mathcal{L}(\mathcal{A}(F, v')) = \bigcap _{V_{C_i} = \{v'\}} \mathcal{L}(\mathcal{A}(C_i, v'))\). Next, for each multi-variable constraint \(C_i\), we construct an automaton \(\mathcal{A}(C_i,v)\) for the projection variable v according to the following cases:

case \(C_i \equiv v = v'\): \(\mathcal{A}(C_i,v) = \mathcal{A}(F,v')\).

case \(C_i \equiv v = v_1 \ . \ v_2\): \(\mathcal{A}(C_i,v)\) is constructed using the automata \(\mathcal{A}(F,v_1)\) and \(\mathcal{A}(F,v_2)\) and it accepts the concatenation of the languages \(\mathcal{L}(\mathcal{A}(F, v_1))\) and \(\mathcal{L}(\mathcal{A}(F, v_2))\).

case \(C_i \equiv \textsc {len}(v) = \textsc {len}(v')\): Given the automaton \(\mathcal{A}(F,v')\), we construct an automaton \(A_{\textsc {len}(F,v')}\) such that \(s \in \mathcal{L}(A_{\textsc {len}(F,v')}) \Leftrightarrow \exists s' : \textsc {len}(s) = \textsc {len}(s') \ \wedge \ s' \in \mathcal{L}( \mathcal{A}(F,v'))\). Then, \(\mathcal{A}(C_i,v) = A_{\textsc {len}(F,v')}\).

case \(C_i \equiv \textsc {len}(v) < \textsc {len}(v')\): Given the automaton \(\mathcal{A}(F,v')\) we find the maximum length of the words accepted by \(\mathcal{A}(F,v')\), which is infinite if \(\mathcal{A}(F,v')\) has a loop that can reach an accepting state. If it is infinite then \(\mathcal{A}(C_i,v) = \mathcal{A}(\varSigma ^*)\). If not, then given the maximum length m, \(\mathcal{A}(C_i,v)\) is the automaton that accepts the language \(\{ \varepsilon \} \cup \varSigma ^1 \cup \varSigma ^2 \cup \ldots \cup \varSigma ^{m-1}\). Note that if \(m=0\) then \(\mathcal{A}(C_i,v) = \mathcal{A}(\emptyset )\).

case \(C_i \equiv \textsc {len}(v) > \textsc {len}(v')\): Given the automaton \(\mathcal{A}(F,v')\) we find the length of the minimum word accepted by \(\mathcal{A}(F,v')\). Given the minimum length m, \(\mathcal{A}(C_i,v)\) is the automaton that accepts the concatenation of the languages accepted by \(\mathcal{A}(\varSigma ^{m+1})\) and \(\mathcal{A}(\varSigma ^*)\), i.e., \(\varSigma ^{m+1} \varSigma ^{*}\).

case \(C_i \equiv v = \textsc {replace}(v', s, s)\): Given the automaton \(\mathcal{A}(F,v')\) we use the construction presented in [38, 39] for language based replacement to construct the automaton \(\mathcal{A}(C_i,v)\).
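The two length-comparison cases above need the shortest and the longest accepted word length of \(\mathcal{A}(F,v')\). The following is a minimal sketch on an explicit transition table; the function names are illustrative. The longest length is reported as infinite when a cycle lies on some path from the initial state to an accepting state, so loops confined to a non-accepting sink are ignored.

```python
import math
from collections import deque

def successors(delta):
    succ = {}
    for (q, _symbol), r in delta.items():
        succ.setdefault(q, set()).add(r)
    return succ

def closure(start, edges):
    # All states reachable from `start` by following `edges`.
    seen, stack = set(start), list(start)
    while stack:
        q = stack.pop()
        for r in edges.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def min_word_length(delta, initial, accepting):
    # Breadth-first search: the first accepting state found is the closest.
    succ, dist, queue = successors(delta), {initial: 0}, deque([initial])
    while queue:
        q = queue.popleft()
        if q in accepting:
            return dist[q]
        for r in succ.get(q, ()):
            if r not in dist:
                dist[r] = dist[q] + 1
                queue.append(r)
    return None                      # empty language

def max_word_length(delta, initial, accepting):
    succ = successors(delta)
    pred = {}
    for q, targets in succ.items():
        for r in targets:
            pred.setdefault(r, set()).add(q)
    # Keep only states on some path from the initial to an accepting state.
    useful = closure({initial}, succ) & closure(set(accepting), pred)
    if initial not in useful:
        return None                  # empty language
    longest, on_path = {}, set()

    def visit(q):
        if q in on_path:             # back edge: a useful cycle exists
            return math.inf
        if q in longest:
            return longest[q]
        on_path.add(q)
        best = 0 if q in accepting else -math.inf
        for r in succ.get(q, ()):
            if r in useful:
                best = max(best, 1 + visit(r))
        on_path.discard(q)
        longest[q] = best
        return best

    return visit(initial)

finite_delta = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}   # accepts ab, abc
loop_delta = {(0, "a"): 1, (1, "b"): 1}                  # accepts a b*
print(min_word_length(finite_delta, 0, {2, 3}), max_word_length(finite_delta, 0, {2, 3}))
print(max_word_length(loop_delta, 0, {1}))
```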
The final step of the construction is to construct \(\mathcal{A}(F,v)\) using the automata \(\mathcal{A}(C_i,v)\) where \(\mathcal{L}(\mathcal{A}(F, v)) = \bigcap _{v \in V_{C_i}} \mathcal{L}(\mathcal{A}(C_i, v))\).
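The overall projection can be illustrated on finite languages, using Python sets as a stand-in for automata. The example constraint \(v = v_1 \ . \ v_2 \ \wedge \ v_1 \in \{a, ab\} \ \wedge \ v_2 \in \{b, c\} \ \wedge \ \textsc {len}(v) < 3\) and all names below are hypothetical choices made for this sketch.

```python
from itertools import product

# Languages of the non-projection variables, obtained from their
# single-variable constraints.
lang_v1 = {"a", "ab"}
lang_v2 = {"b", "c"}

def concat(l1, l2):
    # Language concatenation on finite sets.
    return {x + y for x, y in product(l1, l2)}

# Multi-variable constraint v = v1 . v2, projected onto v.
lang_concat = concat(lang_v1, lang_v2)

# Single-variable constraint on the projection variable: len(v) < 3.
def len_lt_3(s):
    return len(s) < 3

# Final step: intersect all languages computed for v.
lang_v = {s for s in lang_concat if len_lt_3(s)}

def truth_set_bruteforce(alphabet="abc", max_len=4):
    # Direct check of the definition: s is in the truth set if some
    # assignment to v1 and v2 satisfies the whole conjunction.
    result = set()
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            s = "".join(tup)
            if len_lt_3(s) and any(s == x + y for x in lang_v1 for y in lang_v2):
                result.add(s)
    return result

print(sorted(lang_v))
```

Here the projected language coincides with the brute-force truth set, which is the behaviour claimed for pseudo-relational constraints.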
For pseudo-relational constraints, the automaton \(\mathcal{A}(F, v)\) constructed as described above accepts the truth set of the formula F for the projection variable, i.e., \(\mathcal{L}(\mathcal{A}(F, v)) = [\![F, v ]\!]\). However, the replace function has different variations in different programming languages (such as first-match versus longest-match replace) and the match pattern can be given as a regular expression. The language-based replace automata construction we use [38, 39] over-approximates the replace operation in some cases, which would then result in an over-approximation of the truth set: \(\mathcal{L}(\mathcal{A}(F, v)) \supseteq [\![F, v ]\!]\).
Automata Construction for Relational Constraints: For constraints that are not pseudo-relational, we extend the above algorithm to compute an over-approximation of \([\![F, v ]\!]\). In relational constraints, more than one variable can be involved in multi-variable constraints, which creates a cycle in constraint evaluation.
In order to improve the efficiency of the above algorithm, we first build a constraint dependency graph where (1) a multi-variable constraint \(C_i\) depends on a single-variable constraint \(C_j\) if \(V_{C_j} \subseteq V_{C_i}\), and (2) a multi-variable constraint \(C_i\) depends on a multi-variable constraint \(C_j\) if \(V_{C_j} \cap V_{C_i} \ne \emptyset \). We traverse the constraints based on their ordering in the dependency graph and iteratively refine the automata in case of cyclic dependencies. Note that in the constructions we described above, we only constructed an automaton for the variable on the left-hand side of a relational constraint using the automata for the variables on the right-hand side of the constraint. In the general case we need to construct automata for variables on the right-hand side of the relational constraints too. We do this using techniques similar to the ones we described above. Constructing automata for the right-hand-side variables is equivalent to the pre-image computations used during backward symbolic analysis as discussed in [35], and we use the constructions given there. Finally, unlike pseudo-relational constraints, a relational constraint can contain negation of a basic constraint \(C_i\) where \(|V_{C_i}|>1\). In such cases, in constructing the truth set of \(\lnot C_i\) we can use the complement language \(\varSigma ^* - [\![C_i, v ]\!]\) only if \([\![C_i, v ]\!]\) is a singleton set. Otherwise, we construct an over-approximation of the truth set of \(\lnot C_i\).
3 Automata-Based Model Counting
Once we have translated a set of constraints into an automaton we employ algebraic graph theory [5] and analytic combinatorics [14] to perform model counting. In our method, model counting corresponds exactly to counting the accepting paths of the constraint DFA up to a given length bound k. This problem can be solved using dynamic programming techniques in \(O(k \cdot |\delta |)\) time where \(\delta \) is the DFA transition relation [11, 16]. However, for each different bound, the dynamic programming technique requires another traversal of the DFA graph.
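The dynamic programming approach mentioned above can be sketched as follows: for each length i, keep the number of strings of length i that lead from the initial state to each state, and add up the values at accepting states. Each of the k + 1 rounds scans the transition table once. The function name `count_up_to` and the dictionary encoding of the DFA are assumptions of this example.

```python
def count_up_to(delta, initial, accepting, k):
    """Number of accepted strings of length at most k."""
    # paths[q] = number of strings of the current length that reach q.
    paths = {initial: 1}
    total = 0
    for _ in range(k + 1):
        total += sum(c for q, c in paths.items() if q in accepting)
        nxt = {}
        for (q, _symbol), r in delta.items():   # one pass over delta
            if q in paths:
                nxt[r] = nxt.get(r, 0) + paths[q]
        paths = nxt
    return total

# DFA for Sigma* ab Sigma* over {a, b} (strings containing "ab").
contains_ab = {(0, "a"): 1, (0, "b"): 0,
               (1, "a"): 1, (1, "b"): 2,
               (2, "a"): 2, (2, "b"): 2}

print(count_up_to(contains_ab, 0, {2}, 3))
```

Because the DFA is deterministic, each accepted string corresponds to exactly one accepting path, so counting paths counts strings. Calling the function again with a different bound repeats the whole traversal, which is the drawback noted above.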
A preferable solution is to derive a symbolic function that given a length bound k outputs the number of solutions within bound k. To achieve this, we use the transfer matrix method [14, 30] to produce an ordinary generating function which in turn yields a linear recurrence relation that is used to count constraint solutions. We will briefly review the necessary background and then describe the model counting algorithm.
Given a DFA A, consider its corresponding language \(\mathcal {L}\). Let \(\mathcal{L}_{i} = \{w \in \mathcal{L}: |w| = i\}\), the language of strings in \(\mathcal{L}\) with length i. Then \(\mathcal{L}= \bigcup _{i \ge 0} \mathcal{L}_{i}\). Define \(|\mathcal{L}_i|\) to be the cardinality of \(\mathcal{L}_{i}\). The cardinality of \(\mathcal{L}\) can be computed by the sum of a series \(a_0,a_1, \ldots , a_i, \ldots \) where each \(a_i\) is the cardinality of the corresponding language \(\mathcal{L}_{i}\), i.e., \(a_i = |\mathcal{L}_{i}|\).
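The series \(a_0, a_1, \ldots\) can be computed from the transfer matrix of the DFA, whose entry T[p][q] is the number of alphabet symbols that lead from state p to state q. The sketch below uses plain Python lists and 0-based state numbers, both of which are assumptions made for the example; it computes \(a_i\) by repeated vector-matrix multiplication and does not derive the generating function symbolically.

```python
def transfer_matrix(n_states, delta):
    # T[p][q] = number of symbols c with delta(p, c) = q.
    T = [[0] * n_states for _ in range(n_states)]
    for (p, _symbol), q in delta.items():
        T[p][q] += 1
    return T

def length_counts(T, initial, accepting, k):
    """Return [a_0, ..., a_k], where a_i counts accepted strings of length i."""
    n = len(T)
    row = [0] * n
    row[initial] = 1                 # indicator vector of the initial state
    counts = []
    for _ in range(k + 1):
        counts.append(sum(row[q] for q in accepting))
        # Advance one symbol: row <- row * T
        row = [sum(row[p] * T[p][q] for p in range(n)) for q in range(n)]
    return counts

# DFA for strings over {a, b} that contain "ab".
contains_ab = {(0, "a"): 1, (0, "b"): 0,
               (1, "a"): 1, (1, "b"): 2,
               (2, "a"): 2, (2, "b"): 2}
T = transfer_matrix(3, contains_ab)
a = length_counts(T, 0, {2}, 8)
print(a)
```

For this particular three-state example, \(\det (I - zT) = 1 - 4z + 5z^2 - 2z^3\), and the counts satisfy the linear recurrence \(a_n = 4a_{n-1} - 5a_{n-2} + 2a_{n-3}\) read off from that denominator; the number of solutions within a bound k is the partial sum \(a_0 + a_1 + \ldots + a_k\).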
4 Implementation
We implemented Automata-Based model Counter for string constraints (ABC) using the symbolic string analysis library provided by the Stranger tool [36, 38, 39]. We used the symbolic DFA representation of the MONA DFA library [8] to implement the constructions described in Sect. 2. In MONA’s DFA library, the transition relation of the DFA is represented as a Multi-terminal Binary Decision Diagram (MBDD), which results in a compact representation of the transition relation. \(\textsc {ABC}\) supports more operations (such as \(\textsc {trim}\), \(\textsc {substring}\)) than the ones listed in Sect. 2 using constructions similar to the ones given in that section.
\(\textsc {ABC}\) supports the SMT-LIB 2 language syntax. We specifically added support for CVC4 string operations [24]. In string constraint benchmarks provided by CVC4, boolean variables are used to assert the results of subformulas. In our automata-based constraint solver, we check the satisfiability of a formula by checking if its truth set is empty or not. We eliminated the boolean variables that are only used to check the results of string operations (such as string equivalence, string membership) and instead substituted the corresponding expressions directly. We converted if-then-else structures into disjunctions. We also searched for several patterns between length equations and word equations to infer the values of the string variables whenever possible (for example, when we see the constraint \(\textsc {len}(x)=0\) we can infer that the string variable x must be equal to the empty string). These transformations allow us to convert some constraints to pseudo-relational constraints that we can precisely solve. If these transformations do not resolve all the cyclic dependencies in a constraint, then the resulting DFA may recognize an over-approximation of all possible solutions.
We implemented the automata-based model counting algorithm of Sect. 3 by passing the automaton transfer matrix to Mathematica for computing the generating function, corresponding recurrence relation, and the model count for a specific bound. Because the DFAs we encountered in our experiments typically have sparse transition graphs, we make use of Mathematica’s powerful and efficient implementations of symbolic sparse matrix determinant functions [33].
5 Experiments
Table 1. Constraint characteristics

Table 1 shows the frequency of string operations from our string constraint grammar that are contained in the ASE, Kaluza Small, and Kaluza Big benchmark sets. ASE benchmarks are from Java programs and represent server-side code [20]. The Kaluza benchmarks are taken from JavaScript programs and represent client-side code [28]. All three benchmarks have regular expression membership (\(\in \)), concatenation (.), string equality (\(=\)), and length constraints. However, the ASE benchmark contains additional string operations that are typically used for input sanitization, like \(\textsc {replace}\) and \(\textsc {substring}\).
Java Benchmarks. String constraints in these benchmarks are extracted from 7 real-world Java applications: Jericho HTML Parser, jxml2xql (an xml-to-sql converter), MathParser, MathQuizGame, Natural CLI (a natural language command line tool), Beasties (a command line game), HtmlCleaner, and iText (a PDF library) [20]. These benchmarks represent server-side code and employ many input-sanitizing string operators such as \(\textsc {replace}\) and \(\textsc {substring}\) as seen in Table 1. These string constraints were generated by extracting program path constraints through dynamic symbolic execution [20].
In [20], an empirical evaluation of several string constraint solvers is presented. As a part of this empirical evaluation, the authors use the symbolic string analysis library of Stranger [36, 38, 39] to construct automata for path constraints on strings. In order to evaluate the model counting component of \(\textsc {ABC}\), we ran their tool on the 7 benchmark sets and output the resulting automata whenever the constraint is satisfiable. Out of 116,164 string path constraints, 66,236 were found to be satisfiable and we performed model counting on those cases. The constraints in Java benchmarks are all single-variable or pseudo-relational constraints. The resulting automata do not have any over-approximation caused by relational constraints. As a measure of the size of the resulting automata, we give the number of BDD nodes used in the symbolic transition relation representation of MONA. The average number of BDD nodes for the satisfiable path constraints is 69.51 and the size of each BDD node is 16 bytes. For these benchmarks our model counter is efficient; the average running time of model counting per path constraint is 0.0015 seconds and the resulting model-counting recurrence is precise, i.e., gives the exact count for any given bound.
SMC and CVC4 are not able to handle the constraints in this data set since they do not support sanitization operations such as \(\textsc {replace}\).
Table 2. Log scaled comparison between SMC and ABC

           bound  SMC lower bound  SMC upper bound  ABC count
nullhttpd  500    3752             3760             3760
ghttpd     620    4880             4896             4896
csplit     629    4852             4921             4921
grep       629    4676             4763             4763
wc         629    4281             4284             4281
obscure    6      0                3                2
Table 3. Constraint-solver comparison

             ABC / CVC4  ABC / CVC4     ABC / CVC4   ABC / CVC4   ABC / CVC4
             sat / sat   unsat / unsat  sat / unsat  unsat / sat  sat / timeout
sat/small    19728       3              0            0            0
sat/big      1587        0              0            0            0
unsat/small  8139        3013           74           0            0
unsat/big    3419        5904           2385         0            2359
Satisfiability Checking Evaluation. We ran \(\textsc {ABC}\) on the SMT-LIB 2 translation of the full set of JavaScript benchmarks. We put a 20-second CPU timeout limit on \(\textsc {ABC}\) for each benchmark constraint. Table 3 shows the comparison between \(\textsc {ABC}\) and the CVC4 [24] constraint solver based on the CVC4 results that are available online. The first column shows the initial satisfiability classification of the data set by the creators of the benchmarks [28]. The next two columns show the number of results on which \(\textsc {ABC}\) and CVC4 agree. The last three columns show the cases where \(\textsc {ABC}\) and CVC4 differ. Note that, since \(\textsc {ABC}\) over-approximates the solution set, if the given constraint is not single-variable or pseudo-relational, it is possible for \(\textsc {ABC}\) to classify a constraint as satisfiable even if it is unsatisfiable. However, it is not possible for \(\textsc {ABC}\) to classify a constraint as unsatisfiable if it is satisfiable. Out of 47,284 benchmark constraints, \(\textsc {ABC}\) and CVC4 agree on 41,793 of them. As expected, \(\textsc {ABC}\) never classifies a constraint as unsatisfiable if CVC4 classifies it as satisfiable. However, due to over-approximation of relational constraints, \(\textsc {ABC}\) classifies 2,459 constraints as satisfiable although CVC4 classifies them as unsatisfiable. A practical approach would be to use \(\textsc {ABC}\) together with a satisfiability solver like CVC4: given a constraint, first use the satisfiability solver to determine the satisfiability of the formula, and then use \(\textsc {ABC}\) to generate its truth set and the model counting function.
The average automata construction time for big benchmark constraints is 0.44 seconds and for small benchmark constraints it is 0.01 seconds. CVC4 average running times are 0.18 seconds and 0.015 seconds, respectively (excluding timeouts). CVC4 times out for 2359 constraints, whereas \(\textsc {ABC}\) never times out. For those 2359 constraints, \(\textsc {ABC}\) reports satisfiable. \(\textsc {ABC}\) is unable to handle 672 constraints: the automata package we use (MONA) is unable to handle the resulting automata, and we believe that these cases can be solved by modifying MONA. For these 672 constraints, CVC4 times out for 29 of them, reports unsat for 246 of them, and reports sat for 397 of them. There are also a few thousand constraints from the Kaluza benchmarks that CVC4 is unable to handle.
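The aggregate figures in this paragraph can be recomputed from the rows of Table 3. The sketch below transcribes the table into a dictionary; the column keys (`sat_sat`, `unsat_unsat`, and so on, in ABC / CVC4 order) are names chosen here, and the assignment of values to columns assumes the column order shown in the table.

```python
# Rows of Table 3; each key names the (ABC result, CVC4 result) pair.
table3 = {
    "sat/small":   {"sat_sat": 19728, "unsat_unsat": 3,    "sat_unsat": 0,
                    "unsat_sat": 0, "sat_timeout": 0},
    "sat/big":     {"sat_sat": 1587,  "unsat_unsat": 0,    "sat_unsat": 0,
                    "unsat_sat": 0, "sat_timeout": 0},
    "unsat/small": {"sat_sat": 8139,  "unsat_unsat": 3013, "sat_unsat": 74,
                    "unsat_sat": 0, "sat_timeout": 0},
    "unsat/big":   {"sat_sat": 3419,  "unsat_unsat": 5904, "sat_unsat": 2385,
                    "unsat_sat": 0, "sat_timeout": 2359},
}

def column_total(column):
    return sum(row[column] for row in table3.values())

agreements = column_total("sat_sat") + column_total("unsat_unsat")
abc_sat_cvc4_unsat = column_total("sat_unsat")
abc_unsat_cvc4_sat = column_total("unsat_sat")
cvc4_timeouts = column_total("sat_timeout")

print(agreements, abc_sat_cvc4_unsat, abc_unsat_cvc4_sat, cvc4_timeouts)
```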
6 Conclusions and Future Work
We presented a model-counting string constraint solver that, given a constraint, generates: (1) an automaton that accepts all solutions to the given string constraint; (2) a model-counting function that, given a length bound, returns the number of solutions within that bound. Our experiments on thousands of constraints extracted from real-world web applications demonstrate the effectiveness and efficiency of the proposed approach. Our string constraint solver can be used in quantitative information flow, probabilistic analysis, and automated repair synthesis. We plan to extend our automata-based model-counting approach to Presburger arithmetic constraints using an automata-based representation for Presburger arithmetic constraints [4, 34].
Footnotes
1. Results of our experiments are available at http://www.cs.ucsb.edu/~vlab/ABC/.
References
1. Abdulla, P.A., Atig, M.F., Chen, Y.F., Holík, L., Rezine, A., Rümmer, P., Stenman, J.: String constraints for verification. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 150–166. Springer, Heidelberg (2014)
2. Alkhalaf, M., Aydin, A., Bultan, T.: Semantic differential repair for input validation and sanitization. In: Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pp. 225–236 (2014)
3. Alkhalaf, M., Bultan, T., Gallegos, J.L.: Verifying client-side input validation functions using string analysis. In: Proceedings of the 34th International Conference on Software Engineering (ICSE), pp. 947–957 (2012)
4. Bartzis, C., Bultan, T.: Efficient symbolic representations for arithmetic constraints in verification. Int. J. Found. Comput. Sci. 14(4), 605–624 (2003)
5. Biggs, N.: Algebraic Graph Theory. Cambridge University Press, Cambridge Mathematical Library, Cambridge (1993)
6. Bjørner, N., Tillmann, N., Voronkov, A.: Path feasibility analysis for string-manipulating programs. In: Kowalewski, S., Philippou, A. (eds.) TACAS 2009. LNCS, vol. 5505, pp. 307–321. Springer, Heidelberg (2009)
7. Borges, M., Filieri, A., d’Amorim, M., Pasareanu, C.S., Visser, W.: Compositional solution space quantification for probabilistic software analysis. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2014)
8. BRICS. The MONA project. http://www.brics.dk/mona/
9. Christensen, A.S., Møller, A., Schwartzbach, M.I.: Precise analysis of string expressions. In: Proceedings of the 10th International Static Analysis Symposium (SAS), pp. 1–18 (2003)
10. Clark, D., Hunt, S., Malacaria, P.: A static analysis for quantifying information flow in a simple imperative language. J. Comput. Secur. 15(3), 321–371 (2007)
11. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms, 2nd edn. McGraw-Hill Higher Education, Boston (2001)
12. D’Antoni, L., Veanes, M.: Static analysis of string encoders and decoders. In: Giacobazzi, R., Berdine, J., Mastroeni, I. (eds.) VMCAI 2013. LNCS, vol. 7737, pp. 209–228. Springer, Heidelberg (2013)
13. Filieri, A., Pasareanu, C.S., Visser, W.: Reliability analysis in symbolic pathfinder. In: Proceedings of the 35th International Conference on Software Engineering (ICSE), pp. 622–631 (2013)
14. Flajolet, P., Sedgewick, R.: Analytic Combinatorics, 1st edn. Cambridge University Press, New York (2009)
15. Ganesh, V., Minnes, M., Solar-Lezama, A., Rinard, M.: Word equations with length constraints: what’s decidable? In: Biere, A., Nahir, A., Vos, T. (eds.) HVC. LNCS, vol. 7857, pp. 209–226. Springer, Heidelberg (2013)
16. Gross, J.L., Yellen, J., Zhang, P.: Handbook of Graph Theory, 2nd edn. Chapman and Hall/CRC, Boca Raton (2013)
17. Hooimeijer, P., Livshits, B., Molnar, D., Saxena, P., Veanes, M.: Fast and precise sanitizer analysis with bek. In: Proceedings of the 20th USENIX Conference on Security (2011)
18. Hooimeijer, P., Weimer, W.: A decision procedure for subset constraints over regular languages. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 188–198 (2009)
19. Hooimeijer, P., Weimer, W.: Solving string constraints lazily. In: Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 377–386 (2010)
20. Kausler, S., Sherman, E.: Evaluation of string constraint solvers in the context of symbolic execution. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE), pp. 259–270 (2014)
21. Kiezun, A., Ganesh, V., Guo, P.J., Hooimeijer, P., Ernst, M.D.: Hampi: a solver for string constraints. In: Proceedings of the 18th International Symposium on Software Testing and Analysis (ISSTA), pp. 105–116 (2009)
22. Knuth, D.E.: The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley, Reading (1968)
23. Li, G., Ghosh, I.: PASS: string solving with parameterized array and interval automaton. In: Bertacco, V., Legay, A. (eds.) HVC 2013. LNCS, vol. 8244, pp. 15–31. Springer, Heidelberg (2013)
24. Liang, T., Reynolds, A., Tinelli, C., Barrett, C., Deters, M.: A DPLL(T) theory solver for a theory of strings and regular expressions. In: Biere, A., Bloem, R. (eds.) CAV 2014. LNCS, vol. 8559, pp. 646–662. Springer, Heidelberg (2014)
25. Luu, L., Shinde, S., Saxena, P., Demsky, B.: A model counter for constraints over unbounded strings. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), p. 57 (2014)
26. McCamant, S., Ernst, M.D.: Quantitative information flow as network flow capacity. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 193–205 (2008)
27. Phan, Q.S., Malacaria, P., Tkachuk, O., Păsăreanu, C.S.: Symbolic quantitative information flow. SIGSOFT Softw. Eng. Notes 37(6), 1–5 (2012)
28. Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic execution framework for javascript. In: Proceedings of the 31st IEEE Symposium on Security and Privacy (2010)
29. Smith, G.: On the foundations of quantitative information flow. In: de Alfaro, L. (ed.) FOSSACS 2009. LNCS, vol. 5504, pp. 288–302. Springer, Heidelberg (2009)
30. Stanley, R.P.: Enumerative Combinatorics: vol. 1, 2nd edn. Cambridge University Press, New York (2011)
31. Tateishi, T., Pistoia, M., Tripp, O.: Path- and index-sensitive string analysis based on monadic second-order logic. In: Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pp. 166–176 (2011)
32. Trinh, M.T., Chu, D.H., Jaffar, J.: S3: a symbolic string solver for vulnerability detection in web applications. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1232–1243 (2014)
33. Wolfram Research Inc., Mathematica (2014). http://www.wolfram.com/mathematica/
34. Wolper, P., Boigelot, B.: On the construction of automata from linear arithmetic constraints. In: Graf, S. (ed.) TACAS 2000. LNCS, vol. 1785, pp. 1–19. Springer, Heidelberg (2000)
35. Yu, F.: Automatic verification of string manipulating programs. Ph.D. thesis. University of California, Santa Barbara (2010)
36. Yu, F., Alkhalaf, M., Bultan, T.: Stranger: an automata-based string analysis tool for PHP. In: Esparza, J., Majumdar, R. (eds.) TACAS 2010. LNCS, vol. 6015, pp. 154–157. Springer, Heidelberg (2010)
37. Yu, F., Alkhalaf, M., Bultan, T.: Patching vulnerabilities with sanitization synthesis. In: Proceedings of the 33rd International Conference on Software Engineering (ICSE), pp. 131–134 (2011)
38. Yu, F., Alkhalaf, M., Bultan, T., Ibarra, O.H.: Automata-based symbolic string analysis for vulnerability detection. Formal Methods Syst. Des. 44(1), 44–70 (2014)
39. Yu, F., Bultan, T., Cova, M., Ibarra, O.H.: Symbolic string verification: an automata-based approach. In: Havelund, K., Majumdar, R. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 306–324. Springer, Heidelberg (2008)
40. Zheng, Y., Zhang, X., Ganesh, V.: Z3-str: a z3-based string solver for web application analysis. In: Proceedings of the 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 114–124 (2013)