
1 Introduction

Analyzing string-manipulating code is of great importance because string manipulation is ubiquitous in modern software systems, such as web applications and database services. String analysis aims to determine the set of assignments to the string variables in string expressions that may arise from program execution or other sources. It can be applied, e.g., to identify security vulnerabilities by checking whether a security-sensitive function can receive an input string that contains an exploit [24, 29, 32], to identify behaviors of JavaScript code that uses the eval function by computing the string values that can reach the eval function [15], to identify HTML generation errors by computing the HTML code generated by web applications [20], to identify the set of queries that are sent to the back-end database by analyzing the code that generates the SQL queries [12], and to patch input validation and sanitization functions by automatically synthesizing repairs [31].

Prior string analysis methods are mainly automata-based or satisfiability-based. Among automata-based approaches, explicit state-graph represented finite automata [8, 13], MTBDD-represented finite automata [2, 32], and Boolean algebra represented symbolic finite automata [10, 27, 28] have been proposed. By characterizing a set of strings as a language, these methods are not restricted to particular bounds on string lengths. They can be used to synthesize filters or sanitizers [31] that screen out malicious string inputs to systems under protection, but they have difficulty generating counterexamples at system inputs to witness a vulnerability. Among satisfiability-based approaches, bit-vector based bounded checking [4, 18, 19, 23] and satisfiability modulo theories (SMT) based constraint solving [1, 3, 26, 34] have been proposed. They can answer certain string queries with length constraints that automata-based methods cannot handle. By searching for a solution to a given set of string constraints, they can generate counterexamples to witness a vulnerability, but they cannot support the synthesis of string filters amenable to firmware or hardware implementation for real-time screening of malicious inputs to a system under protection.

In this paper, we intend to support string analysis of acyclic constraints with both counterexample generation and filter synthesis capabilities. To achieve this goal, we develop a nondeterministic finite automata (NFA) manipulation engine with a logic circuit representation. In particular, we adopt the and-inverter graph (AIG) [21], which has been widely adopted in logic synthesis for industrial applications in electronic design automation (EDA) in recent years, as the underlying data structure. Thereby, automata manipulations can be performed implicitly using logic circuits while determinization is largely avoided. Our method is scalable to automata with large alphabet sizes, in contrast to BDD-based automata representations. We further extend our method to represent symbolic finite automata [10], which may have infinite (or very large) alphabets [25]. Our method enables the generation of counterexamples for backtracking attack input strings to a vulnerable application and the synthesis of filters amenable to firmware or hardware implementation to prevent exploits of vulnerabilities in real time. The proposed method is implemented as a new string analysis tool, named SLOG. We conduct a comprehensive experimental study on over 20000 string constraints generated from real web applications, comparing SLOG with state-of-the-art tools, including JSA [8], Stranger [30], Z3-str2 [34], CVC4 [3], and Norn [1]. Experiments suggest the performance advantage of SLOG over other string solvers with counterexample generation capabilities. Moreover, SLOG scales to automata with large alphabets, in contrast to BDD-based methods of automata representation.

2 Preliminaries

A finite automaton A is a five-tuple \((Q, \varSigma , I, T, O)\), where Q is a finite state set, \(\varSigma \) is an alphabet, \(I \subseteq Q\) is a set of initial states, \(T \subseteq \varSigma \times Q \times Q\) is a state transition relation, and \(O \subseteq Q\) is a set of accepting states. In the sequel, we shall instead represent the initial states, transition relation, and accepting states in terms of characteristic functions \(I: Q \rightarrow \mathbb {B}\), \(T: \varSigma \times Q \times Q \rightarrow \mathbb {B}\), and \(O: Q \rightarrow \mathbb {B}\), respectively. (A characteristic function \(\chi \) represents a (Boolean encoded) set S by having \(\chi (e) = 1\) (True) if \(e \in S\) and \(\chi (e) = 0\) (False) if \(e \not \in S\).) A finite automaton is either a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA), depending on the determinism of its state transitions. In the sequel, we refer to \({{\varvec{x}}}\), \({{\varvec{s}}}\), and \({{\varvec{s}}}'\) as the input, current-state, and next-state variables in the Boolean domain, and relate the valuations of variables \({{\varvec{x}}}\), denoted \([\![ {{\varvec{x}}} ]\!]\), and the valuations of variables \({{\varvec{s}}}\), denoted \([\![ {{\varvec{s}}} ]\!]\), to the alphabet \(\varSigma \) and the state set Q, respectively. A trace of an automaton is a state-input alternating sequence \(q_1\), \(\sigma _1\), \(q_2\), \(\sigma _2\), ..., \(q_\ell \) that satisfies \(T(\sigma _i, q_i, q_{i+1})\) for \(i=1, \ldots , \ell -1\).
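As a concrete reference for these definitions, the following sketch works with explicit Python sets and predicates rather than the Boolean-encoded circuits used in the paper; all names are illustrative. It checks acceptance by tracking the set of states reachable under characteristic functions I, T, O while scanning a word (\(\epsilon \)-transitions are ignored here for brevity).

```python
# Reference semantics only (explicit sets, not the circuit representation):
# an NFA given by characteristic functions I, T, O, with acceptance decided
# by tracking the set of reachable states while scanning the string.
def accepts(states, I, T, O, word):
    """I(q) -> bool, O(q) -> bool, T(x, q, q2) -> bool (transition relation)."""
    current = {q for q in states if I(q)}              # all initial states
    for x in word:
        current = {q2 for q in current for q2 in states if T(x, q, q2)}
    return any(O(q) for q in current)                  # some accepting state hit

# Example: NFA over {'a', 'b'} accepting exactly the strings ending in 'b'.
states = {0, 1}
I = lambda q: q == 0
O = lambda q: q == 1
T = lambda x, q, q2: q2 == (1 if x == 'b' else 0)
```

Because the transition relation is queried as a predicate, the same `accepts` works unchanged for nondeterministic relations.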

A (finite) string \(\sigma _1, \ldots , \sigma _n\), for \(n \ge 0\) (an empty string, denoted \(\epsilon \), when \(n=0\)), over alphabet \(\varSigma \) is accepted by an automaton if there exist states \(q_1, \ldots , q_{n+1}\) such that \(I(q_1)=1\) (for \(q_1\) being an initial state), \(O(q_{n+1})=1\) (for \(q_{n+1}\) being an accepting state), and the sequence \(q_1, \sigma _1, q_2, \sigma _2, \ldots , q_{n+1}\) forms a trace. The set of strings accepted by an automaton A is called the (regular) language accepted by A, denoted as \(\mathcal {L}(A)\).

Because a finite automaton \(A = (Q, \varSigma , I, T, O)\) can be fully described by the characteristic functions of I, T, and O, with Boolean encoding on Q and \(\varSigma \) the automaton A can be represented as a logic circuit, denoted \(\mathcal {C}(A)\), that realizes these characteristic functions. In the sequel, we shall not distinguish between characteristic functions ITO and their circuit representations. In this work, we show how various string and automata manipulations can be achieved under the logic circuit representation of (nondeterministic) finite automata. For practical implementation, we exploit the and-inverter graph (AIG) [21] as the underlying data structure for scalable logic circuit representation and manipulation. An AIG is a directed acyclic graph G(V, E), where each vertex \(v \in V\) is either a primary input node without any fanin or a function node representing a two-input and gate, and each edge \((u,v) \in E\) denotes a complemented or uncomplemented connection from vertex u to v. Due to its simplicity, the AIG has been efficiently implemented as a Boolean reasoning engine widely used in various logic synthesis and formal verification tasks in industrial very-large-scale integration (VLSI) designs.
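As a toy illustration of the AIG structure (a hand-rolled sketch, not the ABC system's actual API), the following builds XOR out of two-input AND nodes and complemented edges, the only primitives an AIG provides:

```python
# Toy AIG sketch (illustrative only): a node is either a primary input or a
# two-input AND gate; each fanin edge carries a "complemented" flag.
class AigNode:
    def __init__(self, name=None, fanin0=None, fanin1=None):
        self.name = name        # non-None for primary inputs
        self.fanin0 = fanin0    # (node, complemented?) pair for AND nodes
        self.fanin1 = fanin1

    def eval(self, assignment):
        if self.name is not None:                      # primary input
            return assignment[self.name]
        (u, cu), (v, cv) = self.fanin0, self.fanin1
        return (u.eval(assignment) ^ cu) and (v.eval(assignment) ^ cv)

a, b = AigNode("a"), AigNode("b")
n1 = AigNode(fanin0=(a, False), fanin1=(b, False))   # a AND b
n2 = AigNode(fanin0=(a, True), fanin1=(b, True))     # NOT a AND NOT b
xor = AigNode(fanin0=(n1, True), fanin1=(n2, True))  # NOT n1 AND NOT n2 = a XOR b
```

Inversion costs nothing (a flag on an edge), which is one reason AIGs scale well as a reasoning substrate.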

In the sequel, we assume a finite automaton can be nondeterministic and may even take \(\epsilon \)-transitions. To represent an \(\epsilon \)-transition under the circuit representation, we reserve a symbol “\(\epsilon \)” as an addendum to \(\varSigma \) with a special handling. Given a state transition relation T, we denote its equivalent variant with an \(\epsilon \) self-transition inserted for each state as \(T^\epsilon \). That is, \(T^\epsilon ({{\varvec{x}}},{{\varvec{s}}},{{\varvec{s}}}')\) represents \(T({{\varvec{x}}},{{\varvec{s}}},{{\varvec{s}}}') \vee (({{\varvec{s}}}={{\varvec{s}}}') \wedge ({{\varvec{x}}} = \epsilon ))\).
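The \(T^\epsilon \) variant can be sketched at the predicate level as follows; `EPS` stands for the reserved symbol \(\epsilon \), and the helper name is hypothetical.

```python
# Sketch of T^eps: insert an epsilon self-transition at every state.
# EPS is the reserved symbol added as an addendum to the alphabet Sigma.
EPS = "eps"

def t_eps(T):
    """Lift a transition relation T(x, q, q2) to its epsilon-augmented variant."""
    return lambda x, q, q2: T(x, q, q2) or (x == EPS and q == q2)

T = lambda x, q, q2: x == 'a' and q == 0 and q2 == 1   # a single 'a'-transition
Te = t_eps(T)
```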

Given a web application and an attack pattern (specified as a regular expression), we can first extract dependency graphs for security-sensitive functions, called the sinks, from the web application using static program analysis techniques [14, 17]. Each extracted dependency graph shows how the input values flow to a sink, including all the string operations performed on the input values before they reach the sink. A dependency graph is vulnerable if its sink node accepts an attack string (with respect to the given attack pattern). From the dependency graph, we can generate string constraint formulas and check whether the intersection of the sink node’s language and the attack pattern is empty. If it is empty, then the web application is not vulnerable. Otherwise, a counterexample witnessing the vulnerability is to be computed.

3 String and Automata Operations

We show that string/language operations, including intersection, union, concatenation, deletion, replacement, and emptiness checking, can be achieved under the logic circuit representation. We omit other less-used operations, including reversal, prefix, suffix, and substring, due to space limitations.

In the following exposition, we assume an automaton A (or \(A_i\)) is represented as a circuit of its characteristic functions \(T({{\varvec{x}}},{{\varvec{s}}},{{\varvec{s}}}')\), \(I({{\varvec{s}}})\), and \(O({{\varvec{s}}})\) (or \(T_i({{\varvec{x}}},{{\varvec{s}}}_i,{{\varvec{s}}}_i')\), \(I_i({{\varvec{s}}}_i)\), and \(O_i({{\varvec{s}}}_i)\) for i = 1, 2, 3). Also we assume without loss of generality that \(|{{\varvec{s}}}_1| = m\) and \(|{{\varvec{s}}}_2| = n\) for automata \(A_1\) and \(A_2\), respectively, with \(m \le n\) in our following discussion, unless stated otherwise.

Fig. 1. Circuit construction of (a) Int, (b) Uni, (c) Cat, (d) \(\textsc {Del}_\xi \), and (e) IsEmp operations.

3.1 Intersection

Given two automata \(A_1\) and \(A_2\), the automaton \(A_\textsc {Int} = \textsc {Int}(A_1, A_2)\) that accepts the language \(\mathcal {L}(A_\textsc {Int}) = \mathcal {L}(A_1) \cap \mathcal {L}(A_2)\) is the product machine with characteristic functions \(T_\textsc {Int}\), \(I_\textsc {Int}\), \(O_\textsc {Int}\), constructed by first augmenting the transition relations \(T_1\) and \(T_2\) to \(T_1^\epsilon \) and \(T_2^\epsilon \), respectively, by inserting an \(\epsilon \) self-transition for each state, and then conjoining the resultant characteristic functions of \(A_1\) and \(A_2\). Accordingly, we have

\(A_\textsc {Int} = (Q_\textsc {Int}, \varSigma , I_\textsc {Int}, T_\textsc {Int}, O_\textsc {Int})\)

with

$$\begin{aligned} T_\textsc {Int}({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} T_1^\epsilon ({{\varvec{x}}},{{\varvec{s}}}_1,{{\varvec{s}}}_1') \wedge T_2^\epsilon ({{\varvec{x}}},{{\varvec{s}}}_2,{{\varvec{s}}}_2'), \\ I_\textsc {Int}({{\varvec{s}}})= & {} I_1({{\varvec{s}}}_1)\wedge I_2({{\varvec{s}}}_2), \\ O_\textsc {Int}({{\varvec{s}}})= & {} O_1({{\varvec{s}}}_1)\wedge O_2({{\varvec{s}}}_2), \end{aligned}$$

for \({{\varvec{s}}} = ({{\varvec{s}}}_1,{{\varvec{s}}}_2)\). The corresponding circuit construction is illustrated in Fig. 1(a). The constructed circuit is of size \(O(|\mathcal {C}(A_1)|+|\mathcal {C}(A_2)|)\) and has \((|{{\varvec{s}}}_1|+|{{\varvec{s}}}_2|)\) state variables.
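At the set level (explicit transition triples rather than circuits; all names illustrative), the Int construction reads as follows: insert \(\epsilon \) self-loops in both automata, then synchronize them on equal input symbols.

```python
# Set-level sketch of Int (reference semantics, not the AIG circuit):
# T_Int = T1^eps AND T2^eps on paired states, I_Int = I1 AND I2, O_Int = O1 AND O2.
EPS = "eps"

def with_eps(trans, states):
    return trans | {(EPS, q, q) for q in states}   # epsilon self-loop per state

def intersect(nfa1, nfa2):
    """Each NFA is (states, trans, init, acc) with trans a set of (x, q, q2)."""
    S1, T1, I1, O1 = nfa1
    S2, T2, I2, O2 = nfa2
    T1e, T2e = with_eps(T1, S1), with_eps(T2, S2)
    states = {(p, q) for p in S1 for q in S2}
    trans = {(x, (p, q), (p2, q2))                 # synchronize on equal input
             for (x, p, p2) in T1e for (y, q, q2) in T2e if x == y}
    init = {(p, q) for p in I1 for q in I2}
    acc = {(p, q) for p in O1 for q in O2}
    return states, trans, init, acc

A1 = ({0, 1}, {('a', 0, 1)}, {0}, {1})
A2 = ({0, 1}, {('a', 0, 1), ('b', 0, 0)}, {0}, {1})
S, Tr, In, Ac = intersect(A1, A2)
```

The inserted self-loops let one component take a genuine \(\epsilon \)-move while the other stays put, which matters once \(\epsilon \)-transitions arise from Cat or Del.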

3.2 Union

Given two automata \(A_1\) and \(A_2\), the automaton \(A_\textsc {Uni} = \textsc {Uni}(A_1,A_2)\) that accepts language \(\mathcal {L}(A_\textsc {Uni}) = \mathcal {L}(A_1) \cup \mathcal {L}(A_2)\) can be constructed by disjointly unioning the two with state variables being merged and states being distinguished by an auxiliary variable \(\alpha \), similar to the multiplexed machine in [16], as follows.

\(A_\textsc {Uni} = (Q_\textsc {Uni}, \varSigma , I_\textsc {Uni}, T_\textsc {Uni}, O_\textsc {Uni})\)

with

$$\begin{aligned} T_\textsc {Uni}({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} (\lnot \alpha \wedge \lnot \alpha ' \wedge T_1({{\varvec{x}}}, \langle {{\varvec{s}}}_2 \rangle _m, \langle {{\varvec{s}}}_2'\rangle _m)) \vee (\alpha \wedge \alpha ' \wedge T_2({{\varvec{x}}}, {{\varvec{s}}}_2, {{\varvec{s}}}_2')), \\ I_\textsc {Uni}({{\varvec{s}}})= & {} (\lnot \alpha \wedge I_1(\langle {{\varvec{s}}}_2 \rangle _m)) \vee (\alpha \wedge I_2({{\varvec{s}}}_2)), \\ O_\textsc {Uni}({{\varvec{s}}})= & {} (\lnot \alpha \wedge O_1(\langle {{\varvec{s}}}_2 \rangle _m)) \vee (\alpha \wedge O_2({{\varvec{s}}}_2)), \end{aligned}$$

where \({{\varvec{s}}} = ({{\varvec{s}}}_2,\alpha )\) and the bracket “\(\langle {{\varvec{s}}}_2 \rangle _m\)” indicates taking the subset of the first m variables of \({{\varvec{s}}}_2\). Essentially, the state variables \({{\varvec{s}}}_1\) of \(A_1\) are merged into \({{\varvec{s}}}_2\) so that the first m variables of \({{\varvec{s}}}_2\) are shared by both \(A_1\) and \(A_2\). Moreover, the \(\alpha \) bit signifies the states of \(A_1\) by \(\alpha = 0\), and the states of \(A_2\) by \(\alpha = 1\). That is, a state \(q \in [\![ {{\varvec{s}}} ]\!]\) belongs to \(A_1\) if its variable \(\alpha \) valuates to 0, and to \(A_2\) if \(\alpha \) valuates to 1. The corresponding circuit construction is illustrated in Fig. 1(b). The constructed circuit is of size \(O(|\mathcal {C}(A_1)|+|\mathcal {C}(A_2)|)\) and has \((\max \{|{{\varvec{s}}}_1|,|{{\varvec{s}}}_2|\}+1)\) state variables.
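A set-level sketch of Uni (reference semantics; in the circuit version the state variables themselves are shared, whereas here the \(\alpha \) tag is kept explicit for readability):

```python
# Set-level sketch of Uni: a disjoint union in which the auxiliary bit alpha
# tags A1-states with 0 and A2-states with 1; alpha never flips mid-run.
def union(nfa1, nfa2):
    """Each NFA is (states, trans, init, acc) with trans a set of (x, q, q2)."""
    S1, T1, I1, O1 = nfa1
    S2, T2, I2, O2 = nfa2
    states = {(q, 0) for q in S1} | {(q, 1) for q in S2}
    trans = ({(x, (q, 0), (q2, 0)) for (x, q, q2) in T1} |   # alpha stays 0
             {(x, (q, 1), (q2, 1)) for (x, q, q2) in T2})    # alpha stays 1
    init = {(q, 0) for q in I1} | {(q, 1) for q in I2}
    acc = {(q, 0) for q in O1} | {(q, 1) for q in O2}
    return states, trans, init, acc

A1 = ({0, 1}, {('a', 0, 1)}, {0}, {1})
A2 = ({0, 1}, {('b', 0, 1)}, {0}, {1})
S, Tr, In, Ac = union(A1, A2)
```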

3.3 Concatenation

Given two automata \(A_1\) and \(A_2\), the automaton \(A_\textsc {Cat} = \textsc {Cat}(A_1,A_2)\) that accepts language \(\mathcal {L}(A_\textsc {Cat}) = (\mathcal {L}(A_1).\mathcal {L}(A_2))\), which contains the set of concatenated strings \(\varvec{\sigma }_1.\varvec{\sigma }_2\) for \(\varvec{\sigma }_1 \in \mathcal {L}(A_1)\) and \(\varvec{\sigma }_2 \in \mathcal {L}(A_2)\), can be constructed, in a way similar to Uni, as follows.

\(A_\textsc {Cat} = (Q_\textsc {Cat}, \varSigma , I_\textsc {Cat}, T_\textsc {Cat}, O_\textsc {Cat})\)

with

$$\begin{aligned} T_\textsc {Cat}({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} (\lnot \alpha \wedge \lnot \alpha ' \wedge T_1({{\varvec{x}}}, \langle {{\varvec{s}}}_2 \rangle _m, \langle {{\varvec{s}}}_2'\rangle _m)) \vee (\alpha \wedge \alpha ' \wedge T_2({{\varvec{x}}}, {{\varvec{s}}}_2, {{\varvec{s}}}_2')) \vee \\&(({{\varvec{x}}}= \epsilon ) \wedge \lnot \alpha \wedge \alpha ' \wedge O_1(\langle {{\varvec{s}}}_2 \rangle _m) \wedge I_2({{\varvec{s}}}_2')), \\ I_\textsc {Cat}({{\varvec{s}}})= & {} \lnot \alpha \wedge I_1(\langle {{\varvec{s}}}_2 \rangle _m), \\ O_\textsc {Cat}({{\varvec{s}}})= & {} \alpha \wedge O_2({{\varvec{s}}}_2), \end{aligned}$$

for \({{\varvec{s}}} = ({{\varvec{s}}}_2,\alpha )\). The corresponding circuit construction is illustrated in Fig. 1(c). The constructed circuit is of size \(O(|\mathcal {C}(A_1)|+|\mathcal {C}(A_2)|)\) and has \((\max \{|{{\varvec{s}}}_1|,|{{\varvec{s}}}_2|\}+1)\) state variables.
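The bridging \(\epsilon \)-transition from the accepting states of \(A_1\) to the initial states of \(A_2\) can be sketched at the set level (with a small \(\epsilon \)-closure-based acceptance check to exercise it; all names illustrative):

```python
# Set-level sketch of Cat: tag A1-states with 0 and A2-states with 1, and add
# an epsilon bridge from each accepting state of A1 to each initial state of A2.
EPS = "eps"

def concat(nfa1, nfa2):
    S1, T1, I1, O1 = nfa1
    S2, T2, I2, O2 = nfa2
    states = {(q, 0) for q in S1} | {(q, 1) for q in S2}
    trans = ({(x, (q, 0), (q2, 0)) for (x, q, q2) in T1} |
             {(x, (q, 1), (q2, 1)) for (x, q, q2) in T2} |
             {(EPS, (o, 0), (i, 1)) for o in O1 for i in I2})  # the bridge
    return states, trans, {(q, 0) for q in I1}, {(q, 1) for q in O2}

def eps_closure(qs, trans):
    qs, frontier = set(qs), list(qs)
    while frontier:
        q = frontier.pop()
        for (x, p, p2) in trans:
            if x == EPS and p == q and p2 not in qs:
                qs.add(p2)
                frontier.append(p2)
    return qs

def accepts(nfa, word):
    _, trans, init, acc = nfa
    cur = eps_closure(init, trans)
    for ch in word:
        step = {q2 for (x, q, q2) in trans if x == ch and q in cur}
        cur = eps_closure(step, trans)
    return bool(cur & acc)

A1 = ({0, 1}, {('a', 0, 1)}, {0}, {1})        # accepts exactly "a"
A2 = ({0, 1}, {('b', 0, 1)}, {0}, {1})        # accepts exactly "b"
A = concat(A1, A2)
```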

3.4 Deletion

Given an automaton A, the automaton \(A_{\textsc {Del}_\xi } = \textsc {Del}(A, \xi )\) for \(\xi \in \varSigma \) that accepts the strings of \(\varvec{\sigma } \in \mathcal {L}(A)\) but with each appearance of symbol “\(\xi \)” in \(\varvec{\sigma }\) being removed can be constructed as follows.

\(A_{\textsc {Del}_\xi } = (Q, \varSigma , I_{\textsc {Del}_\xi }, T_{\textsc {Del}_\xi }, O_{\textsc {Del}_\xi })\)

with

$$\begin{aligned} T_{\textsc {Del}_\xi }({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} (T({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}') \vee (({{\varvec{x}}} = \epsilon ) \wedge T(\xi , {{\varvec{s}}}, {{\varvec{s}}}'))) \wedge ({{\varvec{x}}} \ne \xi ), \\ I_{\textsc {Del}_\xi }({{\varvec{s}}})= & {} I({{\varvec{s}}}),\\ O_{\textsc {Del}_\xi }({{\varvec{s}}})= & {} O({{\varvec{s}}}), \end{aligned}$$

(The deletion operation is a special case of the replacement operation by replacing an alphabet symbol with \(\epsilon \).) The corresponding circuit construction is illustrated in Fig. 1(d). The constructed circuit is of size \(O(|\mathcal {C}(A)|)\) and has \(|{{\varvec{s}}}|\) state variables.
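A set-level sketch of \(\textsc {Del}_\xi \) (reference semantics; helper name illustrative): every \(\xi \)-transition becomes an \(\epsilon \)-transition, and \(\xi \) is banned as an input symbol, mirroring \(T_{\textsc {Del}_\xi } = (T({{\varvec{x}}},{{\varvec{s}}},{{\varvec{s}}}') \vee (({{\varvec{x}}}=\epsilon ) \wedge T(\xi ,{{\varvec{s}}},{{\varvec{s}}}'))) \wedge ({{\varvec{x}}} \ne \xi )\).

```python
# Set-level sketch of Del_xi: turn every xi-transition into an epsilon-
# transition and drop xi from the usable input alphabet.
EPS = "eps"

def delete_symbol(nfa, xi):
    states, trans, init, acc = nfa
    new_trans = ({(x, q, q2) for (x, q, q2) in trans if x != xi} |
                 {(EPS, q, q2) for (x, q, q2) in trans if x == xi})
    return states, new_trans, init, acc

A = ({0, 1, 2}, {('a', 0, 1), ('#', 1, 2)}, {0}, {2})   # accepts "a#"
_, Tr, _, _ = delete_symbol(A, '#')                     # now accepts "a"
```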

3.5 Replacement

Given three automata \(A_1\), \(A_2\), \(A_3\), we study how to construct the automaton \(A_\textsc {Rep} = \textsc {Rep}(A_1, A_2, A_3)\) that accepts the language \(\{(\varvec{\sigma }_{\mathbf{1}}.\varvec{\tau }_1.\varvec{\sigma }_2.\varvec{\tau }_2\ldots )\) \(\in \varSigma ^*\ |\ (\varvec{\sigma }_{\mathbf{1}}. \varvec{\rho }_1.\varvec{\sigma }_2.\varvec{\rho }_2\ldots )\) \(\in \mathcal {L}(A_1),\) \(\varvec{\sigma }_{{\varvec{i}}} \not \in (\varSigma ^*.\mathcal {L}(A_2).\varSigma ^*)\), \(\varvec{\rho }_i \in \mathcal {L}(A_2)\) and \(\varvec{\tau }_i \in \mathcal {L}(A_3)\) for all \(i \}\), that is, replacing \(\mathcal {L}(A_2)\) with \(\mathcal {L}(A_3)\) in \(\mathcal {L}(A_1)\). Based upon [32], we construct the automaton \(A_\textsc {Rep}\) as follows.


First, we build automaton \(A_1^{\triangleleft \triangleright }\), which parenthesizes arbitrary substrings of a string in \(\mathcal {L}(A_1)\) with two fresh symbols “\(\lhd \)” and “\(\rhd \)”. This yields from \(A_1\) the automaton \(A_1^{\triangleleft \triangleright }\) with

$$\begin{aligned} T_{1}^{\triangleleft \triangleright }= & {} ((\alpha = \alpha ') \wedge ({{\varvec{x}}} \ne \lhd ) \wedge ({{\varvec{x}}} \ne \rhd ) \wedge T_1({{\varvec{x}}}, {{\varvec{s}}}_1, {{\varvec{s}}}_1')) \vee \\&( ({{\varvec{s}}}_1 = {{\varvec{s}}}_1') \wedge ((\lnot \alpha \wedge \alpha ' \wedge ({{\varvec{x}}} = \lhd )) \vee (\alpha \wedge \lnot \alpha ' \wedge ({{\varvec{x}}} = \rhd )) )), \\ I_{1}^{\triangleleft \triangleright }= & {} \lnot \alpha \wedge I_1({{\varvec{s}}}_1), \\ O_{1}^{\triangleleft \triangleright }= & {} \lnot \alpha \wedge O_1({{\varvec{s}}}_1). \end{aligned}$$

The above construction makes two copies of the state space, distinguished by variable \(\alpha \). When the input symbol is not equal to \(\lhd \) or \(\rhd \), the state transition is the same as in \(A_1\). When the input symbol equals \(\lhd \) (resp. \(\rhd \)), the state in the \(\alpha =0\) (resp. \(\alpha =1\)) space transitions to its counterpart in the \(\alpha =1\) (resp. \(\alpha =0\)) space.

Second, we build automaton \(A_{4}\), which is the automaton that accepts the strings \(\{ (\varvec{\sigma }_1.\lhd . \varvec{\rho }_1. \rhd . \varvec{\sigma }_2. \lhd . \varvec{\rho }_2. \rhd \ldots )\) \(\in \varSigma ^*\ |\ \) \(\varvec{\sigma }_i \in \overline{\varSigma ^*.\mathcal {L}(A_2).\varSigma ^*}\) and \(\varvec{\rho }_i \in \mathcal {L}(A_2) \}\). Let \(A_h\) be the automaton that accepts the language \(\overline{\varSigma ^*.\mathcal {L}(A_2).\varSigma ^*}\) with characteristic functions \(T_h({{\varvec{x}}}, {{\varvec{s}}}_h, {{\varvec{s}}}_h')\), \(I_h({{\varvec{s}}}_h)\), \(O_h({{\varvec{s}}}_h)\). Notice that constructing the automaton \(A_h\) requires complementing an NFA and is of exponential cost. Fortunately, in most applications the automaton \(A_2\) is known a priori and thus \(A_h\) can be precomputed. Given \(A_2\) and \(A_h\), assuming without loss of generality \(|{{\varvec{s}}}_h|=n \ge |{{\varvec{s}}}_2| = m\), the automaton \(A_{4}\) can be derived as follows.

$$\begin{aligned} T_4= & {} (\lnot \beta \wedge \lnot \beta ' \wedge ({{\varvec{x}}} \ne \lhd ) \wedge ({{\varvec{x}}} \ne \rhd ) \wedge T_h({{\varvec{x}}}, {{\varvec{s}}}_h, {{\varvec{s}}}_h')) \vee \\&(\beta \wedge \beta ' \wedge ({{\varvec{x}}} \ne \lhd ) \wedge ({{\varvec{x}}} \ne \rhd ) \wedge T_2({{\varvec{x}}}, \langle {{\varvec{s}}}_h\rangle _m, \langle {{\varvec{s}}}_h'\rangle _m)) \vee \\&(\lnot \beta \wedge \beta ' \wedge ({{\varvec{x}}} = \lhd ) \wedge O_h({{\varvec{s}}}_h) \wedge I_2(\langle {{\varvec{s}}}_h'\rangle _m)) \vee \\&(\beta \wedge \lnot \beta ' \wedge ({{\varvec{x}}} = \rhd ) \wedge O_2(\langle {{\varvec{s}}}_h\rangle _m) \wedge I_h({{\varvec{s}}}_h')), \\ I_4= & {} \lnot \beta \wedge I_h({{\varvec{s}}}_h), \\ O_4= & {} \lnot \beta \wedge (O_h({{\varvec{s}}}_h) \vee I_h({{\varvec{s}}}_h)). \end{aligned}$$

Third, let \(A_5 = \textsc {Int}(A_1^{\triangleleft \triangleright }, A_4)\) with characteristic functions \(T_5({{\varvec{x}}}, {{\varvec{s}}}_5, {{\varvec{s}}}_5')\), \(I_5({{\varvec{s}}}_5)\), \(O_5({{\varvec{s}}}_5)\), where \({{\varvec{s}}}_5 = ({{\varvec{s}}}_1, \alpha , {{\varvec{s}}}_4)\) with \({{\varvec{s}}}_4 = ({{\varvec{s}}}_h, \beta )\). Hence \(A_5\) accepts the strings in \(\mathcal {L}(A_1)\) with all the substrings in \(\mathcal {L}(A_2)\) being marked. Then, in \(\mathcal {L}(A_5)\) instead of replacing substrings \(\lhd \mathcal {L}(A_2) \rhd \) with strings in \(\mathcal {L}(A_3)\), we replace \(\lhd \) with \(\mathcal {L}(A_3)\), \(\rhd \) with \(\epsilon \), and \(\mathcal {L}(A_2)\) with \(\epsilon \). We obtain

$$\begin{aligned} T_\textsc {Rep}({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} (\lnot \alpha \wedge \lnot \alpha ' \wedge T_5({{\varvec{x}}}, {{\varvec{s}}}_5, {{\varvec{s}}}_5') \wedge \lnot \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3')) \vee \\&(\lnot \alpha \wedge \lnot \alpha ' \wedge ({{\varvec{s}}}_5 = {{\varvec{s}}}_5') \wedge ({{\varvec{x}}}= \epsilon ) \wedge \lnot \gamma \wedge \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3')) \vee \\&(\lnot \alpha \wedge \lnot \alpha ' \wedge ({{\varvec{s}}}_5 = {{\varvec{s}}}_5') \wedge \gamma \wedge \gamma ' \wedge T_3({{\varvec{x}}}, {{\varvec{s}}}_3, {{\varvec{s}}}_3')) \vee \\&(\lnot \alpha \wedge \alpha ' \wedge T_5(\lhd , {{\varvec{s}}}_5, {{\varvec{s}}}_5') \wedge ({{\varvec{x}}} = \epsilon ) \wedge \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3') \wedge O_3({{\varvec{s}}}_3)) \vee \\&(\alpha \wedge \alpha ' \wedge \exists {{\varvec{y}}}. [T_5({{\varvec{y}}}, {{\varvec{s}}}_5, {{\varvec{s}}}_5')] \wedge ({{\varvec{x}}} = \epsilon ) \wedge \lnot \gamma \wedge \lnot \gamma '\wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3')) \vee \\&(\alpha \wedge \lnot \alpha ' \wedge T_5(\rhd , {{\varvec{s}}}_5, {{\varvec{s}}}_5') \wedge ({{\varvec{x}}}=\epsilon ) \wedge \lnot \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3')), \\ I_\textsc {Rep}({{\varvec{s}}})= & {} \lnot \gamma \wedge I_5({{\varvec{s}}}_5) \wedge I_3({{\varvec{s}}}_3), \\ O_\textsc {Rep}({{\varvec{s}}})= & {} \lnot \gamma \wedge O_5({{\varvec{s}}}_5) \wedge I_3({{\varvec{s}}}_3), \end{aligned}$$

for \({{\varvec{s}}}= ({{\varvec{s}}}_5, {{\varvec{s}}}_3, \gamma )\).

The constructed circuit is of size \(O(|\mathcal {C}(A_1)|+|\mathcal {C}(A_2)|+|\mathcal {C}(A_h)|+|\mathcal {C}(A_3)|)\) and has \(|{{\varvec{x}}}|\) quantified internal variables.

3.6 Emptiness Checking

One important query about an automaton A, IsEmp(A), asks whether the language \(\mathcal {L}(A)\) is empty. We employ property directed reachability (PDR) [11], an implementation of the state-of-the-art model checking algorithm IC3 [5] in the Berkeley ABC system [6], to test whether an accepting state is reachable from an initial state in A. Note that PDR accepts as input only a sequential circuit specified in transition functions, rather than a transition relation; furthermore, it assumes the given circuit has a single initial state. Unfortunately, because our automata are nondeterministic in general, their nondeterministic transitions can only be specified using transition relations, and they may have multiple initial states.

To overcome the above mismatch between transition relation and transition function, we devise a mechanism that converts the \((T({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}'), I({{\varvec{s}}}), O({{\varvec{s}}}))\) representation of an NFA A into a form acceptable to PDR as follows. To handle the single-initial-state restriction, let \(A_\epsilon \) be the automaton accepting only the \(\epsilon \) string, which consists of a single initial accepting state without any transition. We modify A by \(\textsc {Cat}(A_\epsilon ,A)\) to enforce a single initial state. Moreover, to convert a transition relation to a set of transition functions, we introduce n new input variables \({{\varvec{y}}}\) for \(n = |{{\varvec{s}}}|\) and a new state variable z with initial value 1, and construct a new sequential circuit with

  • the output function: \(O_\textsc {IsEmp} = (O({{\varvec{s}}}) \wedge z)\), and

  • the next-state functions: \(\delta _i = (y_i)\) for state variables \(s_i\), \(i = 1, \ldots , n\), and \(\delta _{n+1} = (T({{\varvec{x}}},{{\varvec{s}}},{{\varvec{y}}}) \wedge z)\) for the state variable z.

Fig. 1(e) shows the corresponding circuit construction, where the rectangular boxes denote state-holding elements. With these conversions, PDR can be directly applied to the constructed circuit. The constructed circuit is of size \(O(|\mathcal {C}(A)|)\) and has \((|{{\varvec{s}}}|+1)\) state variables and \((|{{\varvec{x}}}|+|{{\varvec{y}}}|)\) input variables. Checking language emptiness is PSPACE-complete in the circuit size of the underlying automaton.
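The relation-to-function conversion can be simulated directly: the fresh inputs \({{\varvec{y}}}\) "guess" the next state, and the flag z (initialized to 1) records whether every guess so far obeyed T, so the output \(O({{\varvec{s}}}) \wedge z\) is raised only along genuine accepting runs. A minimal sketch with illustrative names (this simulates the constructed circuit; it is not ABC's interface):

```python
# Sketch of the relation-to-function conversion used for IsEmp: next-state
# functions simply copy the fresh inputs y, while flag z accumulates whether
# each guessed step satisfied the transition relation T.
def make_sequential_circuit(T, init_state, O):
    """T(x, s, y) -> bool; single initial state; O(s) -> bool."""
    def run(moves):                        # moves: sequence of (x, y) pairs
        s, z = init_state, True
        outputs = [O(s) and z]
        for x, y in moves:
            s, z = y, z and T(x, s, y)     # delta_i = y_i; delta_{n+1} = T AND z
            outputs.append(O(s) and z)
        return outputs
    return run

# Automaton accepting strings ending in 'b': state 1 iff the last input was 'b'.
T = lambda x, s, y: y == (1 if x == 'b' else 0)
run = make_sequential_circuit(T, 0, lambda s: s == 1)
```

Once an illegal guess is made, z stays 0 forever, so the output can never be raised on that run; a model checker searching for a raised output therefore explores exactly the legal traces.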

4 Counterexample Generation

The automata manipulation flow specified in a dependency graph often ends with an IsEmp query asking whether a vulnerability exists in the application under verification. If the answer to IsEmp is negative, it is desirable to generate a counterexample witnessing the vulnerability. Such a counterexample should be expressed in terms of the inputs to the application. However, since the counterexample to the IsEmp query is a trace demonstrating the reachability of an accepting state from an initial state in the final automaton, it does not directly correspond to a counterexample at the inputs. By counterexample generation, we compute counterexample traces at the inputs of a dependency graph that together induce a specific counterexample trace at the sink node. Prior automata-based methods cannot easily generate such counterexamples because the output automaton resulting from an automata operation does not contain information about its input automata, whereas our circuit construction preserves such information through the introduced auxiliary variables.

Below we show how to backtrack from the negative answer to IsEmp to extract the input counterexamples. The backtracking process traverses the dependency graph in reverse topological order and deduces the upstream counterexamples according to the corresponding operations, as detailed below. Notice that an automata circuit iteratively constructed by our method may contain internally quantified variables. These variables are treated as free variables in the PDR computation without explicit quantifier elimination, and their corresponding assignments are determined by PDR and returned along with the trace information.

Intersection. Let \((p_1, q_1), (\sigma _1, \rho _1, \varrho _1), (p_2, q_2), (\sigma _2, \rho _2, \varrho _2), \ldots , (p_\ell , q_\ell )\) be the counterexample trace of automaton \(A_\textsc {Int} = \textsc {Int}(A_1, A_2)\), where \(p_i \in [\![ {{\varvec{s}}}_1 ]\!], q_i \in [\![ {{\varvec{s}}}_2 ]\!], \sigma _i \in \varSigma \), \(\rho _i \in \varSigma ^{k}\), \(\varrho _i \in \varSigma ^{l}\), for some \(k, l \ge 0\) and \(({{\varvec{s}}}_1, {{\varvec{s}}}_2)\) being the state variables of \(A_\textsc {Int}\) as constructed in Sect. 3.1. Let the values \(\rho _i \in \varSigma ^k\) and \(\varrho _i \in \varSigma ^l\) correspond to the assignments to the internally quantified variables of \(A_1\) and \(A_2\), respectively. Then the counterexample traces of \(A_1\) and \(A_2\) can be extracted backward by the following rule.

[Extraction rule (figure omitted): project each product state onto its components, yielding the trace \(p_1, (\sigma _1, \rho _1), p_2, \ldots , p_\ell \) of \(A_1\) and the trace \(q_1, (\sigma _1, \varrho _1), q_2, \ldots , q_\ell \) of \(A_2\).]

Union. Let \((q_1, c), (\sigma _1, \rho _1), (q_2, c), (\sigma _2, \rho _2), \ldots , (q_\ell , c)\) be the counterexample trace of automaton \(A_\textsc {Uni} = \textsc {Uni}(A_1, A_2)\), where \(q_i \in [\![ {{\varvec{s}}}_2 ]\!]\), \(c \in [\![ \alpha ]\!]\), \(\sigma _i \in \varSigma \), and \(\rho _i \in \varSigma ^{k}\), for some \(k \ge 0\) and \({{\varvec{s}}}_2\) being the state variables of \(A_\textsc {Uni}\) as constructed in Sect. 3.2. Let the values \(\rho _i \in \varSigma ^k\) correspond to the assignments to the internally quantified variables of \(A_1\) or \(A_2\). Then the counterexample traces of \(A_1\) and \(A_2\) can be extracted backward by the following rules.

[Extraction rules (figure omitted): if \(c = 0\), the trace projects to a trace of \(A_1\); if \(c = 1\), it projects to a trace of \(A_2\).]

Concatenation. Let \((q_1, c_1)\), \((\sigma _1, \rho _1)\), \((q_2, c_2)\), \((\sigma _2, \rho _2)\), ..., \((q_\ell , c_\ell )\) be the counterexample trace of automaton \(A_\textsc {Cat} = \textsc {Cat}(A_1, A_2)\), where \(q_i \in [\![ {{\varvec{s}}}_2 ]\!]\), \(c_i \in [\![ \alpha ]\!]\), \(\sigma _i \in \varSigma \), and \(\rho _i \in \varSigma ^{k_i}\), for some \(k_i \ge 0\) and \(({{\varvec{s}}}_2, \alpha )\) being the state variables of \(A_\textsc {Cat}\) as constructed in Sect. 3.3. Let the values \(\rho _i \in \varSigma ^{k_i}\) correspond to the assignments to the internally quantified variables of \(A_1\) or \(A_2\). Then the counterexample traces of \(A_1\) and \(A_2\) can be extracted backward by the following rule.

[Extraction rule (figure omitted): the trace is rewritten as \(p_1, z_1, p_2, z_2, \ldots , p_\ell \) and split at the step i where the \(\alpha \) bit flips from 0 to 1,]

where each \(p_j=(q_j,0)\) for all \(j \le i\), \(p_j=(q_j,1)\) for all \(j \ge i+1\), and \(z_j=(\sigma _j, \rho _j)\) for all \(j \ne i\), and \(z_i=(\epsilon , \rho _i)\).
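Concretely, the split can be sketched as follows (explicit lists standing in for the trace returned by PDR; the \(\rho _i\) components are dropped for brevity, and the function name is illustrative):

```python
# Sketch of backward extraction for Cat: split the product trace at the single
# step where the alpha bit flips from 0 to 1; that step consumes epsilon and
# is dropped, leaving a trace of A1 (prefix) and a trace of A2 (suffix).
EPS = "eps"

def split_cat_trace(states, syms):
    """states: [(q, alpha), ...] of length n; syms: length n - 1."""
    i = next(k for k in range(len(states) - 1)
             if states[k][1] == 0 and states[k + 1][1] == 1)
    assert syms[i] == EPS                  # the flip step reads epsilon
    trace1 = ([q for q, _ in states[:i + 1]], syms[:i])
    trace2 = ([q for q, _ in states[i + 1:]], syms[i + 1:])
    return trace1, trace2

states = [(0, 0), (1, 0), (0, 1), (1, 1)]  # an A_Cat trace for "a" . "b"
syms = ['a', EPS, 'b']
t1, t2 = split_cat_trace(states, syms)
```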

Replacement. Let \((p_1,\) \(c_1,\) \(q_1,\) \(r_1,\) \(d_1)\), \((\sigma _1,\) \(\rho _1,\) \(\varrho _1)\), \(\ldots \), \((p_{n_1},\) \(c_{n_1},\) \(q_{n_1},\) \(r_{n_1},\) \(d_{n_1})\), \((\sigma _{n_1},\) \(\rho _{n_1},\) \(\varrho _{n_1})\), \((p_{n_1+1},\) \(c_{n_1+1},\) \(q_{n_1+1},\) \(r_{n_1+1},\) \(d_{n_1+1})\), \((\sigma _{n_1+1},\) \(\rho _{n_1+1},\) \(\varrho _{n_1+1})\), \(\ldots \), \((p_{n_2},\) \(c_{n_2},\) \(q_{n_2},\) \(r_{n_2},\) \(d_{n_2})\), \((\sigma _{n_2},\) \(\rho _{n_2},\) \(\varrho _{n_2})\), \((p_{n_2+1},\) \(c_{n_2+1},\) \(q_{n_2+1},\) \(r_{n_2+1},\) \(d_{n_2+1})\), \((\sigma _{n_2+1},\) \(\rho _{n_2+1},\) \(\varrho _{n_2+1})\), \(\ldots \), \((p_{\ell },\) \(c_{\ell },\) \(q_{\ell },\) \(r_{\ell },\) \(d_{\ell })\) be the counterexample trace of automaton \(A_\textsc {Rep}= \textsc {Rep}(A_1,A_2,A_3)\), where \(p_i \in [\![ {{\varvec{s}}}_1 ]\!]\), \(c_i \in [\![ \alpha ]\!]\), \(q_i \in [\![ {{\varvec{s}}}_4 ]\!]\), \(r_i \in [\![ {{\varvec{s}}}_3 ]\!]\), \(d_i \in [\![ \gamma ]\!]\), \(\sigma _i, \varrho _i \in \varSigma \), and \(\rho _i \in \varSigma ^k\), for some \(k \ge 0\) and \(({{\varvec{s}}}_1, \alpha , {{\varvec{s}}}_4, {{\varvec{s}}}_3, \gamma )\) being the state variables of \(A_{\textsc {Rep}}\) as constructed in Sect. 3.5. The trace must have the following form: Consider \((p_{n_{i}+1}, c_{n_{i}+1}, q_{n_{i}+1}, r_{n_{i}+1}, d_{n_{i}+1}), (\sigma _{n_{i}+1}, \rho _{n_{i}+1}, \varrho _{n_{i}+1}), \ldots , (p_{n_{i+1}}, c_{n_{i+1}}, q_{n_{i+1}}, r_{n_{i+1}}, d_{n_{i+1}})\). (Notice the subtle subscript difference between \({n_{i}+1}\) and \({n_{i+1}}\).) For \(i=3m\), we have \(c_j=0\), \(d_j=0\) for \(n_{i}+1 \le j \le n_{i+1}\), and \(\sigma _{n_{i}+1}\sigma _{n_{i}+2} \ldots \sigma _{n_{i+1}-1} \notin \varSigma ^*.\mathcal {L}(A_2).\varSigma ^*\). For \(i=3m+1\), we have \(c_j=0\), \(d_j=1\) for \(n_{i}+1 \le j \le n_{i+1}\), and \(\sigma _{n_{i}+1}\sigma _{n_{i}+2}\ldots \sigma _{n_{i+1}-1} \in \mathcal {L}(A_3)\). 
For \(i=3m+2\), we have \(c_j=1\), \(d_j=0\), \(\sigma _j=\epsilon \) for \(n_{i}+1 \le j \le n_{i+1}\), and \(\varrho _{n_{i}+1}\varrho _{n_{i}+2}\ldots \varrho _{n_{i+1}-1} \in \mathcal {L}(A_2)\). Also \(\sigma _{n_i} = \epsilon \) for all i.

Let the values \(\rho _i \in \varSigma ^k\) and \(\varrho _i \in \varSigma \) correspond to the assignments to the internally quantified variables of \(A_1\) and to the assignments to the internally quantified variables added in the construction of \(A_\textsc {Rep}\), respectively. Then the counterexample trace of \(A_1\) can be extracted backward by the following rule.

[Extraction rule (figure omitted): the trace, decomposed into segments \(\omega _i\) and separators \(z_i\), is mapped back to a trace of \(A_1\),]

where each \(\omega _i\) denotes the trace \((p_{n_{i-1}+1},\) \(c_{n_{i-1}+1},\) \(q_{n_{i-1}+1},\) \(r_{n_{i-1}+1},\) \(d_{n_{i-1}+1})\), \((\sigma _{n_{i-1}+1},\) \(\rho _{n_{i-1}+1},\) \(\varrho _{n_{i-1}+1})\), \(\ldots \), \((p_{n_{i}},\) \(c_{n_{i}},\) \(q_{n_{i}},\) \(r_{n_{i}},\) \(d_{n_{i}})\), each \(z_i\) denotes \((\epsilon ,\) \(\rho _{n_{i}},\) \(\varrho _{n_{i}})\), each \(\omega _i^\dag \) denotes the trace \(p_{n_{i-1}+1}\), \((\sigma _{n_{i-1}+1},\) \(\rho _{n_{i-1}+1})\), \(p_{n_{i-1}+2}\), \((\sigma _{n_{i-1}+2},\) \(\rho _{n_{i-1}+2})\), \(\ldots \), \(p_{n_{i}}\), and each \(\omega _i^\ddag \) denotes the trace \(p_{n_{i-1}+1}\), \((\varrho _{n_{i-1}+1},\) \(\rho _{n_{i-1}+1})\), \(p_{n_{i-1}+2}\), \((\varrho _{n_{i-1}+2},\) \(\rho _{n_{i-1}+2})\), \(\ldots \), \(p_{n_{i}}\). Also, for a trace \(\omega = p_1, \sigma _1, \ldots , p_i, \sigma _i, p_{i+1}\), we denote its tail-removed subtrace \(p_1, \sigma _1, \ldots , p_i, \sigma _i\) as \(\omega ^-\).

5 Filter Generation

In addition to counterexample generation, one may further generate filters (also called vulnerability signatures in [31]) to block malicious input strings from reaching the considered web application. By computing filters backward through the dependency graph, the filters for the input strings to an application can be obtained. The derived filters in our circuit representation are amenable to hardware or firmware implementation, supporting high-speed, low-power screening of malicious inputs to a web application. Notice that our circuit representation characterizes NFAs in general, and further determinization may be needed for a firmware or hardware implementation of filters. Although automata determinization can be costly, it is doable. Below we study how filter generation can be done under the proposed circuit representation.

First of all, the filter for the sink node of the dependency graph is available, assuming that sensitive strings to the underlying string manipulating program are known a priori. Moreover, consider an operator Op on a given set of input automata \(A_1\), ..., \(A_k\) yielding \(A = \textsc {Op}(A_1, \ldots , A_k)\). Let B be an automaton with its language \(\mathcal {L}(B)\subseteq \mathcal {L}(A)\) containing all illegal strings in \(\mathcal {L}(A)\). We intend to construct the filter automaton \(B_{i}\) for some \(i = 1, \ldots , k\) of concern such that \(\mathcal {L}(B_{i})\subseteq \mathcal {L}(A_i)\) and any \(\varvec{\sigma } \in \mathcal {L}(A_i)\) satisfies \((\mathcal {L}(\textsc {Op}(A_1, \ldots , A_{i-1}, A_{\varvec{\sigma }}, A_{i+1}, \ldots , A_k)) \cap \mathcal {L}(B)) = \emptyset \) if and only if \(\varvec{\sigma } \notin \mathcal {L}(B_{i})\), where \(A_{\varvec{\sigma }}\) denotes the automaton that accepts exactly the string \(\varvec{\sigma }\). Note that \(\mathcal {L}(B_i)\) satisfying the above condition is a minimal filter provided that the relation among the inputs of an automata operation is ignored: the condition guarantees that for each string in \(B_i\), there exists a set of strings in the other \(A_j\)’s, \(j\ne i\), such that applying Op to this string of \(B_i\) together with those strings of the \(A_j\)’s yields some string in B. When the relation among the inputs of Op is ignored, a string should be kept in the language of filter automaton \(B_i\) as long as it may possibly result in a string in B through Op. The different Op cases are detailed in the following.
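The minimality condition above can be illustrated on finite sets of strings standing in for automaton languages. The following sketch (hypothetical helper, not the circuit-level construction used in this paper) keeps a string of the i-th input exactly when some choice of strings from the other inputs maps it, through Op, into the illegal set:

```python
from itertools import product

def minimal_filter(op, langs, i, bad):
    """Finite-set sketch of the minimal-filter condition: keep a string
    of langs[i] iff some choice of strings from the other inputs yields,
    via op, a string in the illegal set bad.  op maps concrete argument
    strings to the (finite) set of result strings."""
    others = [langs[j] for j in range(len(langs)) if j != i]
    keep = set()
    for s in langs[i]:
        for combo in product(*others):
            args = list(combo)
            args.insert(i, s)        # place s in argument position i
            if op(*args) & bad:      # some illegal string is producible
                keep.add(s)
                break
    return keep
```

For concatenation, e.g., `op = lambda a, b: {a + b}`; with inputs `{"ab", "x"}` and `{"c", "d"}` and illegal set `{"abc"}`, only `"ab"` survives in the filter for the first input.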

Intersection. Given the filter automaton B for the automaton \(A = \textsc {Int}(A_1, A_2)\), the filter B can be directly applied as a filter for \(A_1\) as well as \(A_2\).

Union. Given the filter automaton B for \(A = \textsc {Uni}(A_1, A_2)\), observe that every string in \(\mathcal {L}(B)\) is in \(\mathcal {L}(A_1)\) or in \(\mathcal {L}(A_2)\). Hence automata \(B_{1} = \textsc {Int}(A_1, B)\) and \(B_{2} = \textsc {Int}(A_2, B)\) form legitimate filters for \(A_1\) and \(A_2\), respectively.
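On finite sets of strings standing in for automaton languages, the intersection and union cases reduce to plain set operations. A minimal sketch (finite-set model only, not the circuit construction):

```python
def filter_intersection(B, A1, A2):
    """For A = Int(A1, A2): B itself is a legitimate filter for both
    inputs, since L(B) is contained in both L(A1) and L(A2)."""
    return B, B

def filter_union(B, A1, A2):
    """For A = Uni(A1, A2): every illegal string stems from A1 or A2,
    so intersecting B with each input yields its filter."""
    return A1 & B, A2 & B
```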

Concatenation. Given the filter automaton B for \(A = \textsc {Cat}(A_1, A_2)\), to generate the corresponding filters \(B_{1}\) and \(B_{2}\) for \(A_1\) and \(A_2\), respectively, we first construct \(B^\dag = \textsc {Int}(A, B)\). Clearly, \(\mathcal {L}(B^\dag )\) equals \(\mathcal {L}(B)\) because \(\mathcal {L}(B) \subseteq \mathcal {L}(A)\). By the circuit construction of A, the auxiliary state variable \(\alpha \) distinguishes between the substrings from \(\mathcal {L}(A_1)\) and the substrings from \(\mathcal {L}(A_2)\). As this information may not be seen in B, the purpose of this intersection is to identify the separation points between the two substring sources. Let \(B_1\) be a copy of \(B^\dag \) but with the input symbol on every transition between states of \(\alpha =1\) replaced with \(\epsilon \). Consider a trace \((q_1, c_1)\), \(\sigma _1\), \(\ldots \), \((q_i, c_i)\), \(\epsilon \), \((q_{i+1}, c_{i+1})\), \(\epsilon \), \(\ldots \), \((q_\ell , c_\ell )\) accepted by \(B_1\), where \((q_j,c_j)\in [\![ {{\varvec{s}}} ]\!]\) for \({{\varvec{s}}}\) being the state variables of \(B_1\), and \(c_j \in [\![ \alpha ]\!]\) with \(c_j = 0\) for \(j \le i\) and \(c_j = 1\) for \(j \ge i+1\). By the construction of \(B_1\), there should be a trace \((q_1, c_1)\), \(\sigma _1\), \(\ldots \), \((q_i, c_i)\), \(\epsilon \), \((q_{i+1}, c_{i+1})\), \(\sigma _{i+1}\), \(\ldots \), \((q_\ell , c_\ell )\) accepted by \(B^\dag \). The existence of such a trace ensures \(\sigma _1\sigma _2\ldots \sigma _{i-1} \in \mathcal {L}(A_1)\), \(\sigma _{i+1}\sigma _{i+2}\ldots \sigma _{\ell -1} \in \mathcal {L}(A_2)\), and \(\sigma _1 \sigma _2 \ldots \sigma _{\ell -1} \in \mathcal {L}(B)\). The above trace accepted by \(B^\dag \) also ensures that, for each string \(\varvec{\sigma }\), \(\varvec{\sigma }\in \mathcal {L}(B_1)\) if and only if there exists another string \(\varvec{\rho } \in \mathcal {L}(A_2)\) such that \(\varvec{\sigma }.\varvec{\rho } \in \mathcal {L}(B)\).
So \(B_1\) forms a legitimate filter for \(A_1\). Similarly, let \(B_2\) be a copy of \(B^\dag \) but with the input symbol on every transition between states of \(\alpha =0\) being replaced with \(\epsilon \). Then \(B_2\) forms a legitimate filter for \(A_2\).
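The effect of the \(\epsilon \)-substitution on \(B^\dag \) can be mimicked on finite sets of strings: \(B_1\) keeps exactly the \(A_1\)-strings that some \(A_2\)-suffix extends into \(\mathcal {L}(B)\), and symmetrically for \(B_2\). A minimal sketch (finite string sets in place of automata):

```python
def filter_concat(B, A1, A2):
    """Finite-set sketch of the concatenation filters: B1 keeps the
    A1-strings extendable by some A2-string into B; B2 keeps the
    A2-strings occurring as such a suffix for some A1-string."""
    B1 = {s for s in A1 if any(s + t in B for t in A2)}
    B2 = {t for t in A2 if any(s + t in B for s in A1)}
    return B1, B2
```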

Replacement. Given the filter automaton B for \(A = \textsc {Rep}(A_1, A_2, A_3)\), to generate the filter \(B_1\) for automaton \(A_1\), observe that each string in \(\mathcal {L}(B)\) has the form \(\varvec{\sigma }_1 \varvec{\tau }_1 \varvec{\sigma }_2 \varvec{\tau }_2 \ldots \varvec{\sigma }_{\ell }\), where \(\varvec{\sigma }_i \in \overline{\varSigma ^*. \mathcal {L}(A_2). \varSigma ^*}\) and \(\varvec{\tau }_i \in \mathcal {L}(A_3)\) for \(i = 1, \ldots , \ell \). We recognize each \(\varvec{\tau }_i\) and replace it with some string \(\varvec{\rho }_i \in \mathcal {L}(A_2)\). We then remove from the resultant language those strings not in \(\mathcal {L}(A_1)\) by intersecting it with \(A_1\). Therefore, \(B_1\) can be constructed as follows.

First, similar to the construction of \(A_1^{\triangleleft \triangleright }\) in Sect. 3.5, we build automaton \(B^{\triangleleft \triangleright }\), which parenthesizes any substrings of a string in \(\mathcal {L}(B)\). Second, similar to the construction of \(A_4\) in Sect. 3.5, we build automaton \(B_4\), which accepts the strings \(\{(\varvec{\sigma }_1. \lhd . \varvec{\tau }_1. \rhd . \varvec{\sigma }_2. \lhd . \varvec{\tau }_2. \rhd \ldots )\in \varSigma ^*\ |\ \varvec{\sigma }_i \in \overline{\varSigma ^*. \mathcal {L}(A_2). \varSigma ^*}\) and \(\varvec{\tau }_i \in \mathcal {L}(A_3)\}\). Third, let \(B_5 = \textsc {Int}(B^{\triangleleft \triangleright }, B_4)\). Hence \(\mathcal {L}(B_5)=\{(\varvec{\sigma }_1.\lhd .\varvec{\tau }_1.\rhd .\varvec{\sigma }_2.\lhd .\varvec{\tau }_2.\rhd \ldots )\in \varSigma ^*\ |\ (\varvec{\sigma }_1. \varvec{\tau }_1. \varvec{\sigma }_2. \varvec{\tau }_2 \ldots ) \in \mathcal {L}(B)\) and \(\varvec{\sigma }_i \in \overline{\varSigma ^*.\mathcal {L}(A_2).\varSigma ^*}\) and \(\varvec{\tau }_i \in \mathcal {L}(A_3)\}\). Then, in \(\mathcal {L}(B_5)\), instead of replacing substrings \(\lhd \mathcal {L}(A_3) \rhd \) with strings in \(\mathcal {L}(A_2)\), we replace \(\lhd \) with \(\mathcal {L}(A_2)\), \(\rhd \) with \(\epsilon \), and \(\mathcal {L}(A_3)\) with \(\epsilon \). Let the resultant automaton be \(B_1^\dag \). Finally, \(B_1 = \textsc {Int}(B_1^\dag , A_1)\) forms a legitimate filter for \(A_1\).

The fact that \(B_1\) is a legitimate filter for \(A_1\) can be shown as follows. Consider a string \(\varvec{\sigma }=\varvec{\sigma }_1.\varvec{\rho }_1.\varvec{\sigma }_2.\varvec{\rho }_2 \ldots \notin \mathcal {L}(B_1)\), where \(\varvec{\sigma }_i \notin \varSigma ^*.\mathcal {L}(A_2).\varSigma ^*\) and \(\varvec{\rho }_i \in \mathcal {L}(A_2)\). Also consider another string \(\varvec{\sigma }_1.\varvec{\tau }_1.\varvec{\sigma }_2.\varvec{\tau }_2\ldots \) obtained by replacing each \(\varvec{\rho }_i\) with some \(\varvec{\tau }_i\in \mathcal {L}(A_3)\). If \(\varvec{\sigma }_1.\varvec{\tau }_1.\varvec{\sigma }_2.\varvec{\tau }_2\ldots \in \mathcal {L}(B)\), then we have \(\varvec{\sigma }_1.\lhd .\varvec{\tau }_1.\rhd .\varvec{\sigma }_2.\lhd .\varvec{\tau }_2.\rhd \ldots \in \mathcal {L}(B^{\triangleleft \triangleright })\). It is easy to see that \(\varvec{\sigma }_1.\lhd .\varvec{\tau }_1.\rhd .\varvec{\sigma }_2.\lhd .\varvec{\tau }_2.\rhd \ldots \in \mathcal {L}(B_5)\). Finally, for each \(\lhd .\varvec{\tau }_i. \rhd \), replacing \(\lhd \) with \(\varvec{\rho }_i\), replacing \(\varvec{\tau }_i\) with \(\epsilon \), and replacing \(\rhd \) with \(\epsilon \) yields \(\varvec{\sigma }_1.\varvec{\rho }_1.\varvec{\sigma }_2.\varvec{\rho }_2 \ldots \in \mathcal {L}(B_1)\), which contradicts the assumption \(\varvec{\sigma }_1.\varvec{\rho }_1.\varvec{\sigma }_2.\varvec{\rho }_2 \ldots \notin \mathcal {L}(B_1)\). So we have \(\mathcal {L}(\textsc {Rep}(A_{\varvec{\sigma }}, A_2, A_3)) \cap \mathcal {L}(B)=\emptyset \) for any string \(\varvec{\sigma } \notin \mathcal {L}(B_1)\). Similarly, consider a string \(\varvec{\sigma }=\varvec{\sigma }_1.\varvec{\rho }_1.\varvec{\sigma }_2.\varvec{\rho }_2 \ldots \in \mathcal {L}(B_1)\), where \(\varvec{\sigma }_i \notin \varSigma ^*.\mathcal {L}(A_2).\varSigma ^*\) and \(\varvec{\rho }_i \in \mathcal {L}(A_2)\). Then it is in \(\mathcal {L}(B_1^\dag )\).
By the construction of \(B_1^\dag \), there should be another string \(\varvec{\sigma }_1.\lhd .\varvec{\tau }_1.\rhd .\varvec{\sigma }_2.\lhd .\varvec{\tau }_2.\rhd \ldots \in \mathcal {L}(B_5)\), where \(\varvec{\tau }_i \in \mathcal {L}(A_3)\). We have \(\varvec{\sigma }_1.\lhd .\varvec{\tau }_1.\rhd .\varvec{\sigma }_2.\lhd .\varvec{\tau }_2.\rhd \ldots \in \mathcal {L}(B^{\triangleleft \triangleright })\), and hence \(\varvec{\sigma }_1.\varvec{\tau }_1.\varvec{\sigma }_2.\varvec{\tau }_2 \ldots \in \mathcal {L}(B)\). It is easy to see that \(\varvec{\sigma }_1.\varvec{\tau }_1.\varvec{\sigma }_2.\varvec{\tau }_2 \ldots \in \mathcal {L}(\textsc {Rep}(A_{\varvec{\sigma }}, A_2, A_3))\), which means \(\mathcal {L}(\textsc {Rep}(A_{\varvec{\sigma }}, A_2, A_3)) \cap \mathcal {L}(B) \ne \emptyset \). Consequently \(B_1\) characterizes the desired language.
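The rewrite-then-intersect idea can be illustrated on finite sets of strings. The sketch below is deliberately simplified: it assumes each illegal string contains at most one replaced segment, whereas the parenthesization automata above handle arbitrarily many. All names are hypothetical; this is a finite-set model, not the circuit construction.

```python
def filter_replacement(B, A1, A2, A3):
    """Very rough finite-set sketch of B1 for A = Rep(A1, A2, A3):
    rewrite an occurrence of some tau in L(A3) inside an illegal
    string back to each rho in L(A2), then keep only the results
    lying in L(A1).  Assumes at most one replaced segment per string,
    unlike the general construction."""
    candidates = set()
    for b in B:
        for tau in A3:
            i = b.find(tau)
            if i < 0:
                continue
            for rho in A2:
                # undo the replacement: tau came from some rho
                candidates.add(b[:i] + rho + b[i + len(tau):])
    return candidates & A1
```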

6 Extension to Symbolic Finite Automata

Symbolic finite automata (SFA) [26] extend conventional finite automata by allowing transition conditions to be specified in terms of predicates over a Boolean algebra with a potentially infinite domain. Formally, an SFA A is a 5-tuple \((Q, \mathcal {D}, I, \varDelta , O)\), where Q is a finite set of states, \(\mathcal {D}\) is the designated domain, \(I\subseteq Q\) is the set of initial states (here we allow multiple initial states in contrast to the standard single-initial-state assumption of SFA), \(\varDelta \subseteq Q \times \varPsi \times Q\) is the move relation for \(\varPsi \) being the set of all quantifier-free formulas with at most one free variable, say \(\chi \), over a Boolean algebra of domain \(\mathcal {D}\), and \(O \subseteq Q\) is the set of accepting states. We assume \(\epsilon \) transitions are allowed and properly encoded in \(\varDelta \) of an SFA. Since \(\mathcal {D}\) may not be bounded, a predicate logic formula over variable \(\chi \) cannot in general be represented with logic circuits. We separate predicates from the logic circuit representation of an SFA by abstracting each formula \(\psi \) appearing in \(\varDelta \) with its designated propositional variable \(x_\psi \). Let \([\![ \psi ]\!]\) be extended to denote the set of solution values of \(\chi \) satisfying \(\psi \). Then the move relation of an SFA can be expressed with a transition relation

$$\begin{aligned} T({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}') = \bigvee _{(p, \psi , q)\in \varDelta }(x_\psi \wedge ({{\varvec{s}}}=p) \wedge ({{\varvec{s}}}'=q)) \end{aligned}$$

and a predicate relation

$$\begin{aligned} P({{\varvec{x}}},\chi ) = \bigwedge _{(p, \psi , q)\in \varDelta }(x_\psi \leftrightarrow (\chi \in [\![ \psi ]\!])). \end{aligned}$$

Therefore we can represent an SFA A with four characteristic functions I, O, T, and P.
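The abstraction can be sketched concretely: each predicate \(\psi \) gets a propositional variable \(x_\psi \), T operates purely on these variables, and P ties them back to the concrete character \(\chi \). A toy example (hypothetical predicates and interface, not SLOG's implementation):

```python
# Toy SFA over characters: moves labeled by predicates on chi.
moves = [  # (state p, predicate name, predicate psi, state q)
    (0, "is_digit", lambda c: c.isdigit(), 1),
    (1, "is_alpha", lambda c: c.isalpha(), 1),
]

def T(x, s, s2):
    """Transition relation over the abstracted predicate variables x."""
    return any(x[name] and s == p and s2 == q for (p, name, _, q) in moves)

def P(x, chi):
    """Predicate relation: x_psi holds iff psi(chi) holds."""
    return all(x[name] == bool(pred(chi)) for (_, name, pred, _) in moves)

def step(s, chi):
    """Concrete successors of state s on character chi: all s' such
    that P(x, chi) and T(x, s, s') for the induced assignment x."""
    x = {name: bool(pred(chi)) for (_, name, pred, _) in moves}
    return {q for (p, name, _, q) in moves if p == s and x[name]}
```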

With the above construction, our circuit constructions of Sect. 3 naturally extend to SFA except that the predicate relation has to be additionally handled as follows. For SFA \(A_\textsc {Int} = \textsc {Int}(A_1, A_2)\), the predicate relation

$$\begin{aligned} P_\textsc {Int}({{\varvec{x}}}, \chi ) = P_1({{\varvec{x}}}_1, \chi ) \wedge P_2({{\varvec{x}}}_2, \chi ) \wedge (x_{\chi =\epsilon } \leftrightarrow \chi =\epsilon ), \end{aligned}$$

for \({{\varvec{x}}}=({{\varvec{x}}}_1, {{\varvec{x}}}_2, x_{\chi =\epsilon })\).

For SFA \(A_\textsc {Uni} = \textsc {Uni}(A_1, A_2)\), the predicate relation

$$\begin{aligned} P_\textsc {Uni}({{\varvec{x}}}, \chi ) = P_1({{\varvec{x}}}_1, \chi ) \wedge P_2({{\varvec{x}}}_2, \chi ), \end{aligned}$$

for \({{\varvec{x}}}=({{\varvec{x}}}_1, {{\varvec{x}}}_2)\).

For SFA \(A_\textsc {Cat} = \textsc {Cat}(A_1, A_2)\), the predicate relation

$$\begin{aligned} P_\textsc {Cat}({{\varvec{x}}}, \chi ) = P_1({{\varvec{x}}}_1, \chi ) \wedge P_2({{\varvec{x}}}_2, \chi ) \wedge (x_{\chi =\epsilon } \leftrightarrow \chi =\epsilon ), \end{aligned}$$

for \({{\varvec{x}}}=({{\varvec{x}}}_1, {{\varvec{x}}}_2, x_{\chi =\epsilon })\).

For SFA \(A_\textsc {Rep} = \textsc {Rep}(A_1, A_2, A_3)\), we construct the predicate relation for \(A_{\textsc {Rep}}\) as follows. The predicate relation of SFA \(A_1^{\triangleleft \triangleright }\) is first obtained from \(A_1\) by

$$\begin{aligned} P_{1}^{\triangleleft \triangleright }({{\varvec{x}}}_1^{\triangleleft \triangleright }, \chi )= & {} P_1({{\varvec{x}}}_1, \chi ) \wedge (x_{\chi =\lhd }\leftrightarrow (\chi =\lhd )) \wedge (x_{\chi =\rhd } \leftrightarrow (\chi =\rhd )) \wedge \\&(x_{\chi \ne \lhd }\leftrightarrow (\chi \ne \lhd )) \wedge (x_{\chi \ne \rhd } \leftrightarrow (\chi \ne \rhd )), \end{aligned}$$

for \({{\varvec{x}}}_1^{\triangleleft \triangleright }=({{\varvec{x}}}_1, x_{\chi =\lhd }, x_{\chi =\rhd }, x_{\chi \ne \lhd }, x_{\chi \ne \rhd })\). Then the predicate relation of \(A_4\) is constructed from those of \(A_2\) and \(A_h\) by

$$\begin{aligned} P_{4}({{\varvec{x}}}_4, \chi )= & {} P_2({{\varvec{x}}}_2, \chi ) \wedge P_h({{\varvec{x}}}_h, \chi ) \wedge (x_{\chi =\lhd } \leftrightarrow \chi =\lhd ) \wedge (x_{\chi =\rhd } \leftrightarrow \chi =\rhd ) \wedge \\&(x_{\chi \ne \lhd }\leftrightarrow \chi \ne \lhd ) \wedge (x_{\chi \ne \rhd } \leftrightarrow \chi \ne \rhd ), \end{aligned}$$

for \({{\varvec{x}}}_4=({{\varvec{x}}}_2, {{\varvec{x}}}_h, x_{\chi =\lhd }, x_{\chi =\rhd }, x_{\chi \ne \lhd }, x_{\chi \ne \rhd })\). Then the predicate relation of \(A_5\) is obtained by

$$\begin{aligned} P_5({{\varvec{x}}}_5, \chi ) = P_1^{\triangleleft \triangleright }({{\varvec{x}}}_1^{\triangleleft \triangleright }, \chi ) \wedge P_4({{\varvec{x}}}_4, \chi ) \wedge (x_{\chi =\epsilon } \leftrightarrow \chi =\epsilon ), \end{aligned}$$

for \({{\varvec{x}}}_5=({{\varvec{x}}}_1^{\triangleleft \triangleright }, {{\varvec{x}}}_4, x_{\chi =\epsilon })\). Finally, the transition and predicate relations of SFA \(A_\textsc {Rep}\) can be obtained by

$$\begin{aligned} T_\textsc {Rep}({{\varvec{x}}}, {{\varvec{s}}}, {{\varvec{s}}}')= & {} (\lnot \alpha \wedge \lnot \alpha ' \wedge T_5({{\varvec{x}}}_5, {{\varvec{s}}}_5, {{\varvec{s}}}_5') \wedge \lnot \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3') ) \vee \\&(\lnot \alpha \wedge \lnot \alpha ' \wedge ({{\varvec{s}}}_5={{\varvec{s}}}_5') \wedge (x_{\chi = \epsilon }) \wedge \lnot \gamma \wedge \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3') ) \vee \\&(\lnot \alpha \wedge \lnot \alpha ' \wedge ({{\varvec{s}}}_5={{\varvec{s}}}_5') \wedge \gamma \wedge \gamma ' \wedge T_3({{\varvec{x}}}_3, {{\varvec{s}}}_3, {{\varvec{s}}}_3') ) \vee \\&(\lnot \alpha \wedge \alpha ' \wedge T_5({{\varvec{x}}}_5, {{\varvec{s}}}_5, {{\varvec{s}}}_5')|_{\varDelta [\chi /\lhd ]} \wedge \gamma \wedge \lnot \gamma ' \wedge O_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3') \wedge (x_{\chi = \epsilon }) ) \vee \\&(\alpha \wedge \alpha ' \wedge T_5({{\varvec{y}}}, {{\varvec{s}}}_5, {{\varvec{s}}}_5') \wedge (x_{\chi =\epsilon }) \wedge \lnot \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3') ) \vee \\&(\alpha \wedge \lnot \alpha ' \wedge T_5({{\varvec{x}}}_5, {{\varvec{s}}}_5, {{\varvec{s}}}_5')|_{\varDelta [\chi /\rhd ]} \wedge \lnot \gamma \wedge \lnot \gamma ' \wedge I_3({{\varvec{s}}}_3) \wedge I_3({{\varvec{s}}}_3') \wedge (x_{\chi =\epsilon }) ), \end{aligned}$$
$$ P_\textsc {Rep}({{\varvec{x}}},\chi , {{\varvec{y}}}, \chi ^\dag ) = P_5({{\varvec{x}}}_5,\chi ) \wedge P_3({{\varvec{x}}}_3,\chi ) \wedge P_5({{\varvec{y}}},\chi ^\dag ), $$

where \({{\varvec{x}}}=({{\varvec{x}}}_5, {{\varvec{x}}}_3)\), \({{\varvec{y}}}\) is a set of newly introduced propositional variables playing the role of \({{\varvec{x}}}_5\) in \(T_5\) and \(P_5\), \(\chi ^\dag \) is a newly introduced variable playing the role of \(\chi \), both serving existential quantification, and \(T|_{\varDelta [\chi /a]}\) denotes the transition relation T obtained under the modified move relation \(\varDelta \) in which variable \(\chi \) is substituted with symbol a. (Here we avoid existentially quantifying out \({{\varvec{y}}}\) and \(\chi ^\dag \) by treating them as free variables.)

For emptiness checking of an SFA, we can treat the SFA as an infinite state transition system by considering \((\chi , {{\varvec{x}}}, {{\varvec{s}}})\) as the state variables. Let the transition relation be the conjunction of T and P, and let I and O be the initial and accepting state conditions, respectively, of the infinite state transition system. Then the model checking method [9], effectively PDR modulo theories, can be applied for reachability analysis.
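For intuition, when the domain is finite (or finitely sampled) the reachability question degenerates to explicit graph search. The sketch below replaces the solver-based PDR search of [9] with plain BFS over states; the `step` interface enumerating SFA successors is a hypothetical stand-in.

```python
from collections import deque

def sfa_nonempty(I, O, step, alphabet_samples):
    """Sketch: emptiness check by explicit forward reachability.
    I and O are sets of initial and accepting states; step(s, c)
    returns the successors of state s on character c.  Uses a finite
    sample of the (possibly infinite) domain in place of the
    solver-based search applied to the infinite-state system."""
    seen, work = set(I), deque(I)
    while work:
        s = work.popleft()
        if s in O:             # an accepting state is reachable
            return True
        for c in alphabet_samples:
            for s2 in step(s, c):
                if s2 not in seen:
                    seen.add(s2)
                    work.append(s2)
    return False               # no accepting state reachable
```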

7 Experimental Evaluation

Our tool, named SLOG, was implemented in the C language under the Berkeley logic synthesis and verification system ABC [6]. The experiments were conducted on a machine with an Intel Xeon(R) 8-core CPU and 16 GB memory under the Ubuntu 12.04 LTS operating system.

We compared SLOG against other modern constraint solvers, CVC4 [3], Norn [1], and Z3-str2 [34], and string analysis tools, JSA [8] and Stranger [30]. For the experiments, 20386 string analysis instances were generated from real web applications via Stranger [30]. The web applications include Moodle, PHP-Fusion, etc., and the instances test for vulnerabilities such as SQL injection, cross-site scripting (XSS), etc. Each instance corresponds to an acyclic dependency graph of a sink node in the program that consists of union, concatenation, and replacement operations. For each instance, we generated the string constraint that checks whether the dependency graph is vulnerable with respect to an attack pattern. String constraints were generated in the SMT-LIB format for CVC4, Norn, and Z3-str2, and in the Java-program format for JSA.

Table 1. Statistics of solver performance

The statistics of the benchmark instances are as follows. There are 85919 concatenation operations in total, distributed over 18898 instances, 510 string replacement operations in 255 instances, and 25160 union operations in 5109 instances. All of these 20386 instances end with a membership check determining whether an attack string can reach the sink node. All the solvers except Norn, which does not support the replacement operation, fully support these string operations. Timeout limits of 300 and 9000 s were set for small and large instances, respectively. An instance with fewer (resp. no fewer) than 100 concatenation operations is classified as small (resp. large).

The results of the solvers on the total 20386 instances are shown in Table 1, where #SAT, #UNS, #TO, #FL, and #Run denote the numbers of solved SAT, solved UNSAT, timeout, failed (with unexpected termination), and checked instances, respectively. The total runtimes for SAT and UNSAT instances are also shown in the table. Solvers SLOG, Stranger, CVC4, JSA, and Z3-str2 checked all 20386 instances (runs) with success rates of 100 %, 100 %, 93.12 %, 99.98 %, and 77.60 %, respectively; Norn checked 20131 instances with a success rate of 82.17 %, without running the 255 instances with replacement operations.

To evaluate solver performance on instances of different sizes, we classify the 20386 instances into three groups: the replacement-free small ones (with fewer than 100 concatenations and without replacement operations), the replacement-free large ones (with no fewer than 100 concatenations and without replacement operations), and the ones with replacement operations. By the classification, there are 20091 replacement-free small instances, 40 replacement-free large instances, and 255 instances with replacement operations. Note that the replacement-free large instances also have a large number of union operations.

Fig. 2. Accumulated solving time for (a) replacement-free small instances, (b) replacement-free large instances, and (c) instances with replacement operations.

For the replacement-free small instances (under a 300-s timeout limit), the performance of the solvers is shown in Fig. 2(a), where the x-axis is indexed by the number of solved instances, sorted by their runtimes in ascending order for each solver, and the y-axis is indexed by the accumulated runtime in seconds. As shown in Fig. 2(a), SLOG successfully solves 20054 cases in 137670 s (with 37 timeout cases), outperforming Z3-str2 (13943 cases in 1399712 s), CVC4 (18829 cases in 5555 s), and Norn (16542 cases in 33969 s) in the number of solved cases. In contrast, Stranger (20091 cases in 10590 s) and JSA (20087 cases in 15336 s) outperform SLOG on almost all the cases. For the replacement-free large instances (under a 9000-s timeout limit), both Z3-str2 and Norn failed to solve any instance due to timeout. As seen from Fig. 2(b), SLOG solved most of the large cases (with an average of 1750 s per case), while CVC4 solved fewer than half of the instances (19 out of 40) but took less time on the instances it solved. Stranger and JSA outperform SLOG and the SMT-based solvers, solving all 40 cases in less time. For the instances with replacement operations (under a 300-s timeout limit), all solvers except Norn are applicable. Figure 2(c) shows that the relative performances of the solvers are similar to those in the other two instance groups. That Stranger outperforms SLOG might be explained by the fact that the emptiness checking of a sink automaton in Stranger takes constant time (due to the canonicity of state-minimized DFAs), while that in SLOG requires reachability analysis. Therefore, as long as Stranger succeeds in building the sink automaton, it is likely to outperform SLOG.

With the auxiliary variables and other information embedded in the circuit construction, SLOG can generate counterexamples. We applied SLOG to find witnesses of all 8684 vulnerable instances. It took 524 s in total to generate counterexamples for all 8684 instances, only a small fraction of the total constraint-solving time of 65915 s. The high efficiency of counterexample generation in SLOG can be attributed to the fact that the assignments to the internally quantified variables in our circuit construction are already computed by PDR; there is no need to re-derive them when generating counterexample traces by the rules of Sect. 4 (Table 2).

Table 2. SLOG performance on counterexample generation

In summary, SLOG performed the best among the solvers with counterexample generation capability, namely CVC4, Z3-str2, and Norn. In fact, a significant portion of the runtime spent by SLOG is on running PDR for language emptiness checking. Although Stranger and JSA performed better than SLOG in runtime, both are incapable of producing, as a witness, the input-node values that lead to a specific attack string at the sink node.

To justify that our circuit-based method can be more scalable than BDD-based methods for representing automata with large alphabets, consider the automata over alphabet \(\varSigma \times \varSigma \) with \(|\varSigma | = 2^n\) accepting the language \((a,a)^*\) for \(a \in \varSigma \). The automata have a linear O(n) AIG representation (\(4n+1\) gates), but an exponential \(O(2^n)\) BDD representation (e.g., 46 BDD nodes for \(n=4\), 766 nodes for \(n=8\), and 196606 nodes for \(n=16\)) in MONA [7], which is used by Stranger. Although a good BDD variable ordering exists to reduce the BDD growth rate to linear in this example, a good variable ordering can be hard to find and may not even exist in general. In addition, because SLOG represents NFAs instead of DFAs, it may avoid costly subset construction and can be more compact than (DFA-based) Stranger.
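The linear growth can be sanity-checked by counting gates for one concrete AIG encoding of the equality condition on two n-bit symbols: one XNOR per bit plus a conjunction tree. The exact constant depends on the encoding chosen (the \(4n+1\) figure above includes additional bookkeeping); the count below is an assumption-laden approximation for a common encoding, not ABC's actual gate count.

```python
def equality_aig_size(n):
    """AND-gate count for checking equality of two n-bit symbols in a
    simple AIG encoding: XNOR(a_i, b_i) = AND(NAND(a_i, b_i),
    NAND(!a_i, !b_i)) costs 3 AND gates per bit (inverters are free
    edge attributes in an AIG); the conjunction tree over the n
    per-bit results adds n - 1 more.  Linear in n, in contrast to the
    O(2^n) BDD under an unfavorable variable ordering."""
    return 3 * n + (n - 1)
```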

8 Discussions

While SLOG demonstrates its ability on string constraint solving and counterexample generation by taking advantage of a circuit-based NFA representation, it should be noted that the compared string analysis tools have varied focuses and expressiveness in specifying (non)string constraints. CVC4 [3] is an SMT-based solver that supports many-sorted first-order logic. Norn [1] is another SMT-based string constraint solver that employs Craig interpolation to handle word equations over (unbounded length) string variables, constraints on string length, and regular language membership constraints. Z3-str2 [34] is a string theory plug-in built upon the SMT solver Z3 [22]. These string solvers address string constraints with lengths and can generate witnesses for satisfiable constraints. In the experimental evaluation, we did not consider length constraints when generating dependency graphs. String constraints with lengths are not currently supported by SLOG, although the circuit-based representation could be extended to model arithmetic automata for automata-based string-length constraint solving [2, 33]. JSA is an explicit-automata tool for analyzing the flow of strings and string operations in Java programs. Stranger is an MTBDD-based automata library for symbolic string analysis, which can be used to solve string constraints and compute pre- and post-images of string manipulation operations. JSA employs grammatical string analysis with regular language approximation and incorporates finite state transducers to support language-based replacement operations, while Stranger conducts forward and backward reachability analysis of string manipulation programs along with DFA constructions for language operations. In the evaluation, we did not conduct analysis on cyclic dependency graphs, which can be analyzed with JSA and Stranger.
Conducting fixpoint computation on cyclic dependency graphs may require an efficient complementation operation on our circuit-based NFA representation, which is not currently supported by SLOG.

9 Conclusions

We have presented a circuit-based NFA manipulation package for string analysis. Compared to BDD-based methods of automata representation, our circuit-based representation is scalable to automata with large alphabets. Our method avoids costly determinization whenever possible. It supports both counterexample generation and filter synthesis. In addition, extension to symbolic finite automata has been shown. Experiments have shown the unique benefits of our method. For future work, it would be interesting to explore the usage of SLOG as a string analysis engine in SMT solvers.