
1 Introduction

Regex Matching Regular expressions (regexes) are a fundamental formalism for various pattern-matching tasks. Many regex matching implementations, however, suffer from occasional super-linear growth of their execution time. Such excessive execution time can be exploited for DoS attacks—this is a vulnerability called regex denial of service (ReDoS). ReDoS is recognized as a significant security concern in many real-world systems, especially web services such as Stack Overflow and Cloudflare (see §2.4 for more details).

Need for Efficient Backtracking Regex Matching The principal cause of ReDoS is catastrophic backtracking, that is, the explosion of recursion in a backtracking-based matching algorithm.

In regex matching, in general, a regex r is converted into a non-deterministic finite automaton (NFA) \(\mathcal {A}\), and the latter is executed for an input string w. The non-determinism of \(\mathcal {A}\) can be resolved in either a depth-first or a breadth-first manner. The former is called backtracking regex matching; the latter is called the on-the-fly DFA construction.

Catastrophic backtracking and ReDoS are phenomena unique to the former (i.e., backtracking)—as is well-known, the time complexity of the on-the-fly DFA construction is linear (i.e., O(|w|)). Indeed, many modern regex implementations are based on the on-the-fly DFA construction, including RE2, Go’s regexp, and Rust’s regex.

It is practically essential, however, to make backtracking regex matching more efficient. A principal reason is consistency. Most existing regex matching implementations use backtracking, and they return only one matching position out of many (see §2.3). While it is possible to replace them with on-the-fly DFA matching, it is non-trivial to ensure consistency, that is, that the chosen matching position is the same as the one chosen by the original backtracking implementation. (.NET’s regex implementation has a linear-time backend based on derivatives that is consistent with its backtracking backend; still, it does not support look-around or atomic grouping [28].) Once the returned matching position changes, it can unexpectedly affect the behavior of all the systems (e.g., web services) that use regex matching.

Another reason for improving backtracking regex matching is its extensibility. There are many extensions of regexes widely used—such as the ones we study, namely look-around and atomic grouping—and they are supported by few on-the-fly DFA matching implementations.

Existing Work: Linear-time Backtracking Matching with Memoization Memoization is a well-known technique for speeding up recursive computations. The recent work [10] shows that memoization can be applied to backtracking regex matching with consistency in mind. Specifically, the work [10] presents a backtracking matching algorithm that runs in O(|w|) time—thus, it is theoretically guaranteed to avoid catastrophic backtracking—for regexes without extensions. (They also mention application to extended regexes in [10], but we found issues in their discussion—see Remark 2).

Our Contribution: Linear-time Backtracking Matching for Some Extended Regexes In this paper, we present a linear-time backtracking matching algorithm for regexes with look-around and atomic grouping, two real-world extensions of regexes. It uses memoization in order to achieve a linear-time complexity. We also prove that it is consistent (i.e., it chooses the same matching position as the original algorithm without memoization).

The technical key to our algorithm is the design of suitable memoization tables. We follow the general idea in [10] of using memoization for backtracking matching, but our examination of its issues with extended regexes (Remark 2) shows that the range—i.e., the set of possible entries—of memoization tables should be suitably extended. Specifically, the range in [10] is \(\{\textbf{false}\}\), recording only matching failures; it is extended in our algorithm to \(\{ \textsf{Failure}(j)\mid j \in \{ 0, \dots , \nu (\mathcal {A}) \} \}\cup \{\textsf{Success}\}\). Here, \(\nu (\mathcal {A})\) is the maximum nesting depth of atomic grouping for the (extended) NFA \(\mathcal {A}\), defined in §5.

Our development is rigorous and systematic, based on the notion of NFAs whose transition labels can themselves be NFAs. This extended notion of NFA is suggested in [10, Section IX.B]; in this paper, we formalize it and build its theory.

We experimentally evaluate our algorithm; the experiment results confirm its performance advantages. Additionally, we survey the usage status of look-around and atomic grouping—two regex extensions of our interest—in real-world regexes and demonstrate their wide usage (§6).

Technical Contributions We summarize our technical contributions.

  • We propose a backtracking matching algorithm for regexes with look-around, proving its linear-time complexity (§4). This algorithm fixes the issues in the algorithm in [10] (Remark 2) and restores correctness and linearity.

  • We also propose a backtracking matching algorithm for regexes with atomic grouping, proving its linear-time complexity (§4).

  • We experimentally confirm the performance of our algorithms (§6).

  • We investigate the usage status of look-around and atomic grouping in real-world regexes and confirm their wide usage (§6).

  • We establish a rigorous theoretical basis for our algorithms for extended regexes, namely NFAs with sub-automata (§2.6).

Organization We provide some preliminaries in §2, such as the regex extensions of our interest. Our formalization of NFAs with sub-automata is also presented there. In §3, we discuss the work [10] that is closest to ours. We present our matching algorithm for regexes with look-around in §4 and the one for regexes with atomic grouping in §5. Then, we discuss our implementation and experimental evaluation in §6. We conclude in §7.

Some additional proofs and other materials are deferred to the appendices in the extended version [15].

Related Work Many related works are discussed elsewhere in this paper, in suitable contexts. Here, we discuss the remaining ones.

There are many theoretical studies on look-around and atomic grouping. The work [27] is a theoretical study of look-ahead operators; it shows how to convert them to finite automata. Another conversion based on derivatives is introduced in [26]. The work [3] conducts a fine-grained analysis of the size of DFAs obtained from converting regexes with look-ahead, improving the bounds given in [26, 27]. The work [5] discusses the relation between look-ahead operators and back-references in regexes. A recent study [22] presents a linear-time matching algorithm for regexes with look-around; it uses a memoization-like construct for efficiency. However, the compatibility with backtracking is not a concern there, unlike the current work. On atomic grouping, conversion to finite automata is proposed [4], where atomic grouping is simulated by look-ahead.

Another common regex extension is back-reference. We do not deal with this extension because 1) this extension is known to be non-regular (i.e., the language class defined by back-reference is beyond regular), and 2) its matching problem is known to be NP-complete [1] (thus the search for linear-time matching is doomed). There are other extensions (absent operators, conditional branching, etc.), but they are used less often (cf. §6).

ReDoS countermeasures are an active scientific topic. Besides efficient matching, there are two directions for them: ReDoS detection and ReDoS repair. ReDoS detection is a problem that determines whether a given regex can cause catastrophic backtracking. This can be done by finding specific structures in a transition diagram of an automaton [2, 18, 29, 34, 36, 37]. Besides, dynamic analysis, such as fuzzing [31], and combinations of static and dynamic analyses [19] are studied. ReDoS repair is a problem of modifying a given regex so that it does not cause ReDoS. Known solutions include exploring ReDoS-free regexes using SMT solvers [6, 21] and rule-based rewriting of vulnerable regexes [20]. These ReDoS detection and repair measures are computationally demanding, and their real-world deployment is limited.

There are other implementation-level studies on speeding up regex matching, such as Just-in-Time (JIT) compilation [17] and FPGA [32]. However, these studies are not intended to prevent catastrophic backtracking.

2 Preliminaries

We introduce preliminaries for this paper. Firstly, we present some basic concepts such as regexes, NFAs, conversion from regexes to NFAs, and backtracking matching. We then discuss catastrophic backtracking and the ReDoS vulnerability that it can cause. Finally, we introduce look-around and atomic grouping as practical regex extensions and NFAs with sub-automata for these extensions.

We fix a finite set \(\varSigma \) as an alphabet throughout this paper. We call sequences of elements of \(\varSigma \) strings. The empty string is denoted by \(\varepsilon \). For a string \(w = \sigma _0 \sigma _1 \dots \sigma _{n-1}\), the length of w, denoted by |w|, is defined as \(|w| = n\). We also write \(w[i] = \sigma _i\) for \(i \in \{ 0, \dots , n - 1 \}\).

We use partial functions for memoization. For two sets A and B, a partial function G from A to B, denoted by \(G:A \rightharpoonup B\), is defined as a function \(G:A \rightarrow B \cup \{ \bot \}\). Here \(\bot \) is the element for “undefined,” and it is assumed that \(\bot \not \in B\).

Let \(G:A \rightharpoonup B\) be a partial function, \(a \in A\), and \(b \in B\). We let \(G(a) \leftarrow b\) denote an updated partial function: it carries a to b, and any other \(x\in A\) to G(x) (it is undefined if G(x) is initially undefined).
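For illustration, such a partial function with the update operation can be realized as a mutable hash map; the following Scala sketch is ours (it is not taken from our implementation memo-regex), with the absence of a key playing the role of \(\bot \).

```scala
import scala.collection.mutable

// A partial function G : A ⇀ B as a mutable map;
// absence of a key plays the role of ⊥ ("undefined").
final class PartialFun[A, B] {
  private val table = mutable.HashMap.empty[A, B]
  def apply(a: A): Option[B] = table.get(a)    // G(a), with None encoding ⊥
  def update(a: A, b: B): Unit = table(a) = b  // the update G(a) ← b
}
```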

2.1 Regexes

Regular expressions (regexes) are defined by the following abstract grammar.

$$\begin{aligned} r \;::=\;& \ \sigma && \text {(a (literal) character, where } \sigma \in \varSigma \text {)} \\ \mid \;& \ \varepsilon && \text {(the empty string)} \\ \mid \;& \ r \,|\, r && \text {(an alternation)} \\ \mid \;& \ r \cdot r && \text {(a concatenation)} \\ \mid \;& \ r^* && \text {(a repetition)} \end{aligned}$$

The concatenation operator \(\cdot \) may be omitted when there is no ambiguity. Operator precedence is, from highest to lowest: repetition, concatenation, alternation. For example, \(ab^*| c\) means \((a \cdot (b^*)) | c\).

For a regex r, the size of r, denoted by |r|, is defined as follows: \(|\sigma | = |\varepsilon | = 1\), \(|(r_1|r_2)| = |r_1 \cdot r_2| = |r_1| + |r_2| + 1\), and \(|r^*| = |r| + 1\).
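For concreteness, the abstract grammar and the size function can be transcribed into Scala as follows (an illustrative sketch; the constructor names are ours).

```scala
// Abstract syntax of (plain) regexes, following the grammar above.
sealed trait Regex
case class Chr(sigma: Char)           extends Regex // a literal character σ ∈ Σ
case object Eps                       extends Regex // the empty string ε
case class Alt(r1: Regex, r2: Regex)  extends Regex // alternation r | r
case class Cat(r1: Regex, r2: Regex)  extends Regex // concatenation r · r
case class Star(r: Regex)             extends Regex // repetition r*

// The size |r|, exactly as defined above.
def size(r: Regex): Int = r match {
  case Chr(_) | Eps => 1
  case Alt(r1, r2)  => size(r1) + size(r2) + 1
  case Cat(r1, r2)  => size(r1) + size(r2) + 1
  case Star(r1)     => size(r1) + 1
}
```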

2.2 NFAs

A non-deterministic finite state automaton (NFA) is a quadruple \((Q, q_0, F, T)\), where Q is a finite set of states, \(q_0 \in Q\) is an initial state, \(F \subseteq Q\) is a set of accepting states, and T is a transition function. For each \(q \in Q \setminus F\), T(q) is one of the following: \(T(q) = \textsf{Eps}({q'})\), \(T(q) = \textsf{Branch}(q', q'')\), or \(T(q) = \textsf{Char}(\sigma , {q'})\), where \(q', q'' \in Q\) and \(\sigma \in \varSigma \).

The above definition of a transition function T is tailored to our purpose of backtracking. Compared to the common definition \(\delta :Q \times (\{ \varepsilon \} \cup \varSigma ) \rightarrow 2^Q\), it expresses general branching as combinations of certain elementary branchings: a single \(\varepsilon \)-transition, an ordered pair of \(\varepsilon \)-transitions, and a transition that consumes a specific character \(\sigma \in \varSigma \). This makes the description of backtracking matching easier. Note, in particular, that the successors \(q', q''\) in the branching \(\textsf{Branch}(q', q'')\) are ordered; \(q'\) and \(q''\) are called the first and second successors, respectively. This definition of transition functions is similar to the op-codes of many real-world regex-matching implementations (cf. [8]).
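A direct Scala transcription of this definition might look as follows (again an illustrative sketch with our naming, not the op-codes of any particular engine).

```scala
// States as integers; the transition function T is defined on non-accepting states.
type State = Int

sealed trait Transition
case class EpsT(q1: State)               extends Transition // one ε-transition
case class Branch(q1: State, q2: State)  extends Transition // ordered: first, then second successor
case class CharT(sigma: Char, q1: State) extends Transition // consume σ and move to q1

case class Nfa(states: Set[State], q0: State,
               accepting: Set[State], t: Map[State, Transition])
```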

Fig. 1. a conversion from regexes to NFAs

We present a conversion from regexes to NFAs (see Figure 1); it is similar to the Thompson–McNaughton–Yamada construction [23, 35]. For a regex r, \(\mathcal {A}(r)\) denotes the NFA \(\mathcal {A}\) converted from r. In the figure, labels on arrows show kinds of transitions. In a \(\textsf{Branch}\) transition, the top arrow points to the first successor, and the bottom points to the second successor. Rectangles indicate that the conversion is applied to sub-expressions inductively. Because each case of this construction introduces at most two new states, for a regex r and the NFA \(\mathcal {A}(r) = (Q, q_0, F, T)\), we have \(|Q| = O(|r|)\).

We collectively call \(\textsf{Eps}\) and \(\textsf{Branch}\) transitions \(\varepsilon \)-transitions. Later in this paper, if there are consecutive \(\varepsilon \)-transitions, they may be shown as a single transition in a figure. A sequence of \(\varepsilon \)-transitions that leads from a state back to itself is called an \(\varepsilon \)-loop. \(\varepsilon \)-loops are problematic because they make backtracking matching run forever.

An \(\varepsilon \)-loop can be detected during matching by recording a position on an input string when a state is visited. When an \(\varepsilon \)-loop is detected, several solutions exist to deal with it (see, e.g., [30]), such as treating an \(\varepsilon \)-loop as a failure (e.g., JavaScript and RE2) or treating it as a success but escaping it (e.g., Perl). These solutions can be easily adapted to our algorithms; therefore, for the simplicity of presentation, we introduce the following assumption.

Assumption 1

(no \(\varepsilon \)-loops). NFAs do not contain \(\varepsilon \)-loops.

2.3 Backtracking Matching

We present a basic backtracking matching algorithm for NFAs in Algorithm 1. It serves as a basis for optimization by memoization, both in [10] and in the current work.

The function \(\textsc {Match}_{\mathcal {A},w}\) is recursively called in this algorithm; it terminates thanks to Asm. 1. It takes two parameters: \(\mathcal {A}\) is an NFA, and w is an input string. It also takes two arguments: \(q \in Q\) is the current state, and \(i \in \{ 0, \dots , |w| \}\) is the current position on w. \(\textsc {Match}_{\mathcal {A},w}({q_0, i})\) for an NFA \(\mathcal {A} = (Q, q_0, F, T)\) returns \(\textsf{SuccessAt}(i')\) with the matching position \(i' \in \{ 0, \dots , |w| \}\) if the matching with \(\mathcal {A}\) succeeds from i to \(i'\) on w, or returns \(\textsf{Failure}\) if the matching fails.

The \(\textsc {Match}\) function implements partial matching: given the position \(i\in \{0,\dotsc ,|w|\}\) of interest, one obtains, by running \(\textsc {Match}_{\mathcal {A},w}({q_0, i})\), one “matching position” \(i'\) (if it exists) such that \(w[i]\, w[i+1]\,\dotsc w[i'-1]\) is accepted by \(\mathcal {A}\). Note the difference from total matching: given \(\mathcal {A}\) and w, it returns \(\textbf{true}\) if (the whole) w is accepted by \(\mathcal {A}\) and \(\textbf{false}\) otherwise. The practical relevance of partial matching should be clear: it is what text search and replacement build on.

Lines 5 to 8 in Algorithm 1 perform matching for \(\textsf{Branch}\) transitions. Here, the algorithm first tries matching from the first successor \(q'\), and if that fails, it tries matching from the second successor \(q''\) with the same position. This behavior is called backtracking.
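Based on this description, the shape of Algorithm 1 can be sketched in Scala as follows (our reading of the algorithm, using the types from the sketch in §2.2; the authoritative listing is Algorithm 1 itself).

```scala
sealed trait Result
case class SuccessAt(i: Int) extends Result
case object Failure          extends Result

def matchFrom(a: Nfa, w: String)(q: State, i: Int): Result =
  if (a.accepting(q)) SuccessAt(i)  // an accepting state: succeed at the current position
  else a.t(q) match {
    case EpsT(q1) => matchFrom(a, w)(q1, i)
    case Branch(q1, q2) =>          // backtracking: try the first successor, then the second
      matchFrom(a, w)(q1, i) match {
        case Failure => matchFrom(a, w)(q2, i)
        case success => success
      }
    case CharT(sigma, q1) =>        // consume one character if it is the expected one
      if (i < w.length && w(i) == sigma) matchFrom(a, w)(q1, i + 1) else Failure
  }
```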

We define the regex partial matching problem using the function \(\textsc {Match}\).

Problem 1 (regex partial matching)

Input: a regex r, an input string w, and a position \(i \in \{ 0, \dots , |w| \}\).

Output: \(\textsc {Match}_{\mathcal {A}(r),w}({q_0, i})\), where \(\mathcal {A}(r) = (Q, q_0, F, T)\).

Remark 1

One might find this problem formulation unusual: it requires, as output, the specific matching position chosen by a specific algorithm \(\textsc {Match}\), while a usual formulation would ask for an arbitrary matching position. We take this formulation since we aim to show that our optimization by memoization not only solves partial matching but also is consistent with an existing backtracking matching algorithm, in the sense discussed in §1. We formulate consistency as correctness with respect to Prob. 1, that is, preserving the solution chosen by the specific algorithm \(\textsc {Match}\). We also note that the algorithm \(\textsc {Match}\) mirrors many existing implementations of regex matching (cf. §2.2).

2.4 Catastrophic Backtracking and ReDoS

In the execution of the \(\textsc {Match}\) function (Algorithm 1), depending on an NFA \(\mathcal {A}\) and an input string w, the number of recursive calls for the \(\textsc {Match}\) function may increase explosively, resulting in a very long matching time, as we will see in Example 1. This explosive increase in matching time is called catastrophic backtracking.

Fig. 2. the NFA \(\mathcal {A}({(a|a)}^*b)\)

Example 1 (catastrophic backtracking)

Consider the NFA \(\mathcal {A} = \mathcal {A}({(a|a)}^*b) = (Q, q_0, F, T)\) shown in Figure 2, and let \(w = {"a^n c"}\) (the string consisting of n copies of a followed by c) be an input string. \(\textsc {Match}_{\mathcal {A},w}({q_0, 0})\) invokes \(O(2^n)\) recursive calls before returning \(\textsf{Failure}\). The cause of this explosion is that matching tries all combinations of the \(q_2\)-to-\(q_3\) and \(q_4\)-to-\(q_5\) transitions for each a in w.

Regex denial of service (ReDoS) is a security vulnerability caused by catastrophic backtracking. In ReDoS, catastrophic backtracking puts a huge load on servers, making them unable to respond in a timely manner. There were service outages due to ReDoS at Stack Overflow in 2016 [12] and at Cloudflare in 2019 [16]. Additionally, a 2018 study [33] reported that over 300 web services have potential ReDoS vulnerabilities. Thus, ReDoS is a widespread problem in the real world, and there is a great need for countermeasures.

According to a 2019 study [25], only 38% of developers are aware of ReDoS. This study also found that many developers find it difficult not only to read regexes but also to find and validate regexes that match their needs. It is mentioned in [25] that developers use Internet resources such as Stack Overflow to find regexes. In recent years, it has also become common to use generative AIs such as ChatGPT for this purpose. However, when the authors asked ChatGPT, “Please suggest 10 regexes for validating email addresses”, 2 of the 10 suggested regexes would cause ReDoS (see Table 1). Developers may unknowingly use such vulnerable regexes. For this reason, it is important to develop ReDoS countermeasures that work without the developer being aware of them.

Table 1. the regexes given by ChatGPT for the question “Please suggest 10 regexes for validating email addresses” (the second and third regexes are identical; this is the actual output of ChatGPT)

Matching speed-up is a way to avoid ReDoS by ensuring that matching time is linear in the length of the input string, freeing developers from worrying about ReDoS. A popular method for matching speed-up is to resolve non-determinism by breadth-first search instead of backtracking (depth-first search); this is the on-the-fly DFA construction [7, 28]. However, since look-around and atomic grouping are extensions based on backtracking (see §2.6), it is not obvious that they can be supported by the on-the-fly DFA construction.

Memoization is another approach to ensuring linear-time backtracking matching; we pursue it in this paper.

2.5 Regex Extensions: Look-around and Atomic Grouping

Many real-world regexes come with various extensions for enhanced expressivity [13]. In this paper, we are interested in two classes of extensions, namely look-around and atomic grouping.

Look-around Look-around is a regex extension that allows constraints on strings around a certain position. It is also called zero-width assertion (e.g., in [10]) because it does not consume any characters. Look-around consists of four types: positive or negative, and look-ahead or look-behind.

Positive look-ahead is typically represented by the syntax (?=r); its matching succeeds when, reading ahead from the current position of the input string, the matching of the inner regex r succeeds. Note that the position for the overall matching does not change by the inner matching of r. For example, the regex /(?=bc)/ matches the string "abc" from position 1 (i.e., after the first character a) without consuming any characters.

The matching of a negative look-ahead (?!r) succeeds when the inner regex r is not matched.

Positive or negative look-behind—denoted by \(\texttt {(?<=}\) \(r\)) or \(\texttt {(?<!}\) \(r\)), respectively—is similar to the above, with the difference that the inner matching of r is performed backward, i.e., from right to left. For example, the regex \(\texttt {/(?<=ab)/}\) matches the string "abc" from position 2 (i.e., before the last character c) without consuming any characters.

A typical use of look-around is to put a look-behind before (or a look-ahead after) a regex r. This is useful when one wants to perform a search or replacement of r only for those occurrences in a certain context. For example, a regex such as \(\texttt {(?<=<p>)}\dotsc \texttt {(?=</p>)}\) matches only the contents of the HTML \(\texttt {<p>}\) tag. As another example, the common assertions \(\mathtt {\hat{\,}}\) (this matches the beginning of a string) and $ (this matches the end) can be expressed using look-around, namely as a negative look-behind and a negative look-ahead, respectively.

Atomic Grouping Atomic grouping is a regex extension that controls backtracking behaviors. It is designed to manually avoid problems caused by backtracking, such as catastrophic backtracking (§2.4).

Atomic grouping is represented by the syntax \(\texttt {(?>}\) \(r\)); once the matching of the inner regex r succeeds, the remaining branches in potential backtracking for matching r are discarded. For example, the regex /(a|ab)c/ matches the string "abc", but the regex \(\texttt {/(?>(a|ab))c/}\) using atomic grouping does not match it. This is because, once a in the atomic grouping matches the first character a of "abc", the remaining branch ab (in a|ab) is discarded, and one is left with the regex c and the string "bc".
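This behavior can be reproduced with any backtracking engine that supports atomic grouping; for instance, using Java's java.util.regex from Scala (a quick illustration, not part of our experiments in §6):

```scala
import java.util.regex.Pattern

// Without atomic grouping: after the branch `a` leaves `bc` unmatched by `c`,
// the engine backtracks and retries with the branch `ab`, so the match succeeds.
println(Pattern.matches("(a|ab)c", "abc"))   // true

// With atomic grouping: once `a` matches, the remaining branch `ab` is discarded,
// so `c` faces the remaining input "bc" and the whole match fails.
println(Pattern.matches("(?>a|ab)c", "abc")) // false
```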

Atomic grouping is often used for the purpose of preventing catastrophic backtracking. In that case, it is used in combination with the repetition syntax, e.g., \(\texttt {(?>(}\) \(r\)*)) (often abbreviated as r*+) and \(\texttt {(?>(}\) \(r\)+)) (abbrev. as r++). These abbreviations are called possessive quantifiers. The former (namely \(\texttt {(?>(}\) \(r\)*))) is intuitively understood as \(\texttt {(?>(}\) \(\varepsilon \)|r|rr|\(\dotsc \))), with the difference that longer matching is preferred (this is because the \(\textsf{Eps}\) loop is the first successor in Figure 1e). Once a longer match is found, the remaining branches (i.e., those for shorter matches) get discarded, thus preventing catastrophic backtracking.

One might wonder whether our (linear-time and thus ReDoS-free) matching algorithm even needs to support atomic grouping, given that the principal use of atomic grouping is to suppress backtracking and avoid ReDoS. We do need to support it: as we discussed in §1, ours is meant to be a drop-in replacement for matching implementations that are currently used, and those implementations support atomic grouping.

Our Target Extended Regexes Our target class, namely regexes with look-around and atomic grouping, is defined by the following grammar.

$$\begin{aligned} r \;::=\;& \ \dots && \text {(the same as the regexes definition, } \S 2.1 \text {)} \\ \mid \;& \ ( \texttt {?=} r )\ \mid \ ( \texttt {?!} r ) && \text {(positive and negative look-ahead)} \\ \mid \;& \ ( \texttt {?<=} r )\ \mid \ ( \texttt {?<!} r ) && \text {(positive and negative look-behind)} \\ \mid \;& \ ( \texttt {?>} r ) && \text {(atomic grouping)} \end{aligned}$$

For brevity, we sometimes refer to regexes with look-around and atomic grouping as (la, at)-regexes. We also refer to regexes with look-around as la-regexes and regexes with atomic grouping as at-regexes.

For a (la, at)-regex r, the size of r, denoted by |r|, is defined in the same way as for regexes, with the additional clauses \(|( \texttt {?=} r )| = |( \texttt {?>} r )| = |r| + 1\) (and similarly for the other look-around operators).

Look-around is known to preserve regularity: la-regexes can be converted to DFAs, and the language class defined by la-regexes coincides with the regular languages. This fact is mentioned in [3, 26, 27]. Atomic grouping is also regular in the same sense [4]. However, it is known that look-ahead and atomic grouping can make the number of states of the corresponding DFA grow exponentially [3, 4, 26, 27].

In what follows, for simplicity, we only discuss positive look-ahead in discussions of look-around. Adaptation to other look-around operators, such as negative look-behind, is straightforward.

2.6 NFAs with Sub-automata

We introduce NFAs with sub-automata for backtracking matching algorithms for (la, at)-regexes. This extended notion of NFAs is suggested in [10, Section IX.B], but it seems ours is the first formal exposition.

Roughly speaking, an NFA with sub-automata is an NFA whose transitions can be labeled with—in addition to a character \(\sigma \in \varSigma \), as in usual NFAs—another NFA with sub-automata. See Figure 3, where transitions from \(q_{0}\) to \(q_{1}\) are labeled with \(\mathcal {A}(r)\), the NFA with sub-automata obtained by converting r. We annotate these transitions further with a label (\(\textsf{pla}\) for positive look-ahead, \(\textsf{at}\) for atomic grouping, etc.) that indicates which operator they arise from. Note that NFAs with sub-automata can be nested—transitions in \(\mathcal {A}(r)\) in Figure 3 can be labeled with NFAs with sub-automata, too.

Our precise definition is as follows. There, P is the set that collects all states that occur in an NFA with sub-automata \(\mathcal {A}\), i.e., in 1) the top-level NFA, 2) its label NFAs, 3) their label NFAs, and so on.

Definition 1 (NFAs with sub-automata)

An NFA with sub-automata \(\mathcal {A}\) is a quintuple \(\mathcal {A}=(P, Q, q_0, F, T)\) where P is a finite set of states and \(Q \subseteq P\) is a set of so-called top-level states. We require that the quadruple \((Q, q_0, F, T)\) is an NFA, except that the value T(q) of the transition function T is either 1) \(\textsf{Eps}(q')\), \(\textsf{Branch}(q',q'')\), or \(\textsf{Char}(\sigma ,q')\) (as in usual NFAs, §2.2), or 2) \(\textsf{Sub}(k, \mathcal {A}', q')\), where \(\mathcal {A}'\) is an NFA with sub-automata, \(q'\) is a successor state, and k is a kind label with \(k\in \{\textsf{pla}, \textsf{nla}, \textsf{plb}, \textsf{nlb}, \textsf{at}\}\).

We further impose the following requirements. Firstly, we require all NFAs with sub-automata in \(\mathcal {A}\) to have disjoint state spaces. That is, for any distinct top-level states \(q,q''\in Q\) in \(\mathcal {A}\), if \(T(q) = \textsf{Sub}(k, \mathcal {A}', q')\) and \(T(q'') = \textsf{Sub}(k', \mathcal {A}'', q''')\), then we must have \(P' \cap P'' = \emptyset \), \(Q \cap P' = \emptyset \) and \(Q \cap P'' = \emptyset \), where \(\mathcal {A}' = (P', \dotsc )\) and \(\mathcal {A}'' = (P'', \dotsc )\). Secondly, we require that the set P in \(\mathcal {A}=(P,\dotsc )\) is the (disjoint) union of all states that occur within \(\mathcal {A}\), that is, \(P = Q \cup \bigcup _{q\in Q, T(q) = \textsf{Sub}(k, \mathcal {A}', q'), \mathcal {A}' = (P', \dotsc )} P'\).

The kind label k in \(\textsf{Sub}(k, \mathcal {A}', q'')\) indicates how the sub-automaton \(\mathcal {A}'\) should be used (cf. Algorithm 2). If every kind label occurring in \(\mathcal {A}\) (including its sub-automata) is either \(\textsf{pla}, \textsf{nla}, \textsf{plb}\), or \(\textsf{nlb}\), then \(\mathcal {A}\) is called a la-NFA. Similarly, if every kind label is \(\textsf{at}\), \(\mathcal {A}\) is called an at-NFA. Following this convention, general NFAs with sub-automata are called (la, at)-NFAs.

Note that the definition is recursive. Non-well-founded nesting is prohibited, however, by the finiteness of P. By the definition, if \(P = Q\), then \(\mathcal {A}\) does not contain any transitions labeled with sub-automata.
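Definition 1 can be transcribed into Scala along the following lines (an illustrative sketch extending the types from the sketch in §2.2; the naming is ours).

```scala
// Kind labels k for Sub transitions (Definition 1).
sealed trait Kind
case object Pla extends Kind // positive look-ahead
case object Nla extends Kind // negative look-ahead
case object Plb extends Kind // positive look-behind
case object Nlb extends Kind // negative look-behind
case object At  extends Kind // atomic grouping

sealed trait TransitionS
case class EpsS(q1: State)                    extends TransitionS
case class BranchS(q1: State, q2: State)      extends TransitionS
case class CharS(sigma: Char, q1: State)      extends TransitionS
case class Sub(k: Kind, sub: NfaS, q1: State) extends TransitionS // labeled with an NFA

// `allStates` is P (all states, including those of nested sub-automata);
// `topStates` is Q. Finiteness of P rules out non-well-founded nesting.
case class NfaS(allStates: Set[State], topStates: Set[State], q0: State,
                accepting: Set[State], t: Map[State, TransitionS])
```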

In addition to \(\textsf{Eps}\) and \(\textsf{Branch}\) transitions, we refer to \(\textsf{Sub}\) transitions with a label \(k \in \{ \textsf{pla}, \textsf{nla}, \textsf{plb}, \textsf{nlb} \}\) as \(\varepsilon \)-transitions too. We also assume the following, similarly to Asm. 1.

Assumption 2

(la, at)-NFAs do not contain \(\varepsilon \)-loops.

For (la, at)-regexes, their conversion to (la, at)-NFAs is described by the constructions in Figure 3—using transitions labeled with sub-automata—in addition to the conversion for regexes in §2.2. Note that we have \(|P| = O(|r|)\) in these constructions.

Fig. 3. a conversion from (la, at)-regexes to (la, at)-NFAs. For negative look-ahead, we use the corresponding kind label \(\textsf{nla}\). For positive and negative look-behind, besides using the kind labels \(\textsf{plb}\) and \(\textsf{nlb}\), we suitably reverse \(\mathcal {A}(r)\).

The backtracking matching algorithm in Algorithm 1 can be naturally extended to (la, at)-NFAs; it is shown in Algorithm 2. The clauses for positive look-ahead (Lines 12 to 16) and atomic grouping (Lines 17 to 21) are similar to each other, conducting matching for sub-automata. Note that their difference is in the “return position” (i in Line 15; \(i'\) in Line 20).

The clauses for the other look-around operators are similar to the ones for positive look-ahead. For look-behind, we can suitably use an additional parameter \(d \in \{ -1, +1 \}\) indicating the matching direction.
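Our reading of Algorithm 2's clauses can be sketched as follows (illustrative, on top of the sketches above; negative look-ahead and look-behind are omitted, and the authoritative listing is Algorithm 2 itself). Note how the two Sub cases differ only in the return position.

```scala
def matchLaAt(a: NfaS, w: String)(q: State, i: Int): Result =
  if (a.accepting(q)) SuccessAt(i)
  else a.t(q) match {
    case EpsS(q1)        => matchLaAt(a, w)(q1, i)
    case BranchS(q1, q2) => matchLaAt(a, w)(q1, i) match {
      case Failure => matchLaAt(a, w)(q2, i)
      case success => success
    }
    case CharS(sigma, q1) =>
      if (i < w.length && w(i) == sigma) matchLaAt(a, w)(q1, i + 1) else Failure
    case Sub(Pla, sub, q1) =>           // positive look-ahead: zero-width
      matchLaAt(sub, w)(sub.q0, i) match {
        case SuccessAt(_) => matchLaAt(a, w)(q1, i)   // resume at i: nothing consumed
        case Failure      => Failure
      }
    case Sub(At, sub, q1) =>            // atomic grouping: consuming
      matchLaAt(sub, w)(sub.q0, i) match {
        case SuccessAt(i1) => matchLaAt(a, w)(q1, i1) // resume at i′; the single result of
        case Failure       => Failure                 // sub is final (branches discarded)
      }
    case Sub(_, _, _) => Failure        // nla, plb, nlb omitted in this sketch
  }
```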

Using the extended backtracking matching algorithm (Algorithm 2), we define the partial matching problem for (la, at)-regexes in the same way as for regexes without extensions (Prob. 1).

Problem 2 ((la, at)-regex partial matching)

Input: a (la, at)-regex r, an input string w, and a position \(i \in \{ 0, \dots , |w| \}\).

Output: \(\textsc {Match}\mathrm {\text {-}(la,at)}_{\mathcal {A}(r),w}({q_0, i})\), where \(\mathcal {A}(r) = (P, Q, q_0, F, T)\).

3 Previous Works on Regex Matching with Memoization

This section reviews the existing work [10] on regex matching with memoization, paving the way for our algorithms for (la, at)-regexes in Sections 4 and 5.

Memoization is a programming technique that makes recursive computations more efficient by 1) recording arguments of a function and the corresponding return values and 2) reusing them when the function is called with the recorded arguments.

As we described in §2.3, regex matching is conducted by backtracking matching. It is implemented by recursive functions (see Algorithms 1 and 2); thus, it is a natural idea to apply memoization. Since Java 14, Java’s regex implementation has indeed used memoization for optimization. However, this optimization is not enough to completely prevent ReDoS; see, e.g., [24].

The work that inspires the current work the most is [10], whose main novelty is linear-time backtracking regex matching (much like the current work). Its contributions are as follows.

  1. Focusing on (non-extended) regexes (see §2.1), they introduce a backtracking matching algorithm that uses memoization. It achieves a linear-time complexity: for an input string w, its runtime is O(|w|).

  2. They introduce selective memoization, by which they reduce the domain of the memoization table from \(Q\times \mathbb {N}\) to \(Q_{\textsf {sel}}\times \mathbb {N}\). Here \(Q_{\textsf {sel}}\) is a subset of Q that is often much smaller.

  3. They introduce a memory-efficient compression method—based on run-length encoding (RLE)—for memoization tables.

  4. Finally, they discuss adaptations of the above method to extended regexes, namely REWZWA (the extension by look-around; look-around is called zero-width assertion in [10]) and REWBR (the extension by back-reference).

We will mainly discuss item 1 above; it serves as a basis for our algorithms in Sections 4 and 5. The technique of item 2 is potentially very relevant: we expect that it can be combined with the current work; doing so is future work. The content of item 2 is reviewed in [15, Appendix A] for the record.

Remark 2

On the above item 4, the work [10] claims that the time complexity of their algorithm is linear also for REWZWA (O(|w|) for an input string w). However, we believe that this claim comes with the following problems.

  • The description of an algorithm for REWZWA in [10] is abstract and leaves room for interpretation. The description is to “preserve the memo functions of the sub-automata throughout the simulation of the top-level M-NFA, remembering the results from sub-simulations that begin at different indices i of w” [10, Section IX-B]. For example, it is not explicit what the “results” are—they can mean (complete) matching results or mere success/failure.

  • Moreover, the part “that begin at different indices i of w” is problematic; we believe that remembering these results does not lead to linear-time complexity. This point is discussed later in Remark 4.

  • Besides, there is a gap between the algorithm described in the paper [10] and its prototype implementation [11], even for (non-extended) regexes. See Remark 3.

  • Because of this gap, the implementation [11] works in linear time for all regexes, including REWZWA, but can lead to erroneous results for REWZWA. See Remark 4.

Our contribution includes a correct memoization algorithm for look-around (REWZWA) that resolves the above problems.

3.1 Linear-time Backtracking Matching with Memoization

We describe the first main contribution of the work [10] (the item 1 in the above list), namely a backtracking algorithm that achieves a linear-time complexity thanks to memoization. The algorithm [10, Listing 2] is presented in Algorithm 3.

Fig. 4. the NFA \(\mathcal {A}({(aa|aa)}^*b)\), after removing \(\varepsilon \)-transitions

In this algorithm \(\textsc {DavisSL}^{M}_{\mathcal {A}, w}\), an NFA \(\mathcal {A}\) is a quadruple \((Q, q_0, F, \delta )\) where \(\delta :Q \times \varSigma \rightarrow 2^Q\) is a non-deterministic transition function. An additional parameter \(M:Q \times \mathbb {N} \rightharpoonup \{ {\textbf {false}} \}\) is a memoization table, which is mathematically a mutable partial function. This algorithm implements total matching (cf. §2.3). It is notable that the memoization table records only matching failures: a matching success does not have to be recorded since it immediately propagates to the success of the whole problem.
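The shape of Algorithm 3, as described above, can be sketched in Scala as follows (our reconstruction from the description, not the listing of [10]; a set of failed keys carries the same information as the failure-only table M). The marked comment is where the implementation variant (Algorithm 4, cf. Remark 3) moves the memo write.

```scala
import scala.collection.mutable

// Total matching with failure-only memoization, for an NFA given by
// δ : Q × Σ → 2^Q (no ε-transitions).
def davisSL(delta: (State, Char) => Set[State], accepting: Set[State], w: String,
            memo: mutable.HashSet[(State, Int)])(q: State, i: Int): Boolean =
  if (memo((q, i))) false              // a memoized failure: stop immediately
  else if (i == w.length) accepting(q) // total matching: the whole input is consumed
  else {
    // (Algorithm 4, the implementation variant, records memo += ((q, i)) already here.)
    val ok = delta(q, w(i)).exists(q1 => davisSL(delta, accepting, w, memo)(q1, i + 1))
    if (!ok) memo += ((q, i))          // Algorithm 3 records the failure on exit (Line 7)
    ok
  }
```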

This algorithm achieves a linear-time matching. It thus prevents ReDoS. A full proof of linear-time complexity is found in [10, Appendix C], but its essence is the following (note the critical role of memoization here).

  • For any call \(\textsc {DavisSL}^M_{\mathcal {A}, w}({q, i})\), if M(q, i) is defined, then the call does not invoke any further recursive calls.

  • When such a call returns \(\textbf{false}\), the entry M(q, i) of the memoization table gets defined (Line 7).

  • As a consequence, the number of recursive calls of \(\textsc {DavisSL}^{M}_{\mathcal {A}, w}\) is limited to \(|Q|\times |w|\).

Example 2

(matching with memoization for NFAs without \(\varepsilon \)-transitions). Let us consider the regex \((aa|aa)^*b\) and the corresponding NFA \(\mathcal {A}((aa|aa)^*b)\) obtained as in §2.2. For the purpose of applying Algorithm 3, we manually remove its \(\varepsilon \)-transitions, leading to the NFA in Figure 4. Let \(w = {"a^{2n} c"}\) be an input string. \(\textsc {Match}_{\mathcal {A},w}({q_0, 0})\) (without memoization) invokes \(O(2^n)\) recursive calls for the same reason as in Example 1, but \(\textsc {DavisSL}^{M_0}_{\mathcal {A}, w}(q_0, 0)\) (with memoization, where \(M_0\) is the initial memoization table) invokes only O(n) recursive calls because \(M(q_0,i)\) for each position \(i\in \{ 0, 2, \dots , 2n \}\) is recorded after the first visit.

Remark 3

Following the discussion in Remark 2, here we describe the gap between Algorithm 3—the algorithm described in the paper [10]—and its prototype implementation [11]. The latter is shown in Algorithm 4.

The precise difference between the two algorithms is that Line 7 in Algorithm 3 is moved up to the moment just before the for-loop in Algorithm 4. It is not hard to see that this modification does not affect the correctness of the algorithm: if the pair (q, i) is visited again in the future, it means that the current matching from (q, i) did not succeed, and backtracking occurred. Note that, in case the current matching is successful, the function call returns \(\textbf{true}\), so the memoization content M(q, i) does not matter.

However, the above argument is true only when there is no look-around. (A detailed discussion is in Example 3.) This point seems to be missed in the implementation [11].

3.2 Matching with Memoization Adapted to the Current Formalism

In Algorithm 5, we present an adaptation of Algorithm 3 to our formalism, especially our definition of NFA (§2.2) that offers fine-grained handling of nondeterminism. Algorithm 5 has been adapted also to solve partial matching (it returns a matching position \(i'\)) rather than total matching as in Algorithm 3 (cf. §2.3). Algorithm 5 serves as a basis towards our extensions to look-around and atomic grouping in Sections 4 and 5.

The adaptation is straightforward: Line 5 ensures that the algorithm solves partial matching; the rest is a natural adaptation of the for-loop of Algorithm 3 to our definition of NFA (§2.2). The algorithm terminates thanks to Asm. 1. We note that the type of memoization tables does not have to be changed compared to Algorithm 3.
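A sketch of this adaptation (our illustration; the authoritative listing is Algorithm 5) refines the matchFrom sketch from §2.3 with a failure-only memoization table.

```scala
import scala.collection.mutable

// Partial matching with failure-only memoization, in our NFA formalism.
def memoMatch(a: Nfa, w: String,
              failed: mutable.HashSet[(State, Int)])(q: State, i: Int): Result =
  if (failed((q, i))) Failure           // memoized failure: no re-exploration
  else if (a.accepting(q)) SuccessAt(i) // partial matching: succeed at the current position
  else {
    val res = a.t(q) match {
      case EpsT(q1) => memoMatch(a, w, failed)(q1, i)
      case Branch(q1, q2) => memoMatch(a, w, failed)(q1, i) match {
        case Failure => memoMatch(a, w, failed)(q2, i)
        case success => success
      }
      case CharT(sigma, q1) =>
        if (i < w.length && w(i) == sigma) memoMatch(a, w, failed)(q1, i + 1) else Failure
    }
    if (res == Failure) failed += ((q, i)) // record the failure on exit
    res
  }
```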

Algorithm 5 exhibits the same desired properties as Algorithm 3, namely correctness (with respect to Prob. 1) and linear-time complexity. We formally state these properties for the record; here, \(M_0:Q \times \mathbb {N} \rightharpoonup \{ \textsf{Failure} \}\) is the initial memoization table (all of its entries are \(\bot \)).

Theorem 1 (linear-time complexity of Algorithm 5)

For an NFA \(\mathcal {A} = (Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Memo}^{M_0}_{\mathcal {A}, w}({q_0, i})\) terminates with O(|w|) recursive calls.

Theorem 2 (correctness of Algorithm 5)

For an NFA \(\mathcal {A} = (Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Match}_{\mathcal {A},w}({q_0, i}) = \textsc {Memo}^{M_0}_{\mathcal {A}, w}({q_0, i})\).

The proofs can be found in [15, Appendix B.1]. Here is their outline.

We first introduce the notion of run for \(\textsc {Match}\) and \(\textsc {Memo}\); it records recursive calls of the function itself, as well as invocations of the memoization table, together with their return values.

For linear-time complexity (Thm. 1), we show that 1) a recursive call with the same argument (q, i) appears at most once in a run, and that 2) the number of invocations of the memoization table with the same key (q, i) is bounded by the (graph-theoretic) in-degree. Linear-time complexity then follows easily.

For correctness (Thm. 2), we introduce a conversion from runs of \(\textsc {Memo}\) to runs of \(\textsc {Match}\). By showing that 1) the result is indeed a valid run of \(\textsc {Match}\) and 2) the conversion preserves return values, we show the coincidence of the return values of the two algorithms, i.e., correctness.

4 Memoization for Regexes with Look-around

We describe our first main technical contribution, namely a backtracking matching algorithm for la-NFAs with memoization (Algorithm 6). We prove that it is correct (Thm. 4) and that its time complexity is linear (O(|w|), Thm. 3).

The key ingredient of our algorithm is the type of memoization tables, whose range is extended from \(\{\textsf{Failure}\}\) to \(\{\textsf{Failure}, \textsf{Success}\}\). We motivate this extension through two problematic algorithms, \(\textsc {MemoExit}{\mathrm {\text {-}la}}\) and \(\textsc {MemoEnter}{\mathrm {\text {-}la}}\). \(\textsc {MemoExit}{\mathrm {\text {-}la}}\) is obtained by naively extending Algorithm 5 (\(\textsc {Memo}\)) with the processing of sub-automaton transitions labeled \(\textsf{pla}\) (positive look-ahead) done in Algorithm 2 (Lines 12 to 16); \(\textsc {MemoEnter}{\mathrm {\text {-}la}}\) is similar to \(\textsc {MemoExit}{\mathrm {\text {-}la}}\), but it writes to the memoization table at the same timing as Algorithm 4 (\(\textsc {DavisSLImpl}\)). In particular, the memoization tables of both record only \(\textsf{false}\).

The example below shows the problems with the two naive algorithms. Specifically, \(\textsc {MemoExit}{\mathrm {\text {-}la}}\) is not linear and \(\textsc {MemoEnter}{\mathrm {\text {-}la}}\) is not correct.

Fig. 5. the la-NFA \(\mathcal {A}(((\texttt {?=} a^*) a)^*)\)

Fig. 6. the at-NFA \(\mathcal {A}(a^*(\texttt {?>} a^*) ab)\)

Example 3

Consider the la-NFA \(\mathcal {A} = \mathcal {A}(((\texttt {?=} a^*) a)^*) = (P, Q, q_0, F, T)\) shown in Figure 5, and let \(w = {"a^n"}\) be an input string.

\(\textsc {MemoExit}\mathrm {\text {-}la}^{M_0}_{\mathcal {A}, w}({q_0, 0})\) invokes \(O(|w|^2)\) recursive calls—in the same way as \(\textsc {Match}\mathrm {\text {-}(la,at)}\)—because there are no matching failures in the look-ahead sub-automaton \(\mathcal {A}'\) that contribute to memoization.

We also see \(\textsc {MemoEnter}{\mathrm {\text {-}la}}\) is not correct: \(\textsc {Match}\mathrm {\text {-}(la,at)}_{\mathcal {A},w}(q_0 ,0)\) returns \(\textsf{SuccessAt}(n)\), but \(\textsc {MemoEnter}\mathrm {\text {-}la}^{M_0}_{\mathcal {A}, w}({q_0, 0})\) returns \(\textsf{SuccessAt}(1)\) because \(M(q_5, 1)=\textbf{false}\) is recorded during the first loop and interpreted as a matching failure.

In Example 3, a natural solution to the non-linearity issues with \(\textsc {MemoExit}{\mathrm {\text {-}la}}\) is to enrich memoization so that it also records previous successes of look-around. Furthermore, since matching positions do not matter in look-around, the type of memoization tables should be \(M:P \times \mathbb {N} \rightharpoonup \{ \textsf{Failure}, \textsf{Success} \}\).

Remark 4

The work [10, Section IX-B] proposes an adaptation of their memoization algorithm to REWZWA. Its description in [10, Section IX-B] (to “preserve the memo functions\(\dotsc \)”; see Remark 2) consists of the following two points:

  1. preserving the memoization tables of the sub-automata throughout the whole matching, and

  2. recording the results of sub-automata matching from different start positions i of w.

The naive algorithm \(\textsc {MemoExit}{\mathrm {\text {-}la}}\) we discussed above implements the first point. We can further add the second point (that is essentially “memoization for sub-automaton matching”) to \(\textsc {MemoExit}{\mathrm {\text {-}la}}\).

However, we find that this is not enough to ensure linear-time complexity. The problem is that the “memoization for sub-automaton matching” is used too infrequently. For example, in Example 3, the start positions of sub-automaton matching are different each time; thus, the memoized results are never used.

Our algorithm (Algorithm 6) resolves this problem by letting the memoization tables (for sub-automaton matching) record results not only for starting positions but also for non-starting positions.

We also note that there is a gap between the algorithm in the paper [10] and its prototype implementation [11]; see Remark 3. The latter is linear time but not always correct. For example, in Example 3, the correct result is \(\textsf{SuccessAt}(n)\), but the prototype [11] returns \(\textsf{SuccessAt}(1)\), similarly to \(\textsc {MemoEnter}{\mathrm {\text {-}la}}\).

Algorithm 6 is the matching algorithm for la-NFAs that we propose. It adopts the above extended type of M. In Line 18, \(\textsf{Success}\) is recorded in the memoization table when a matching succeeds. This function can return one of \(\textsf{SuccessAt}(i')\), \(\textsf{Failure}\), and \(\textsf{Success}\). We first prove the following lemma to see that the algorithm indeed solves the partial matching problem (Prob. 2).
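The memoization discipline can be summarized as follows (a schematic sketch of the table type and the hit/record rules as we have described them; the precise recording points are those of Algorithm 6, which is not reproduced here).

```scala
// Memoization entries for la-NFAs: both failures and successes are recorded.
sealed trait MemoLa
case object MFailure extends MemoLa
case object MSuccess extends MemoLa // position-free: look-around discards end positions

// m : P × ℕ ⇀ {Failure, Success}, keyed by states of ALL automata (top-level and sub).
val m = scala.collection.mutable.HashMap.empty[(State, Int), MemoLa]

// Discipline (our summary, not the listing of Algorithm 6):
//  * on a hit m((q, i)) = MFailure, behave as a failed matching from (q, i);
//  * on a hit m((q, i)) = MSuccess inside a look-ahead sub-automaton, behave as a
//    successful look-ahead (the end position is irrelevant; cf. Lemma 1 below);
//  * record MFailure on exit as in Memo, and record MSuccess when a matching from
//    (q, i) succeeds (Line 18), so that repeated look-ahead runs over the same
//    positions are answered from the table.
```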

Lemma 1

For a la-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Memo}\mathrm {\text {-}la}^{M_0}_{\mathcal {A}, w}({q_0, i})\) returns either \(\textsf{SuccessAt}(i')\) for \(i' \in \{ 0, \dots , |w| \}\) or \(\textsf{Failure}\) (it does not return \(\textsf{Success}\)).

Proof

When we obtain \(\textsf{Success}\) as a return value, it must be via an entry M(q, i) of the memoization table. However, due to Asm. 2, when M(q, i) is set to \(\textsf{Success}\) for a state q of the top-level automaton of \(\mathcal {A}\), the matching is already finished and returns \(\textsf{SuccessAt}(i')\).    \(\square \)

As a consequence of the lemma, we can further shrink the memoization tables in Algorithm 6 by not recording \(\textsf{Success}\) for M(q, i) where q is a state of the top-level automaton.

Algorithm 6 exhibits the desired properties, namely correctness (with respect to Prob. 2) and linear-time complexity.

Theorem 3 (linear-time complexity of Algorithm 6)

For a la-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Memo}\mathrm {\text {-}la}^{M_0}_{\mathcal {A}, w}({q_0, i})\) terminates with O(|w|) recursive calls.

Theorem 4 (correctness of Algorithm 6)

For a la-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Match}\mathrm {\text {-}(la,at)}_{\mathcal {A},w}({q_0, i}) = \textsc {Memo}\mathrm {\text {-}la}^{M_0}_{\mathcal {A}, w}({q_0, i})\).

Thm. 3 and 4 can be shown similarly to Thm. 1 and 2; see [15, Appendix B.2]. The in-degree for sub-automata requires some additional care.

5 Memoization for Regexes with Atomic Grouping

We describe our second main technical contribution, namely a backtracking matching algorithm for at-NFAs with memoization (Algorithm 7). We prove that it is correct (Thm. 6) and that its time complexity is linear (O(|w|), Thm. 5).

The key ingredient of our algorithm is the type of memoization tables, whose range is extended from \(\{\textsf{Failure}\}\) to \(\{ \textsf{Failure}(j)\mid j \in \{ 0, \dots , \nu (\mathcal {A}_0) \} \}\); the latter records the depth j of atomic grouping in order to distinguish failures of different depths. We motivate this extension through two problematic algorithms, \(\textsc {MemoExit}{\mathrm {\text {-}at}}\) and \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\). Much like in §4, \(\textsc {MemoExit}{\mathrm {\text {-}at}}\) naively extends Algorithm 5 (\(\textsc {Memo}\)) by adding the processing of sub-automaton transitions labeled \(\textsf{at}\) done in Algorithm 2 (Lines 17 to 21), and \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\) is similar to \(\textsc {MemoExit}{\mathrm {\text {-}at}}\), but it writes to the memoization table at the same timing as Algorithm 4 (\(\textsc {DavisSLImpl}\)).

Firstly, we observe that \(\textsc {MemoExit}{\mathrm {\text {-}at}}\) is not linear for a reason similar to Example 3. (A concrete example is given by Example 4.) Therefore, we turn to the other candidate, namely \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\).

We find, however, that \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\) is also problematic. It is not correct.

Example 4

Consider the at-NFA \(\mathcal {A} = \mathcal {A}(a^*(\texttt {?>} a^*) ab) = (P, Q, q_0, F, T)\) shown in Figure 6, and let \(w = {"a^n b"}\) be an input string. \(\textsc {Match}\mathrm {\text {-}(la,at)}_{\mathcal {A},w}(q_0, 0)\) returns \(\textsf{Failure}\)—the atomic grouping \((\texttt {?>} a^*)\) consumes all a’s in w and no a is left for the final ab pattern—but \(\textsc {MemoEnter}\mathrm {\text {-}at}^{M_0}_{\mathcal {A}, w}(q_0, 0)\) returns \(\textsf{SuccessAt}(n+1)\). Thus \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\) is wrong.

For both algorithms, the state \(q_7\) in the \(\textsf{at}\) transition is first reached at position \(i=n\), and then backtracking is conducted, leading to the state \(q_7\) again at \(i=n-1\). The execution of \(\textsc {MemoEnter}{\mathrm {\text {-}at}}\) proceeds as follows.

  • The first execution path consumes all a’s in the loop from \(q_0\) to \(q_2\), reaches \(q_7\) with \(i=n\), eventually leading to failure at \(q_4\) and thus to backtracking. Speculative memoization (\(M(q, i) \leftarrow \textbf{false}\) in Algorithm 4) is conducted in its course; in particular, \(M(q_7, n) = \textbf{false}\) is recorded.

  • After backtracking, the second execution path reaches \(q_7\) with \(i=n-1\); it then visits \(q_8\) once and reaches \(q_7\) with \(i=n\). Now it uses the memoized value \(M(q_7, n) = \textbf{false}\) (cf. Line 4 of Algorithm 4), leading to backtracking to \(q_7\) with \(i=n-1\). It then takes the branch to \(q_{10}\), and the matching for \(\mathcal {A}'\) succeeds. Therefore, the execution reaches \(q_4\) with \(i=n-1\), and the whole matching succeeds.

The last example shows the challenge we are facing, namely the need to distinguish failures of different depths. Specifically, in the previous example, the memoized value \(M(q_7, n) = \textbf{false}\) comes from the failure of matching for the ambient automaton \(\mathcal {A}\); still, it is used to control backtracking in the sub-automaton \(\mathcal {A}'\). This is problematic for atomic grouping, where, roughly speaking, backtracking in an ambient automaton should not cause backtracking in a sub-automaton. Atomic grouping can be nested, so we must track at which depth a failure has happened.

Definition 2 (nesting depth of atomic grouping)

For an at-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\) and a state \(q \in P\), the nesting depth of atomic grouping for q, denoted by \(\nu _\mathcal {A}(q)\), is

$$\nu _\mathcal {A}(q) = \begin{cases} 0 & \text {if } q \in Q \\ 1 + \nu _{\mathcal {A}'}(q) & \text {where } \mathcal {A}' = (P', Q', q_0', F', T') \text { s.t. } T(q') = \textsf{Sub}(\textsf{at}, \mathcal {A}', q'') \text { and } q \in P' \end{cases}$$

We also define the maximum nesting depth of atomic grouping for \(\mathcal {A}\), denoted by \(\nu (\mathcal {A})\), as \(\nu (\mathcal {A}) = \max _{q \in P} \nu _\mathcal {A}(q)\).
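The maximum nesting depth can be computed structurally; here is a small Scala sketch over the types from the sketch in §2.6 (ours, for pure at-NFAs, where every kind label is \(\textsf{at}\)).

```scala
// Maximum nesting depth ν(A) of atomic grouping for a pure at-NFA:
// top-level states contribute 0; states under a Sub(at, A′, –) transition
// contribute 1 plus their depth within A′.
def nu(a: NfaS): Int = {
  val inner = a.t.values.collect { case Sub(At, sub, _) => 1 + nu(sub) }
  if (inner.isEmpty) 0 else inner.max
}
```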

Algorithm 7 is our algorithm for at-NFAs; the type of its memoization tables is \(M:P \times \mathbb {N} \rightharpoonup \{ \textsf{Failure}(j)\mid j \in \{ 0, \dots , \nu (\mathcal {A}) \} \}\). Some remarks are in order.

Note first that the algorithm takes, as its parameters, the whole at-NFA \(\mathcal {A}_{0}\) and its sub-automaton \(\mathcal {A}\) as the algorithm’s current scope. The top-level call is made with \(\mathcal {A}_{0}=\mathcal {A}\) (cf. Thm. 5 and 6); when an \(\textsf{at}\) transition is encountered, the scope goes to the corresponding sub-automaton (\(\mathcal {A}'\) in Line 17).

In Line 9, the \(\textbf{if}\) condition checks that the nesting depth of \(\textsf{Failure}\) equals the depth of the current NFA, and backtracking is performed if and only if it is true. This check is crucial for avoiding the error in Example 4. The remaining cases for \(\textsf{Eps}, \textsf{Branch}, \textsf{Char}\) are similar to Algorithm 5.

The case for \(\textsf{Sub}\) (Lines 15–23) requires some explanation. It is an adaptation of Lines 17–21 of Algorithm 2 with memoization. The apparent complication comes from the set K in \(\textsf{SuccessAt}(i',K)\). The set K is a set of keys for a memoization table M, that is, pairs (q, i) of a state and a position. The role of K is to collect the set of keys of M for which, once failure happens, the entry \(\textsf{Failure}(j)\) has to be recorded (this is done in a batch manner in Line 22). More specifically, once failure happens in an outer automaton (i.e., at a smaller depth j), this has to be recorded as \(\textsf{Failure}(j)\) for inner automata (at greater depths). The set K collects those keys for which this has to be done, starting from inner automata (\(\mathcal {A}'\), Line 18) and going to outer ones (\(\mathcal {A}\), Lines 19–20).

A closer inspection reveals that Line 20 is vacuous in Algorithm 7; however, it is needed when we combine it with look-around at the end of the section.

Algorithm 7 exhibits the desired properties, namely correctness (with respect to Prob. 2) and linear-time complexity. In Thm. 6, f is a function that converts results of Algorithm 7 to results of Algorithm 2; it is defined by \(f(\textsf{Failure}(j)) = \textsf{Failure}\) and \(f(\textsf{SuccessAt}(i', K)) = \textsf{SuccessAt}(i')\).

Theorem 5 (linear-time complexity of Algorithm 7)

For an at-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Memo}\mathrm {\text {-}at}^{M_0}_{\mathcal {A}, \mathcal {A}, w}({q_0, i})\) terminates with O(|w|) recursive calls.

Theorem 6 (correctness of Algorithm 7)

For an at-NFA \(\mathcal {A} = (P, Q, q_0, F, T)\), an input string w, and a position \(i \in \{ 0, \dots , |w| \}\), \(\textsc {Match}\mathrm {\text {-}(la,at)}_{\mathcal {A},w}({q_0, i}) = f(\textsc {Memo}\mathrm {\text {-}at}^{M_0}_{\mathcal {A}, \mathcal {A}, w}({q_0, i}))\).

Thm. 5 and 6 are proved similarly to Thm. 1 and 2; see [15, Appendix B.3]. The following points require some extra care.

Firstly, for linear-time complexity (Thm. 5), there is another recursive call (Line 19) before the return value of a recursive call (Line 17) is memoized (Line 22). If the second recursive call (Line 19) eventually leads to (the same call as) the first call (Line 17) (let’s call this event \((*)\)), then this can nullify the effect of memoization. We prove, as a lemma, that \((*)\) never happens.

Secondly, for correctness (Thm. 6), our conversion of runs should replace an invocation of the memoization table—if it returns a failure with a shallower depth—with not only the corresponding run (as before) but also the run of the second recursive call (Line 19). See [15, Appendix B.3] for details.

Combination with Look-around It is also possible to combine Algorithm 6 (for look-around) with Algorithm 7 (for atomic grouping). In this case, the type of memoization tables becomes \(M:P \times \mathbb {N} \rightharpoonup \{ \textsf{Failure}(j)\mid j \in \{ 0, \dots , \nu (\mathcal {A}) \} \} \cup \{ \textsf{Success} \}\), and nesting depths of atomic grouping are reset by look-around operators. A complete algorithm can be found in [15, Appendix C]; it also exhibits the desired properties.

6 Experiments and Evaluation

Implementation We implemented the algorithm proposed in this paper for evaluation. We call our implementation memo-regex. It is written in 1368 lines of Scala.

memo-regex supports both look-around (i.e., look-ahead and look-behind) and atomic grouping. We implemented a regex parser ourselves. Backtracking is implemented by managing a stack manually rather than using a recursive function, to prevent stack overflow. In this implementation, the memoization keys are pushed onto the stack, and recording these keys in a memoization table is done during backtracking. We used the mutable HashMap from the Scala standard library as the data structure for memoization tables.

memo-regex also supports capturing sub-matchings. However, this feature cannot be used within atomic grouping and positive look-around because memoization discards the sub-matching information.

The code of memo-regex, as well as all experiment scripts, is available [14].

Efficiency of Our Algorithm We conducted experiments to assess the performance of our memo-regex, in particular in comparison with other existing implementations.

Table 2. our benchmark regexes and input strings

As target regexes, we looked for those with look-around and/or atomic grouping in the real-world regexes posted on regexlib.com. We then identified—by manual inspection—four regexes \(r_1,\dotsc , r_4\) that are subject to potential catastrophic backtracking. These regexes are shown in Table 2. We then crafted input strings \(w_1, \dotsc , w_4\), respectively, so that they cause catastrophic backtracking. Specifically, \(r_1\) contains positive look-ahead and negative look-ahead; this positive look-ahead is used for restricting the length of input strings. The regexes \(r_2\) and \(r_3\) are themselves positive look-ahead and look-behind, respectively; both include negative look-ahead, too. The regex \(r_4\) includes atomic grouping and negative look-ahead.

For these regexes, we measured matching time using memo-regex on OpenJDK 21.0.1. We compared it with the following implementations: Node.js 20.5.0, Ruby 3.1.4, and PCRE2 10.42 (used by PHP 8.3.1, with and without JIT). All of these implementations use backtracking; Ruby and PCRE2 have restrictions on regexes inside look-behind, and Node.js does not support atomic grouping. Each experiment was performed 10 times and the average was taken. Furthermore, for memo-regex, we measured the size of its memoization table by its memory usage, using jamm. The experiments were conducted on a MacBook Pro 2021 (Apple M1 Pro, RAM: 32 GB).

Fig. 7. result for \(r_1\)

Fig. 8. matching time for \(r_2, r_3\) and \(r_4\)

We show the results in Figures 7 and 8. Note that the values of n are different depending on whether the matching time complexity is \(O(n^2)\) or \(O(2^n)\). Results for some implementations are absent for \(r_{3}\) and \(r_4\) because of the syntactic restrictions discussed above.

In Figures 7 and 8, we observe clear performance advantages of memo-regex. In particular, its linear-time complexity and linear memory consumption (memoization table size) are experimentally confirmed.

Real-world Usage of Look-around and Atomic Grouping We additionally surveyed the use of the regex extensions of our interest, in order to confirm their practical relevance.

We used a regex dataset collected by a 2019 survey [9]. This dataset contains 537,806 regexes collected from the source code of real-world products.

We tallied the usage of each regex extension by parsing these regexes in the dataset with our parser in memo-regex. 8,679 regexes could not be parsed or compiled; this is due to back-reference for 4,360 regexes, unsupported syntax (Unicode character class, conditional branching, etc.) for 4,134 regexes, and too large or semantically invalid regexes for the other 184 regexes. We adopted the remaining 529,127 regexes for tallying.

Table 3. regex extension usage

The result is shown in Table 3. Note that 1) the numbers for look-ahead and look-behind do not include simple zero-width assertions such as \(\mathtt {\hat{\,}}\) (line-begin) or $ (line-end), and 2) that of atomic grouping includes possessive quantifiers such as *+ and ++.

In Table 3, we observe that 17,167 regexes (3.2%) in the dataset use at least one of the extensions we studied in this paper. While the ratio is not very large, the absolute number (17,167 regexes) is significant; this implies that there are a number of applications (such as web services) that rely on the regex extensions. Thereby we confirm the practical relevance of these regex extensions.

7 Conclusions and Future Work

In this paper, we proposed a backtracking algorithm with memoization for regexes with look-around and atomic grouping. It is the first linear-time backtracking matching algorithm for such regexes. It also fixes problems of the memoization matching algorithm in [10] for look-ahead. We implemented the algorithm; our experimental evaluation confirms its performance advantage.

One direction of future work is to support more extensions. Our implementation does not support a widely used regex extension, namely back-references. In the recent work [10], back-reference was supported by additionally recording captured positions in memoization tables. We expect that a similar idea is applicable to our algorithm.

Combination with selective memoization (used in [10]; see [15, Appendix A]) is another direction. We believe it is possible, but it will require a more detailed discussion on how to handle sub-automata in the selective memoization scheme.