Ensuring the Correctness of Regular Expressions: A Review

Regular expressions are widely used within and even outside of computer science due to their expressiveness and flexibility. However, regular expressions have a quite compact and rather tolerant syntax that makes them hard to understand, hard to compose, and error-prone. Faulty regular expressions may cause failures in the applications that use them. Therefore, ensuring the correctness of regular expressions is a vital prerequisite for their use in practical applications. The importance and necessity of ensuring correct definitions of regular expressions have attracted extensive attention from researchers and practitioners, especially in recent years. In this study, we provide a review of the recent works for ensuring the correct usage of regular expressions. We classify those works into different categories, including empirical study, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, we review the main results, compare different approaches, and discuss their advantages and disadvantages. We also discuss some potential future research directions.


Introduction
Regular expressions are widely used within and even outside of computer science due to their expressiveness and flexibility. The importance of regular expressions for constructing the scanners of compilers is well known [1] . Nowadays, their applications extend to more areas such as network protocol analysis [2] , MySQL injection prevention [3] , network intrusion detection [4] , XML data specification [5] , and database querying [6] , or more diverse applications like DNA sequence alignment [7] . Regular expressions are commonly used in computer programs for pattern searching and string matching. They are a core component of almost all modern programming languages, and frequently appear in software source codes. Studies have shown that more than a third of JavaScript and Python projects contain at least one regular expression [8,9] .
However, recent research has found that regular expressions are hard to understand, hard to compose, and error-prone [10,11]. Indeed, regular expressions have a quite compact and rather tolerant syntax that makes them hard to understand, even when they are very short. For example, it is not easy for users to grasp immediately what strings the regular expression "\[([^\]]+)\]|\(([^\)]+)\)" specifies. It becomes much more difficult for complex regular expressions, which may contain more than 100 characters or have more than ten nesting levels [12]. This is a real concern for software developers. For example, on the popular website stackoverflow.com, where developers learn and share their programming knowledge, more than 235,000 questions are tagged with "regex".
Faulty regular expressions may cause failures in the applications that use them. Therefore, ensuring the correctness of regular expressions is a vital prerequisite for their use in practical applications. In fact, the importance of ensuring the correctness of regular expressions or other structural description models has already been recognized by some researchers. Klint et al. [13] used the term "grammarware" to refer to all software that involves grammar knowledge in an essential manner. Here, grammar is meant in the sense of all structural descriptions or descriptions of structures used in software systems, including regular expressions, context-free grammars, etc. They noted that "In reality, grammarware is treated, to a large extent, in an ad-hoc manner with regard to design, implementation, transformation, recovery, testing, etc." Take the testing of regular expressions as an example: A survey of professional developers reveals that developers test their regular expressions less than the rest of their code [8]. Indeed, an empirical study shows that about 80% of the regular expressions used in practical projects are not tested, and among those tested, about half use only one test string, which is far from sufficient [14]. Hence, sound and systematic methods and techniques are necessary to improve the quality of such software components.
The importance and necessity of checking the correctness and thus improving the quality of regular expressions have attracted extensive attention from researchers and practitioners, especially in recent years. In this article, we provide a review of the recent works related to this issue. We classify the related works into different categories, including empirical study, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, we review the main results, compare different approaches, and discuss their advantages and disadvantages.
The rest of this article is organized as follows: Section 2 introduces preliminary knowledge on regular expressions, their different dialects, the meaning of correctness, and finite automata. Sections 3−8 review the relevant works on the correctness assurance of regular expressions, organized by category. Section 9 concludes with a summary and a discussion of future work.

Formal definition
Regular expressions arose in the context of formal language theory [15]. Simply speaking, a regular expression is a sequence of characters that defines a (possibly infinite) set of strings, called the language described by the regular expression. Formally, a regular expression over an alphabet Σ of symbols is defined recursively as follows. The empty set Ø, the empty string ε, and each symbol a ∈ Σ are regular expressions, denoting the empty set Ø, the set containing only the empty string, and the set containing only the symbol a, respectively. Suppose that R and S are two regular expressions. Then the concatenation RS, the alternation R|S, and the Kleene star R* are regular expressions, denoting, respectively, the set of strings that can be obtained by concatenating a string described by R and a string described by S (in that order), the union of the sets of strings described by R and S, and the set of all strings that can be obtained by concatenating any finite number (including zero) of strings from the set described by R. A string w belonging to the language defined by a regular expression R is called positive or accepted by R; otherwise it is called negative or rejected by R. For a given language, there exist many regular expressions that describe it.
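The recursive semantics above can be sketched operationally. The following is a minimal Python sketch (not from the survey): each constructor returns a bounded approximation of the denoted language, truncated to strings of length at most MAX_LEN so the sets stay finite.

```python
# Bounded approximation of the recursive semantics of regular expressions:
# each function returns the set of strings of length <= MAX_LEN in the
# language denoted by the corresponding constructor.
MAX_LEN = 3

def sym(a):                  # the expression consisting of the single symbol a
    return {a}

def concat(L1, L2):          # RS: pairwise concatenation, truncated
    return {u + v for u in L1 for v in L2 if len(u + v) <= MAX_LEN}

def union(L1, L2):           # R|S: set union
    return L1 | L2

def star(L):                 # R*: iterate concatenation until no new strings
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, L) - result
        result |= frontier
    return result

# (a|b)* over the alphabet {a, b}, truncated at length 3
lang = star(union(sym("a"), sym("b")))
```

Here the truncation makes the Kleene star computation terminate; the full language of "(a|b)*" is of course infinite.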

Different dialects
Apart from the above standard regular operators, extensions have been added to regular expressions to enhance their ability to specify string patterns. For example, the operators +, ?, {m,n}, and && specify one or more repetitions, zero or one repetition, at least m and at most n repetitions, and the interleaving or shuffling [16] of strings, respectively. Different applications may support different additional operators. For instance, the XML schema language DTD permits only the additional operators + and ?, while XSD [17] further supports the operator {m,n}. The XML schema language Relax NG [18] additionally allows the use of an interleaving operator that specifies unordered concatenations of strings.
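As a quick illustration of these repetition operators, assuming Python's re dialect (which, like most engines, has no interleaving operator &&):

```python
import re

# +: one or more repetitions
assert re.fullmatch(r"ab+", "abbb")
assert not re.fullmatch(r"ab+", "a")

# ?: zero or one repetition
assert re.fullmatch(r"ab?", "a")
assert not re.fullmatch(r"ab?", "abb")

# {m,n}: at least m and at most n repetitions
assert re.fullmatch(r"a{2,4}", "aaa")
assert not re.fullmatch(r"a{2,4}", "aaaaa")
```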
In particular, in the field of string pattern matching, additional notions and a more compact syntax are used to describe string patterns, such as the character class, character range, complement, and wildcard. Different regular expression pattern matching engines are not fully compatible with one another: The syntax and behavior of a particular regular expression engine may differ from the others. Although Perl compatible regular expressions (PCRE) [19] and portable operating system interface of UNIX (POSIX) regular expressions [20] have greatly influenced the features of most regular expression engines, they have not been standardized yet. Thus, many regular expression dialects exist in practice. Some regular expression engines even contain non-regular operators, such as backreferences or look-around assertions. For convenience, Table 1 illustrates the syntax and operators supported in most regular expression dialects.
Among the existing works on ensuring the correctness of regular expressions, some take the formal definition of regular expressions into account, while some consider various dialects. In this survey, we do not distinguish the differences of the target regular expressions. We classify and compare those works according to the topics on which they focus.

Correctness
Simply speaking, a regular expression is correct if it does exactly what its designers and users intend it to do − no more and no less. Correctness involves two levels: syntactic correctness and semantic correctness. Syntax errors can be checked by compilers. However, when regular expressions are embedded in source programs, the compilers usually treat them as literal strings, and syntactic errors are reported only by throwing exceptions at run time. Thus, tools have been developed to help developers compose syntactically correct regular expressions. For example, almost all regular expression editors provide syntax highlighting or syntax colouring.
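A minimal Python illustration of this point: the faulty pattern below is just an ordinary string literal as far as the language compiler is concerned, and the syntax error surfaces only when the pattern is compiled by the regex engine at run time.

```python
import re

bad_pattern = r"\[([^\]]+\]"   # faulty: the group '(' is never closed
try:
    re.compile(bad_pattern)
    compiles = True
except re.error:               # raised at run time, not at compile time
    compiles = False
```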
The semantic correctness of a regular expression means that the language it defines complies with the intended language, i.e., it meets the users′ requirements. More specifically, a regular expression is semantically correct if it defines all the strings intended to be accepted and does not define any string intended to be rejected. Ensuring semantic correctness is difficult since the intended language is not easy to specify formally. Different methods and approaches have been proposed to check the semantic correctness of regular expressions or assist developers in writing semantically correct regular expressions. For example, one can generate a set of strings from the regular expression to validate whether they conform to the users′ intention, or automatically learn a regular expression from a set of intended strings given by users. Almost all the works reviewed in this paper are devoted to ensuring the semantic correctness of regular expressions.

Automata
Formally, a non-deterministic finite automaton (NFA) is a 5-tuple A = (Σ, Q, q_0, F, δ), in which Σ is the alphabet, Q is a finite set of states, q_0 is the initial state, F is the set of final states, and δ is the transition function that maps each pair of a state and a symbol to a set of states. A string w = a_1a_2···a_n is accepted by an automaton A if and only if there exists a sequence of states q_0q_1···q_n such that q_n is a final state and q_i ∈ δ(q_{i−1}, a_i) for each i ∈ [1, n]. A deterministic finite automaton (DFA) is a special case of an NFA, in which the transition function δ maps each pair of a state and a symbol to a singleton set or the empty set Ø. If a regular expression contains only regular operators, it can be equivalently converted to an NFA or a DFA representation. Regular expressions using non-regular features such as backreferences may require more complex automaton representations.
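The acceptance condition above amounts to tracking the set of reachable states. A small sketch (an illustrative toy, not from the survey) for an NFA accepting the language of "(a|b)*a", with Q = {0, 1}, q_0 = 0, F = {1}:

```python
# Transition function delta: (state, symbol) -> set of successor states.
DELTA = {
    (0, "a"): {0, 1},   # loop in state 0, or guess this 'a' is the last one
    (0, "b"): {0},
}
FINALS = {1}

def accepts(w):
    states = {0}                       # start in the initial state q_0
    for a in w:
        # follow all transitions from all currently reachable states
        states = set().union(*(DELTA.get((q, a), set()) for q in states))
    return bool(states & FINALS)       # accept iff some final state is reached
```

The subset-tracking loop is exactly the on-the-fly version of the classical subset construction that turns this NFA into a DFA.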

Empirical study
To ensure the correctness and improve the quality of regular expressions, it is important to first investigate the practical issues developers face when using regular expressions: for instance, what kinds of bugs developers mainly deal with, what operators/features are most used in practical applications, what kinds of representations are more understandable than others, etc. Such investigations are usually conducted as empirical studies. In this section, we review the recent results of empirical research on regular expressions.

Testing status and bugs classification
Developers have reported that they test their regular expressions less than the rest of their code [8]. Wang and Stolee [14] further investigated how thoroughly regular expressions are tested by examining open source projects. They used test metrics for graph-based coverage [21] over the DFA representation of regular expressions, including node coverage, edge coverage, and edge-pair coverage. Their evaluation shows that only 16.8% of the regular expressions are executed by the test suites; among the tested regular expressions, nearly half (41.9%) have only one test string, and a majority (72.7%) use only positive input strings or only negative input strings. Edge coverage and edge-pair coverage are both very low (less than 30% on average). These results reveal that regular expressions are not as well tested in practice as expected, which may be responsible for many software faults.
In a follow-up work, Wang et al. [22] presented an empirical study on regular expression bugs in real-world open-source projects. By analyzing a sample of merged pull requests for bugs related to regular expressions, they show that about half of the bugs are caused by incorrect regular expression behavior, and the remaining are caused by incorrect API usage and other code issues that require regular expression changes in the fix. Among the incorrect regular expression behaviors, about two-thirds fall into the category of rejecting valid strings, one-fifth accept invalid strings, and one-tenth both accept invalid strings and reject valid strings. This indicates that developers are more likely to write regular expressions that are more constrained than intended. The authors also observe that fixing regular expression bugs is nontrivial, as it takes more time and more lines of code to fix them compared with general pull requests.

Feature usage and evolution
Chapman and Stolee [8] studied how regular expressions are used in practice and what features of regular expressions are most commonly used. Their findings show that regular expressions are frequently used to locate content within a file, capture parts of strings, and parse user inputs. They identified the eight regular expression operators/features most commonly used in Python projects, including one-or-more repetitions, grouping (parentheses), zero-or-more repetitions, character ranges, etc. In a follow-up work, Chapman et al. [23] further explored how different features impact the readability, comprehension, and understandability of regular expressions. Their study reveals that some features make regular expressions more readable and understandable, while others do not. For example, "\d" is semantically equivalent to "[0-9]". While "\d" is more succinct, "[0-9]" may be easier for developers to read and thus may help to avoid potential errors.
The work of [24] is devoted to regular expression evolution, studying the syntactic and semantic differences and feature changes of regular expressions over time. The main results include: most edited expressions have a syntactic distance of 4−6 characters from their old versions; over half of the edits tend to expand the scope of the expression, indicating that the old versions define languages smaller than the intended ones. These results can help to design mutation operators and repair operators to assist with the testing and fixing of regular expressions.

Composition, re-use, and risks
Bai et al. [25] conducted an exploratory case study to find out how developers compose their regular expressions during the development of their projects. They find that a large majority of developers search online resources such as Q&A sites (e.g., stackoverflow.com, online forums), repositories, or libraries (e.g., RegExLib [26]) during their problem-solving tasks. This confirms that writing a correct regular expression is usually difficult, and developers may prefer to re-use or modify an existing expression rather than write one from scratch. An earlier study [27] reported that the use of regular expressions is becoming highly repetitive even though they are being used more and more often; on the most popular websites gathered in that study, only 4% of the regular expressions are unique. However, as noted in [25], syntactic errors often arise when participants copy/paste expressions from another language into their projects.
Davis et al. [28] pointed out that even if no syntactic errors appear when a regular expression is re-used, its semantic and performance characteristics may not be preserved. They surveyed a number of professional developers to understand their regular expression re-use practices and empirically measured the semantic and performance portability problems introduced by re-using regular expressions. Their results show that although most regular expressions compile across language boundaries without syntactic errors, 15% exhibit semantic differences and 10% exhibit performance differences across languages. This implies that the re-use of regular expressions may introduce bugs. Semantic inconsistency means that the same regular expression may match different sets of strings across programming languages, resulting in logical errors. Performance inconsistency means that the same regular expression may have different worst-case performance behavior, resulting in regular expression denial of service (ReDoS) vulnerabilities [9,29]. Moreover, they state that they have identified hundreds of source modules containing potential semantic bugs or potential security vulnerabilities.
To find out how developers work with regular expressions and the difficulties they face, Michael et al. [30] provided a study of the regular expression development cycle by interviewing a number of professional developers. In general, developers say that regular expressions are hard to read, hard to search for, and hard to validate. Moreover, most developers are unaware of the critical security risks that can occur when using regular expressions.

Determinism
A regular expression is said to be deterministic if, when matching a string from left to right against the expression, we always know definitely the next symbol to be matched in the expression without looking ahead in the string [31]. For instance, "(a|b)*a" is not deterministic, as the first symbol of the string "aaa" could be matched by either the first or the second a in the expression. Without looking ahead, it is impossible to know which one to choose. The equivalent expression "b*a(b*a)*", on the other hand, is deterministic. Deterministic regular expressions allow more efficient pattern matching than general ones. Several decision problems also behave better for deterministic expressions. For example, language inclusion for general expressions is PSPACE-complete but is tractable when the expressions are deterministic. Some applications require the determinism of regular expressions, while others do not. For example, the W3C specification requires that the content models of the XML schema languages DTD and XSD [17] must be deterministic regular expressions, while there are no determinism restrictions on Relax NG [18] or on regular expressions used for string pattern matching. The work in [32] is devoted to finding out how deterministic real-world regular expressions are. Li et al. [32] found that more than 98% of regular expressions in Relax NG are deterministic, although Relax NG does not impose the determinism constraint on its content models. Besides, more than half of the regular expressions from RegExLib are deterministic. These results indicate that deterministic regular expressions are commonly used in practice. Therefore, exploring effective methods to ensure the quality of such expressions is worthy of attention.
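The claimed equivalence of the two example expressions can be sanity-checked by brute force. The following Python sketch (a bounded enumeration, not a decision procedure for equivalence) compares "(a|b)*a" against its deterministic rewrite "b*a(b*a)*" on every string over {a, b} up to a given length:

```python
import re
from itertools import product

R1 = re.compile(r"(a|b)*a")      # non-deterministic form
R2 = re.compile(r"b*a(b*a)*")    # deterministic rewrite

def agree_up_to(n):
    """Return True if R1 and R2 accept exactly the same strings of length <= n."""
    for length in range(n + 1):
        for chars in product("ab", repeat=length):
            w = "".join(chars)
            if bool(R1.fullmatch(w)) != bool(R2.fullmatch(w)):
                return False
    return True
```

Both expressions denote the strings over {a, b} that end in a, so the check passes for any bound; a real equivalence test would compare minimal DFAs instead.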

Others
Davis et al. [33] performed an empirical study comparing two different notations (textual and graphical) of regular expressions and considering factors such as the length of regular expressions to determine how these factors affect readability. They used the time required for finding the shortest strings as the primary measurement in their experiments. Their findings show that the graphical notation of regular expressions is much more readable than the textual notation, and that length has a strong effect on a regular expression′s readability, while the participants′ background shows no measurable effect. Different empirical studies may follow different methodologies to extract regular expressions, e.g., from different sources or written in only one or two programming languages. Zheng et al. [34] tried to find out whether existing empirical research results can be generalized. They report that significant differences exist in some characteristics across programming languages and suggest that empirical methodologies should take the programming languages into account, as generalizability is not always assured for regular expressions in different programming languages.

Test string generation
Testing is a common way to ensure the correctness of regular expressions. The purpose of regular expression testing is to check whether the defined language meets the specification of users. One straightforward way to achieve this purpose is to automatically generate a number of strings from the regular expression under testing and check whether they comply with the intended language. The generated test strings can be positive or negative. If users find that a generated positive string should be rejected or a generated negative string should be accepted, this indicates that the regular expression is incorrectly defined.

Coverage based generation
Coverage criteria are used to measure the quality of a particular test set and to provide strategies for test data generation algorithms. Coverage criteria are usually defined with respect to programs. In [33], a notion of pairwise coverage defined for regular expressions is proposed. The idea is adapted from pairwise testing [35], a combinatorial method for testing software systems. For each pair of input parameters to a software system, pairwise testing tries to test all possible combinations of these two parameters. Similarly, pairwise coverage for regular expressions tests all possible combinations of any two subexpressions concatenated in the regular expression under test. Furthermore, to avoid generating infinitely many strings, pairwise coverage restricts the Kleene star * to three typical possibilities: zero, one, and more-than-one repetitions. Consider the regular expression "a*b*c*" for example. The string sets {ε, a, aaa}, {ε, b, bb} and {ε, c, cccc} cover the three typical repetitions for the subexpressions "a*", "b*" and "c*", respectively. The pairwise coverage criterion requires that all combinations of any two of these three string sets be covered. For instance, the string "aaabbcccc" covers the combinations ("aaa", "bb"), ("aaa", "cccc") and ("bb", "cccc"). The string set achieving pairwise coverage for a regular expression is not necessarily unique. A string generation algorithm that, given a regular expression as input, outputs a small set of strings satisfying the pairwise coverage criterion has been implemented. The algorithm generates only positive strings. Besides, it considers the formal definition of regular expressions and supports only the basic operators (concatenation, alternation, and Kleene star) and two extended regular operators (counting and interleaving).
The algorithm can be used for testing content models of XML schemas but may not be suitable to be used directly for testing expressions in other specialized applications such as string pattern matching, where regular expressions have a very different syntax.
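The pairwise idea for "a*b*c*" can be sketched as follows. This is a hypothetical greedy implementation written for this example (not the algorithm of [33]): each starred subexpression gets its three typical repetition counts, and one string is emitted for every pair of choices not yet covered.

```python
import re
from itertools import combinations

# Three typical repetition choices for each starred subexpression of a*b*c*.
reps = {
    "a*": ["", "a", "aaa"],
    "b*": ["", "b", "bb"],
    "c*": ["", "c", "cccc"],
}

def pairwise_strings():
    parts = list(reps)
    covered = set()          # covered (position, value, position, value) pairs
    strings = []
    for p1, p2 in combinations(range(len(parts)), 2):
        for v1 in reps[parts[p1]]:
            for v2 in reps[parts[p2]]:
                if (p1, v1, p2, v2) in covered:
                    continue
                # build a string realizing this pair; other positions default
                choice = [reps[p][0] for p in parts]
                choice[p1], choice[p2] = v1, v2
                strings.append("".join(choice))
                # record every pair this string happens to cover
                for q1, q2 in combinations(range(len(parts)), 2):
                    covered.add((q1, choice[q1], q2, choice[q2]))
    return strings

strs = pairwise_strings()
# by construction, every generated string is positive for a*b*c*
assert all(re.fullmatch(r"a*b*c*", s) for s in strs)
```

Note that greedily skipping already-covered pairs keeps the set small; a published algorithm would also minimize the number of strings more aggressively.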
Egret [11] generates strings from regular expressions based on the underlying automata. It first converts a regular expression into a specialized automaton, then derives a set of basis paths of the resulting automaton, and finally creates strings from the basis paths. In this sense, we may say that the generated strings cover the basis paths of the specialized automaton. Egret focuses on regular expression patterns, i.e., expressions used in manipulating text strings. Compared with pairwise coverage-based generation, it allows more regular operators, such as the character class. We next use the regular expression "a?[2-9](b|c)d{3}(e|f)" as an example to explain Egret′s generation process. The generation is divided into three steps.
Step 1. Egret first converts the expression into a specialized automaton, as shown in Fig. 1. In this automaton, transitions can be labeled by an individual symbol, a character set, or an epsilon. Special "begin repeat" and "end repeat" states are added to the beginning and end of each repeating operator, respectively.
Step 2. The specialized automaton is then traversed to obtain a set of basis paths, and for each basis path, an initial test string is generated. For the automaton in Fig. 1, we have the following basis paths and initial sets of strings.
(The table of basis paths and their initial strings is omitted here.)

Step 3. This step creates additional strings from the initial set of strings. Two strategies are used: 1) altering the number of iterations of each repeat operator and 2) changing the character used for a character set. For instance, consider the initial string "a2bddde" derived from Path 1. By replacing "a" with ε and "aa" (0 and 2 repetitions for "a?"), it obtains one positive string "2bddde" and one negative string "aa2bddde". By replacing "2" with "0" (a digit outside [2-9]), it obtains one negative string "a0bddde". By replacing "ddd" with "dd" and "dddd" (2 and 4 repetitions for {3}), it obtains two negative strings, "a2bdde" and "a2bdddde".
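The first strategy of Step 3 can be sketched in a few lines. This is a hypothetical simplification, not Egret's implementation, assuming Python's re; the example expression is written with a literal d so that it accepts the example strings such as "a2bddde".

```python
import re

REGEX = re.compile(r"a?[2-9](b|c)d{3}(e|f)")

def repeat_variants(prefix, unit, suffix, counts):
    """Vary the repetition count of one repeat operator and classify
    each resulting string as positive (True) or negative (False)."""
    out = {}
    for k in counts:
        s = prefix + unit * k + suffix
        out[s] = bool(REGEX.fullmatch(s))
    return out

# Vary the "a?" part of the initial string "a2bddde": 0 and 2 repetitions.
variants = repeat_variants("", "a", "2bddde", [0, 2])
```

As in the example above, "2bddde" comes out positive while "aa2bddde" comes out negative.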
Due to the different creation strategies for additional strings in Step 3, Egret can produce not only positive test strings but also negative test strings. However, Egret′s generation strategy, especially for negative strings, is relatively simple. For example, "(a|b)+" yields only one negative string, ε, and "(a|b)*" yields no negative strings. Some faults may thus not be revealed for complex regular expressions.

Mutation based generation
Mutation testing is a fault-based technique to design new test data or to evaluate the quality of existing test data [36]. It involves introducing small changes, called mutations, into the software artifact under test; these represent typical mistakes that developers could make. Each mutated version is called a mutant. A test set detects and rejects a mutant when some test causes the behavior of the mutant to differ from that of the original version; this is called killing the mutant. Arcaini et al. [37] applied mutation techniques to regular expressions for test string generation. They identified a family of possible faults in regular expression definitions representing the common mistakes programmers make when writing regular expressions. A set of mutants is created according to those faults, and then a set of critical strings that are able to distinguish the given regular expression from its mutants is generated. We next briefly describe this generation approach with an example.
One possible fault in defining regular expressions is that a programmer could have used the wrong repetition operator. Suppose that a programmer wants to define a regular expression accepting all the non-empty sequences of digits; they could have wrongly written it as "[0-9]*" instead of the correct "[0-9]+". According to this fault, Arcaini et al. [37] proposed the mutation operator quantifier change (QC), which mutates each simple repetition operator into another simple repetition operator, and, for each user-defined operator {m}, creates a mutant in which m (and also n if the operator is {m,n}) is increased and a mutant in which it is decreased. Given a regular expression R and one of its mutants M, a string s is said to distinguish R from M if s is accepted by R and not by M, or vice versa. That is, s is a string of the symmetric difference between R and M. For example, the empty string ε is distinguishing for the regular expression "[0-9]*" and its mutant "[0-9]+". As long as the mutant is not equivalent to the original expression, one can always find such distinguishing strings. The string generation algorithm developed in [37] uses 14 mutation operators. For a regular expression R, the algorithm first mutates it to obtain a set of mutants, and then for each mutant, it computes the symmetric difference of the automaton representations of the original regular expression and the mutant. If the symmetric difference is not empty (which means that the mutant is not equivalent), it randomly selects a distinguishing string from the symmetric difference set.
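The distinguishing-string search can be illustrated by brute force. The sketch below (a bounded enumeration over short strings, not the automaton-based symmetric-difference construction used in [37]) finds a string on which a regex and its QC mutant disagree:

```python
import re
from itertools import product

original = re.compile(r"[0-9]*")   # what the programmer wrote
mutant = re.compile(r"[0-9]+")     # QC mutant: * changed to +

def distinguishing_string(r1, r2, alphabet="01", max_len=3):
    """Return the shortest string (over the given alphabet) in the
    symmetric difference of the two languages, or None if none exists
    up to max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            w = "".join(chars)
            if bool(r1.fullmatch(w)) != bool(r2.fullmatch(w)):
                return w
    return None
```

For this pair, the search immediately returns the empty string ε, which "[0-9]*" accepts and "[0-9]+" rejects.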
This mutation-based generation can produce both positive strings and negative strings. However, since the number of mutants generated by the 14 mutation operators is very large, the generation process usually takes much time, and the generated string set may be very large, especially for long and complex regular expressions.
Some researchers adopted the technique of mutation testing to validate XML schemas [38,39] . Since the content models of XML data are described by regular expressions, schema testing reduces to some extent the testing of expressions. Mutation operators applied to different XML schema components are proposed, and mutation-based generation of XML data is implemented. The generation algorithm usually produces only positive XML data.

Sampling and enumeration
In formal language theory, sampling and enumerating are two fundamental problems concerning the generation of regular languages. Sampling focuses on generating a uniformly random string of length n of a regular language so that strings of length n in that language all have the same probability of being generated [40,41] . Enumerating tries to enumerate all the distinct strings or all strings of length n of a regular language in lexicographical order [42,43] . However, most of the existing sampling and enumerating algorithms take finite automata or regular grammars as their inputs. Radanne and Thiemann [44] presented an enumeration-based algorithm that takes regular expressions as direct input. These regular expressions can be extended with intersection and complement operators. The algorithm generates both positive and negative strings, which can be used to test regular expression parsers as well as to test regular expressions themselves.
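A naive enumeration can be sketched by brute force over Σ^n (this is an illustrative sketch, not the algorithm of [44]): list, in lexicographical order, the strings of length n accepted and rejected by a regular expression over a small alphabet.

```python
import re
from itertools import product

def enumerate_length_n(pattern, alphabet, n):
    """Split Sigma^n into positive and negative strings for the pattern,
    in lexicographical order."""
    r = re.compile(pattern)
    positive, negative = [], []
    for chars in product(sorted(alphabet), repeat=n):   # lexicographic order
        w = "".join(chars)
        (positive if r.fullmatch(w) else negative).append(w)
    return positive, negative

# Length-3 strings over {a, b} for "a(a|b)*b": accepted iff it starts
# with a and ends with b.
pos, neg = enumerate_length_n(r"a(a|b)*b", "ab", 3)
```

This costs |Σ|^n per length and is only workable for tiny alphabets and lengths, which is exactly why dedicated enumeration algorithms such as [44] avoid exhaustive search.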
There are many practical tools available online for generating strings from regular expressions, such as Xeger [45], Exrex [46], Generex [47], and Regldg [48]. These tools generate strings either randomly or systematically. For example, Xeger randomly generates test strings for a regular expression and allows users to specify how many strings to generate. Exrex produces all matched strings for a given regular expression; when the set of matched strings is infinite, users are asked to restrict the number of repetitions of the Kleene star operator or the number of generated strings. All these tools generate only positive strings. The fault detection ability of the generated strings is not quite satisfactory compared with coverage-based or mutation-based generation methods, as illustrated in [11,33,49].

Learning
Since writing a regular expression for a specific task can be time-consuming and error-prone and requires special skills and familiarity with the formalism involved, some researchers have been working on synthesizing or learning regular expressions automatically, either from a set of sample strings or from a natural language specification written by a human being.

Learning from examples
The problem of synthesizing regular languages from examples is a traditional topic in formal language theory. Prior works concentrated mostly on learning deterministic finite automata [50,51]. Recent research has begun to consider regular expressions as the learning target. The examples can contain positive strings or both positive and negative strings.
Brauer et al. [52] devised an automaton-based approach that learns regular expressions for information extraction from positive samples only. Bartoli et al. [53] developed a system that automatically creates a regular expression from a set of input strings annotated by users, indicating which parts of the strings should be matched (positive) and which should not (negative). The learned regular expressions are expected to generalize the matching behavior represented by the examples. They demonstrated that their system achieves performance, in both time and accuracy, comparable to that of an experienced developer [54]. The system adopts genetic programming for the generation of regular expressions. Genetic programming is a heuristic technique that searches for an optimal, or at least suitable, solution to a target problem. It starts with an initial set of candidate solutions, usually built at random, and repeatedly evolves them, building new candidates from existing ones using genetic operators while discarding the worst candidates. A problem-dependent function called fitness quantifies the ability of each candidate to solve the target problem. The evolution is repeated a predefined number of times or until a satisfying solution is found, for example one with perfect fitness. The system developed in [53] adapted this framework to regular expression learning from examples: each candidate regular expression is represented by an abstract syntax tree, and two fitness functions (one concerning the length of the candidate expression and the other concerning how the candidate matches the given examples) measure the quality of each candidate.
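The genetic programming loop described above can be sketched as follows. This is a deliberately minimal illustration with hypothetical fitness weights, mutation-only evolution, and a toy atom set over a binary alphabet; it is not the system of [53], which represents candidates as abstract syntax trees and uses richer genetic operators:

```python
import random
import re

POS = ["ab", "aab", "aaab"]   # example strings to accept
NEG = ["b", "ba", ""]         # example strings to reject
ATOMS = ["a", "b", "a*", "a+", "b?", "ab"]  # hypothetical building blocks

def fitness(pattern):
    """Higher is better: classification accuracy minus a length penalty."""
    try:
        r = re.compile(pattern)
    except re.error:
        return -1.0  # syntactically invalid candidates rank last
    correct = sum(1 for s in POS if r.fullmatch(s)) + \
              sum(1 for s in NEG if not r.fullmatch(s))
    return correct - 0.01 * len(pattern)

def mutate(pattern):
    """Randomly append, drop, or replace a character of the candidate."""
    op = random.choice(["append", "drop", "replace"])
    if op == "append" or not pattern:
        return pattern + random.choice(ATOMS)
    i = random.randrange(len(pattern))
    if op == "drop":
        return pattern[:i] + pattern[i+1:]
    return pattern[:i] + random.choice(ATOMS) + pattern[i+1:]

def evolve(pop_size=30, generations=200, seed=0):
    """Keep the fitter half of the population, refill it with mutants."""
    random.seed(seed)
    population = [random.choice(ATOMS) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        population = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        population += [mutate(random.choice(population)) for _ in range(pop_size // 2)]
        cand = max(population, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand
    return best
```

With the toy examples above, a candidate such as "a+b" classifies all six examples correctly, so the length-penalized fitness guides the search toward short, consistent expressions.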
Along the same lines as [53], subsequent regular expression learning algorithms have been proposed for entity extraction [55] and text extraction [56] . These genetic programming-based learning techniques, although powerful, usually execute slowly; it may take many minutes to obtain synthesized results. Lee et al. [57] proposed to rapidly infer the simplest regular expression over a binary alphabet from a set of positive and negative examples, which students can then use interactively to assist them in studying and understanding regular expressions.

Another class of regular expression learning algorithms follows Gold′s framework of learning (identification) in the limit [58] , which can be explained as follows. Let Γ be a subclass of regular expressions. Γ is said to be learnable or identifiable if there is an algorithm ρ mapping sets of example strings to expressions in Γ such that 1) S is a subset of the language of ρ(S) for every example set S, and 2) to every regular expression R of Γ, we can associate a so-called characteristic sample Sc such that, for each example set S with Sc ⊆ S, ρ(S) is equivalent to R. Intuitively, the first condition says that algorithm ρ must be sound; the second condition says that it must be complete, i.e., ρ should converge once the sample contains enough data.
It was shown by Gold [58] that the class of all regular expressions is not learnable from positive data alone. Therefore, researchers have turned to identifying subclasses of regular expressions that are learnable, such as single occurrence regular expressions and chain regular expressions [59,60] , and simple looping regular expressions [61] . Learning such special expressions usually involves two steps: first, an automaton is constructed from the given example strings; then, a target regular expression is derived from the automaton. We next take the learning of single occurrence and chain regular expressions as an example to explain the two steps.
A regular expression is called single occurrence if every alphabet symbol occurs at most once in it. If a single occurrence regular expression is of the form f1···fn (n > 1) where each fi is a chain factor, i.e., an expression of the form (a1|···|ak), (a1|···|ak)?, (a1|···|ak)+, or (a1|···|ak)+? with k ≥ 1 and every ai an alphabet symbol, then the expression is called a chain regular expression. For example, "(a|b)+?c?(d|e|f)+" is a chain regular expression while "(ab+|c)?(d?|e|f+)+" is not. Given a set of examples, the learning algorithm first constructs a single occurrence automaton using the classical 2T-INF algorithm [59] . A single occurrence automaton is a specific kind of DFA in which no edges are labeled and all states, except for the initial and final states, are symbol names. Fig. 2 shows the single occurrence automaton constructed from the examples {aab, cdd}. (Fig. 2: Example of a single occurrence automaton.) The learning algorithm then transforms the automaton into a single occurrence regular expression using a set of graph-based rewriting rules; if the target expression is a chain regular expression, the transformation adopts a different strategy. For example, the single occurrence regular expression derived from the automaton in Fig. 2 is "(a+b)|(cd+)", while the chain regular expression derived from this automaton is "a+?c?b?d+?".
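The first step, constructing the single occurrence automaton, can be sketched as follows. This is a simplified reading of the 2T-INF construction in which the automaton is represented as a plain edge set over symbol-named states:

```python
def build_soa(examples):
    """Build a single occurrence automaton in the 2T-INF style: states are
    the alphabet symbols plus Initial/Final, and an edge (u, v) is added
    whenever symbol v directly follows u in some example string."""
    edges = set()
    for word in examples:
        prev = "Initial"
        for symbol in word:
            edges.add((prev, symbol))
            prev = symbol
        edges.add((prev, "Final"))
    return edges

# The automaton of Fig. 2, built from the examples {aab, cdd}
print(sorted(build_soa(["aab", "cdd"])))
```

The second step, rewriting this edge set into a regular expression, applies the graph-based rules of [59,60] and is omitted here.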
The algorithms following Gold′s learning model can be theoretically proved sound and complete. Since the regular expressions learned are restricted, such subclass learning algorithms are usually used in specific applications, such as the automatic inference of XML schemas.

Learning from specifications
Some researchers observed that although writing regular expressions is time-consuming and error-prone, it is often much easier for users to specify their tasks or requirements in natural language. Automatically learning regular expressions from natural language specifications is therefore useful for reducing errors caused by incorrectly written regular expressions. Approaches have recently been proposed for the automatic generation of regular expressions from specifications, either by training a probabilistic parsing model on natural language sentences and then generating regular expressions from the model [62] , or by using a sequence-to-sequence learning model to directly translate natural language sentences into regular expressions [63,64] . For example, Locascio et al. [63] created a corpus of regular expression and natural language pairs using grammar rules such as "[0-9] → a number", "(x)* → x zero or more times", ".*x → ends with x" and ".*x.* → contains x". They then trained a long short-term memory (LSTM) model on this corpus to translate natural language descriptions into regular expressions. However, as pointed out in [65], these approaches use only synthetic data in their training and validation/test datasets, and they may not be effective in handling real-world situations.
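To make the grammar-rule idea concrete, the following sketch hard-codes a few translation rules as pattern/template pairs mirroring the rules quoted above. Real systems such as [63] learn the translation with an LSTM rather than using a fixed rule table, so this is only illustrative:

```python
import re

# Hypothetical rule table: natural language pattern -> regex template,
# mirroring the grammar rules of [63] quoted in the text.
RULES = [
    (r"^a number$", "[0-9]"),
    (r"^(.+) zero or more times$", r"(\1)*"),
    (r"^ends with (.+)$", r".*\1"),
    (r"^contains (.+)$", r".*\1.*"),
]

def nl_to_regex(phrase):
    """Translate a phrase with the first matching rule, substituting the
    captured fragment into the regex template."""
    for nl_pattern, template in RULES:
        m = re.match(nl_pattern, phrase)
        if m:
            return m.expand(template)
    raise ValueError(f"no rule for: {phrase}")

print(nl_to_regex("contains x"))          # → .*x.*
print(nl_to_regex("x zero or more times"))  # → (x)*
```

A rule table like this breaks down as soon as phrases are nested or paraphrased, which is precisely why [63,64] resort to learned sequence-to-sequence models.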
Taking into account that natural language specifications alone are often ambiguous and that examples in isolation often do not suffice to convey user intent, Chen et al. [66] proposed a multi-modal synthesis technique that creates regular expressions from a combination of examples (both positive and negative) and natural language specifications. The implemented tool produces the top-k results that satisfy the examples and the natural language description, but it is still up to the user to check the results and provide more information if needed.

Syntax checking
Formal methods and verification are important for improving software quality. However, most static analysis tools consider only program code and do not check regular expressions. Moreover, compilers usually treat regular expressions in the source program as literal strings, so syntactic errors in regular expressions surface only as exceptions thrown at run time. Spishak et al. [10] designed a system that validates regular expression syntax at compile time. In addition, the system checks for uses of incorrect capturing group numbers, which cause exceptions at run time. For example, the regular expression "(a*)([0-9]+)" contains two groups delimited by parentheses; if a program accesses the result matched by a third group, an error occurs because there is no group three in this regular expression.
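In Python, for example, such a group-number error only surfaces at run time, which is exactly the situation compile-time checking aims to prevent:

```python
import re

# The pattern has exactly two capturing groups.
match = re.fullmatch(r"(a*)([0-9]+)", "aa42")
print(match.group(1), match.group(2))  # groups 1 and 2 exist

try:
    match.group(3)  # there is no group 3 in the pattern
except IndexError as err:
    print("error:", err)  # → error: no such group
```

A static checker in the style of [10] would reject the `group(3)` access before the program ever runs.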

Semantic checking
Some researchers [67] focused on regular expressions used to specify the structure of elements and attributes in XML documents. The type systems designed for such regular expressions aim at detecting a program′s semantic errors, for instance via subtype checking [68,69] . Subtype checking means that, given a function whose input and output are specified by regular expressions R1 and R2, respectively, the type system statically verifies that for any input matching R1, the output of the function always matches R2. This check can detect type-related errors in the function; if it fails, it may also indicate that the regular expressions specifying the input and output types are not correctly defined.
Automatic checking of regular expressions (ACRE) [70] attempts to statically detect semantic errors in regular expressions used mainly for string pattern matching. It performs eleven checks on regular expressions based on common mistakes in developing them. For example, it checks for invalid ranges within a character set, such as [A-z]; it is more likely that the intended range was [A-Z] or [a-z]. As another example, it checks whether braces are balanced in the matching strings, which helps to detect cases where strings with unbalanced braces are incorrectly accepted. Consider the syntactically correct expression "[{\(]?[0-9]{3}[\)}]?": the tool reports that the braces-unbalanced string "(000}" is accepted, which might not be expected.
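A simplified version of the invalid-range check can be written in a few lines. The real ACRE tool performs eleven different checks; this sketch covers only ranges like [A-z] inside character classes that cross the gap between 'Z' and 'a' (which contains punctuation characters such as backslash and caret):

```python
import re

def check_class_ranges(pattern):
    """Report ranges such as A-z inside [...] that span the punctuation
    gap between 'Z' and 'a' (a common typo for A-Z or a-z)."""
    warnings = []
    for char_class in re.findall(r"\[([^\]]*)\]", pattern):
        for lo, hi in re.findall(r"(\w)-(\w)", char_class):
            if lo.isupper() and hi.islower():
                warnings.append(f"[{lo}-{hi}] likely should be [A-Z] or [a-z]")
    return warnings

print(check_class_ranges(r"[A-z0-9]+"))
# → ['[A-z] likely should be [A-Z] or [a-z]']
```

A production-quality checker would of course parse the pattern properly (escapes, nested classes, character-class shorthands) rather than scanning it with another regex.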

Verification
A verification framework for regular expressions is proposed in [71]. Rather than adopting testing techniques to validate regular expressions, the framework allows users to express their requirements in natural language, which are then compiled into a kind of domain-specific formal specification. It then checks the consistency between the formal specification and the regular expression under verification using equivalence checking.
We use the motivating example from [71] to briefly explain how the verification framework works. Suppose the requirement of the regular expression to be verified is informally stated in natural language: a valid string should contain a sub-string X such that before X there are only c′s, after X there are only d′s, in X, b′s and a′s alternate, and the first symbol of X is a and the last symbol is b. This requirement description is then translated into the following formal specification:

    my@spec = (
        let X ← [a|b]*,          // in X, b′s and a′s alternate
            X ← (?= a.*b) in e;  // in X, begin in a end in b
        let e ← (?= c*X.*),      // there are only c′s before X
            e ← (?= .*Xd*) in S; // there are only d′s after X
    );

Finally, the framework converts this formal specification into an automaton and performs equivalence checking between the specification automaton and the automaton converted from the target regular expression. If they are equivalent, the target regular expression is considered correct; otherwise, the framework reports "failed" together with counterexamples. For instance, if the verified expression is "c*(ab)*d*" for the above example, the framework reports "failed" with the counterexample "caabd", which matches the specification but not the target regular expression. In this way, incorrect regular expressions are detected. However, as pointed out in [71], the translation of formal specifications from natural language descriptions is not always accurate and still requires user interaction.
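Equivalence checking on automata is more involved, but its effect can be imitated with a bounded brute-force search for counterexamples. The specification regex in the example below is a hypothetical stand-in, not the specification language of [71]:

```python
import itertools
import re

def find_counterexample(spec, candidate, alphabet, max_len):
    """Search for a string on which `spec` and `candidate` disagree;
    return None if they agree on all strings up to max_len."""
    r_spec, r_cand = re.compile(spec), re.compile(candidate)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r_spec.fullmatch(s)) != bool(r_cand.fullmatch(s)):
                return s
    return None

# Hypothetical specification requiring at least one "ab" block,
# checked against a candidate that also accepts zero blocks.
print(repr(find_counterexample(r"c*(ab)+d*", r"c*(ab)*d*", "abcd", 3)))
```

Here the search immediately returns the empty string, which the candidate accepts but the specification rejects; automaton-based equivalence checking, as used in [71], reaches the same verdict without enumerating strings.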

Visualization and abstraction
It is commonly agreed that regular expressions, due to their complexity and compactness, pose large challenges to composition and comprehension for developers. Visual representations or explanations of the underlying structures of regular expressions can improve their readability and understandability, and thus their quality. Many efforts have been put into this direction.

Highlighting
Highlighting the syntax of regular expressions is a traditional form of visualization supported in almost all text editors. Beyond this, some tools [72−74] provide debugging environments that explain string matching results by highlighting the parts of a regular expression that match a given string, or the parts of a string matched by the regular expression. This helps to check whether the matching results are as expected.
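Such match highlighting can be imitated in a few lines; this sketch marks matched spans with textual delimiters instead of colors:

```python
import re

def highlight_matches(pattern, text):
    """Mark every non-overlapping match of `pattern` in `text` with
    >> << delimiters, mimicking a debugger's match highlighting."""
    out, last = [], 0
    for m in re.finditer(pattern, text):
        out.append(text[last:m.start()])
        out.append(f">>{m.group(0)}<<")
        last = m.end()
    out.append(text[last:])
    return "".join(out)

print(highlight_matches(r"[0-9]+", "order 42, item 7"))
# → order >>42<<, item >>7<<
```

Seeing exactly which spans are consumed makes it immediately apparent when, say, a quantifier matches more or less of the input than intended.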

Graphical representation
A regular expression can be transformed into an equivalent automaton, which in turn can be visualized as a diagram. Several tools [75−79] , such as RegExpert [75] and RegExper [76] , visualize regular expressions as their corresponding automaton-like graph representations. There are also tools [74,80] that provide additional tree views showing the hierarchical structure of the regular expression components, enabling developers to easily understand and track the structural and functional relationships among the sub-expressions contained in the regular expression. Oflazer and Yılmaz [81] described a visual interface and development environment that allows users to construct complex regular expressions via a drag-and-drop visual interface. Beck et al. [82] proposed to show the structures of regular expressions by augmenting the original textual representations with visual elements, instead of using automaton-like graph representations or tree views. Their tool highlights the structure of a regular expression through horizontal lines at the top and bottom of the expression and distinguishes special-purpose tokens by color; for example, it highlights repetition operators (*, ?, {m,n}, etc.) in yellow and the union operator (|) in blue. Groups in the regular expression are highlighted through horizontal lines attached to the bottom of the expression, and the stacking of these lines reflects the nesting hierarchy of groups. Feedback from users shows that such a visualization approach is "intuitive", "clear", "self-explaining", "easy to understand", "helpful" and "useful".
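For a quick look at this hierarchical structure without any external tool, CPython's re module can dump the parse tree of a pattern when it is compiled with the re.DEBUG flag:

```python
import re

# Compiling with re.DEBUG prints the pattern's parse tree to stdout,
# exposing the nesting of groups, branches, and repetitions.
pattern = re.compile(r"(a|b)+c", re.DEBUG)
print(bool(pattern.fullmatch("abac")))  # → True
```

The printed tree shows the repetition wrapping the capturing group, the alternation inside it, and the trailing literal, which is essentially the information the tree-view tools above render graphically.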

Abstraction
Erwig and Gopinath [12] established an abstraction mechanism that serves as an explanation of regular expressions. They identify and abstract the common sub-expressions or components occurring in a regular expression, introduce names for those common sub-expressions, and finally obtain a representation that directly reveals the overall structure of the original regular expression. The paper [12] illustrates this mechanism with a worked example regular expression.
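A hypothetical sketch in this spirit (not the example from [12]): naming sub-expressions and composing them bottom-up makes the overall structure of an otherwise opaque pattern explicit:

```python
import re

# Named building blocks, composed bottom-up; the IP-address pattern is a
# hypothetical illustration, not the example used in [12].
digit  = r"[0-9]"
octet  = rf"(?:25[0-5]|2[0-4]{digit}|1{digit}{{2}}|[1-9]?{digit})"
ipaddr = rf"{octet}(?:\.{octet}){{3}}"

print(bool(re.fullmatch(ipaddr, "192.168.0.1")))  # → True
print(bool(re.fullmatch(ipaddr, "999.1.1.1")))    # → False
```

The flat, expanded form of `ipaddr` is nearly unreadable; the named decomposition states directly that an address is four octets separated by dots, which is exactly the kind of structural explanation the abstraction mechanism aims for.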

Repairing
Repair of regular expressions is usually done by synthesis from examples: if a regular expression is incorrect, a new set of examples is provided, from which a correct expression consistent with the given examples is synthesized or learned. Li et al. [83] aimed at repairing regular expressions that define languages larger than intended; to repair a faulty regular expression, a new set of negative examples is required, and the goal is to modify the original expression so that it rejects the new examples. Conversely, Rebele [84] focused on repairing regular expressions that define languages smaller than intended; a new set of positive examples is provided, and the goal is to modify the original expression so that it accepts the new examples. The repair processes of these two works use a set of heuristics to transform the initial regular expression into a modified one that rejects/accepts the new examples. The transformation rules include removing a disjunct from a union, for example "a|b|c" to "a|c", or restricting the range of a repetition operator, for example "a{1,3}" to "a{1,2}". However, these repairs provide no minimality guarantees and may produce regular expressions that are very different from the original ones.
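The "remove a disjunct" rule can be sketched as a naive search over example-consistent candidates. Unlike the heuristics of [83,84], this toy version handles only flat, top-level unions:

```python
import re

def repair_by_dropping_disjuncts(pattern, positives, negatives):
    """Try removing one top-level disjunct at a time (e.g. a|b|c -> a|c)
    and return the first variant consistent with all examples."""
    disjuncts = pattern.split("|")  # naive: assumes a flat union
    for i in range(len(disjuncts)):
        candidate = "|".join(disjuncts[:i] + disjuncts[i+1:])
        if not candidate:
            continue
        r = re.compile(candidate)
        if (all(r.fullmatch(s) for s in positives)
                and not any(r.fullmatch(s) for s in negatives)):
            return candidate
    return None

print(repair_by_dropping_disjuncts("a|b|c", ["a", "c"], ["b"]))
# → a|c
```

Even this tiny example shows why minimality is hard to guarantee: several distinct removals may satisfy the examples, and nothing forces the chosen one to be the closest to the author's intent.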
Pan et al. [85] repaired regular expressions with both positive and negative examples, and their approach guarantees finding the syntactically smallest repair of the original regular expression. Here, "smallest" is measured by the edit distance between the abstract syntax trees of the initial and target regular expressions. The repair algorithm first generates a set of initial templates based on the initial regular expression and the given positive and negative examples. It then processes these templates, discarding those that cannot lead to a correct repair and generating new ones from the current templates, until an optimal regular expression is obtained. Some works have also studied how to repair ReDoS-vulnerable regular expressions [86,87] .
Arcaini et al. [88] devised an evolutionary approach to testing and repairing regular expressions. The approach starts from an initial guess of the regular expression; it then repeatedly generates meaningful strings, checks whether they are accepted as expected, and repairs the candidate expression to stay consistent with the observations. Cochran et al. [89] proposed a genetic programming approach to repairing regular expressions. It applies genetic programming operators to the DFA representation of the regular expression, and the resulting DFA is then converted back to a regular expression. This conversion from DFAs can yield expressions that are completely different from the original ones.
Arcaini et al. [90] presented an iterative mutation-based process for testing and repairing regular expressions. For a regular expression, the approach generates a set of strings that distinguishes the regular expression from its mutants and asks the users to assess the correct evaluation of these strings. If a mutant evaluates these strings more correctly than the original regular expression, the faulty expression is replaced by this mutant. The process iterates until no mutant better than the current expression is found. However, the repair process requires substantial user effort, as users are frequently asked to assess the correctness of string evaluations or to judge whether a mutant is better than the original expression.
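The core idea, generating strings that distinguish a regular expression from its mutants, can be sketched with two tiny mutation operators and a bounded search; [90] of course uses a far richer mutant set:

```python
import itertools
import re

def mutants(pattern):
    """Yield a few simple syntactic mutants: swap * and + quantifiers."""
    for i, ch in enumerate(pattern):
        if ch == "*":
            yield pattern[:i] + "+" + pattern[i+1:]
        if ch == "+":
            yield pattern[:i] + "*" + pattern[i+1:]

def distinguishing_string(p1, p2, alphabet, max_len=4):
    """Find a short string accepted by exactly one of the two patterns."""
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return s
    return None

original = "a*b"
for m in mutants(original):
    print(m, repr(distinguishing_string(original, m, "ab")))
# → a+b 'b'
```

The user would then be asked whether "b" should be accepted; if it should not, the mutant "a+b" evaluates the examples more correctly and replaces the original expression.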

Conclusions
Regular expressions are widely used in different fields within and even outside of computer science. Ensuring the correctness of regular expressions is a vital prerequisite for their usage in practical applications. In recent years, efforts have been made to assist users or programmers in writing correct regular expressions or to validate the correctness of regular expressions. In this paper, we have conducted a review of this topic. In particular, we have classified existing relevant works into six categories: 1) empirical studies on various problems in the development of regular expressions, 2) test string generation from regular expressions, 3) automatic learning or synthesis of regular expressions from example strings or specifications, 4) static checking of the syntax or specific semantic errors in regular expressions or verification of regular expressions against specifications, 5) visual representations or explanations of the underlying structures of regular expressions, and 6) repair of faulty regular expressions. For each category, we have reviewed different approaches and discussed their advantages and disadvantages. Table 2 provides an overview of our classification.
The importance of ensuring the correctness of regular expressions is just beginning to be addressed, and the current research progress on this topic is far from sufficient. There are still research problems waiting for new solutions and tools, and we list some in the following: 1) Generation of negative strings. Generating test strings is a common and effective way to discover errors in regular expressions. Empirical studies show that a majority of faulty regular expressions define languages that are too constrained, i.e., they reject valid strings. Such faults cannot be detected by positive strings generated from the incorrect expressions. Therefore, generating meaningful negative strings outside the language of a regular expression is an important problem to study.
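One simple way to obtain meaningful negative strings is to apply small edits to known positive strings and keep only the edits the expression rejects; this near-miss sketch is merely one of many possible strategies:

```python
import re

def near_miss_negatives(pattern, positives, insert_chars="abc"):
    """Derive negative test strings by single-character edits (deletions
    and insertions) of known positive strings, keeping only the edited
    strings that the pattern rejects."""
    compiled = re.compile(pattern)
    negatives = set()
    for s in positives:
        for i in range(len(s) + 1):
            candidates = [s[:i] + s[i+1:]]                        # deletion
            candidates += [s[:i] + c + s[i:] for c in insert_chars]  # insertion
            for t in candidates:
                if not compiled.fullmatch(t):
                    negatives.add(t)
    return negatives

print(sorted(near_miss_negatives(r"[0-9]+", ["42"])))
```

Because each negative string is one edit away from an accepted one, these strings probe the boundary of the defined language, which is where over-constrained expressions are most likely to misbehave.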
2) Refactoring of regular expressions. The lack of readability is usually a pain point for composing and reusing high-quality regular expressions. Thus, refactoring transformations are needed to enhance the readability or comprehension of regular expressions. For example, \d is semantically equivalent to [0123456789] and [0-9]. While \d is more succinct, [0-9] may be easier to read.
3) Fault localization and automatic repair. Fault detection and diagnosis is, in general, a challenging problem [91] . Even when a regular expression is detected to be incorrectly defined, locating the faulty parts and repairing them is more difficult still, as practical expressions are usually large and structurally complex. Existing work on fault localization and automatic repair of regular expressions is relatively scarce, and more attention needs to be paid to these issues.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article′s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article′s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.