POSIX Lexing with Derivatives of Regular Expressions

Brzozowski introduced the notion of derivatives for regular expressions. They can be used for a very simple regular expression matching algorithm. Sulzmann and Lu cleverly extended this algorithm in order to deal with POSIX matching, which is the underlying disambiguation strategy for regular expressions needed in lexers. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. In this paper we give our inductive definition of what a POSIX value is and show that Sulzmann and Lu’s algorithm always generates such a value. We also show that our inductive definition of a POSIX value is equivalent to an alternative definition by Okui and Suzuki which identifies POSIX values as least elements according to an ordering of values.


Introduction
Brzozowski [4] introduced the notion of the derivative r\c of a regular expression r w.r.t. a character c, and showed that it gives a simple solution to the problem of matching a string s with a regular expression r: if the derivative of r w.r.t. (in succession) all the characters of the string matches the empty string, then r matches s (and vice versa). The derivative has the property (which may almost be regarded as its specification) that, for every string s, regular expression r and character c, one has cs ∈ L(r) if and only if s ∈ L(r\c). The beauty of Brzozowski's derivatives is that they are neatly expressible in any functional programming language, and easily definable and reasoned about in theorem provers: the definitions just consist of inductive datatypes and simple recursive functions. A mechanised correctness proof of Brzozowski's matcher in, for example, HOL4 has been mentioned by Owens and Slind [18]. Another one in Isabelle/HOL is part of the work by Krauss and Nipkow [11]. Yet another one in Coq is given by Coquand and Siles [5], and Ribeiro and Du Bois give one in Agda [20].
If a regular expression matches a string, then in general there is more than one way in which the string is matched. There are two commonly used disambiguation strategies to generate a unique answer: one is called GREEDY matching [8] and the other is POSIX matching [12,16,21,23,24]. For example consider the string xy and the regular expression (x + y + xy)*. Either the string can be matched in two 'iterations' by the single-letter regular expressions x and y, or directly in one iteration by xy. The first case corresponds to GREEDY matching, which first matches with the left-most symbol and only matches the next symbol in case of a mismatch (this is greedy in the sense of preferring instant gratification to delayed repletion). The second case is POSIX matching, which prefers the longest match.
In the context of lexing, where an input string needs to be split up into a sequence of tokens, POSIX is the more natural disambiguation strategy for what programmers consider basic syntactic building blocks in their programs. These building blocks are often specified by some regular expressions, say r key and r id for recognising keywords and identifiers, respectively. There are a few underlying (informal) rules behind tokenising a string in a POSIX [23] fashion:

• The Longest Match Rule (or "Maximal Munch Rule"): The longest initial substring matched by any regular expression is taken as the next token.
• The Priority Rule: For a particular longest initial substring, the first (leftmost) regular expression that can match determines the token.
• The Star Rule: A subexpression repeated by * shall not match an empty string unless this is the only match for the repetition.
• The Empty String Rule: An empty string shall be considered to be longer than no match at all.
Consider for example the regular expression r key for recognising keywords such as if, then, while and so on; and r id for recognising identifiers (say, a single character followed by characters or numbers). Then we can form the regular expression (r key + r id)* and use POSIX matching to tokenise strings, say iffoo and if. For iffoo we obtain by the Longest Match Rule a single identifier token, not a keyword followed by an identifier. For if we obtain by the Priority Rule a keyword token, not an identifier token, even though r id matches also. By the Star Rule we know (r key + r id)* matches iffoo, respectively if, in exactly one 'iteration' of the star. The Empty String Rule is for cases where, for example, the regular expression (a*)* matches against the string bc. Then the longest initial matched substring is the empty string, which is matched by both the whole regular expression and the parenthesised subexpression.
One limitation of Brzozowski's matcher is that it only generates a YES/NO answer for whether a string is being matched by a regular expression. Sulzmann and Lu [21] extended this matcher to allow generation not just of a YES/NO answer but of an actual matching, called a lexical value. Assuming a regular expression matches a string, values encode the information of how the string is matched by the regular expression, that is, which part of the string is matched by which part of the regular expression. For this consider again the string xy and the regular expression (x + (y + xy))* (this time fully parenthesised). We can view this regular expression as a tree and if the string xy is matched by two Star 'iterations', then the value is

  Stars [Left (Char x), Right (Left (Char y))]

but if it is matched directly in a single 'iteration', then the value is

  Stars [Right (Right (Seq (Char x) (Char y)))]

where Stars has only a single-element list for the single iteration and Seq indicates that xy is matched by a sequence regular expression. This 'tree view' leads naturally to the idea that regular expressions act as types and values as inhabiting those types (see, for example, [10,14]). (Footnote 1: POSIX matching acquired its name from the fact that the corresponding rules were described as part of the POSIX specification for Unix-like operating systems [23].)
Sulzmann and Lu give a simple algorithm to calculate a value that appears to be the value associated with POSIX matching. The challenge then is to specify that value, in an algorithm-independent fashion, and to show that Sulzmann and Lu's derivative-based algorithm does indeed calculate a value that is correct according to the specification. The answer given by Sulzmann and Lu [21] is to define a relation (called an "order relation") on the set of values of r, and to show that (once a string to be matched is chosen) there is a maximum element and that it is computed by their derivative-based algorithm. This proof idea is inspired by work of Frisch and Cardelli [8] on a GREEDY regular expression matching algorithm. However, we were not able to establish transitivity and totality for the "order relation" by Sulzmann and Lu. There are some inherent problems with their approach (of which some of the proofs are not published in [21]); perhaps more importantly, we give in this paper a simple inductive (and algorithm-independent) definition of what we call being a POSIX value for a regular expression r and a string s; we show that the algorithm by Sulzmann and Lu computes such a value and that such a value is unique. Our proofs are both done by hand and checked in Isabelle/HOL. The experience of doing our proofs has been that this mechanical checking was absolutely essential: this subject area has hidden snares. This was also noted by Kuklewicz [12] who found that nearly all POSIX matching implementations are "buggy" [21, p. 203] and by Grathwohl et al. [9, p. 36] who wrote: "The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions."

Contributions:
We have implemented in Isabelle/HOL the derivative-based regular expression matching algorithm of Sulzmann and Lu [21]. We have proved the correctness of this algorithm according to our specification of what a POSIX value is (inspired by work of Vansummeren [24]). Sulzmann and Lu sketch in [21] an informal correctness proof: but to us it contains unfillable gaps. Our specification of a POSIX value consists of a simple inductive definition that, given a string and a regular expression, uniquely determines this value. We also show that our definition is equivalent to an ordering of values based on positions by Okui and Suzuki [16].

Preliminaries
Strings in Isabelle/HOL are lists of characters with the empty string being represented by the empty list, written [], and list-cons being written as ::. Often we use the usual bracket notation for lists also for strings; for example a string consisting of just a single character c is written [c]. We use the usual definitions for prefixes and strict prefixes of strings. By using the type char for characters we have a supply of finitely many characters roughly corresponding to the ASCII character set. Regular expressions are defined as usual as the elements of the following inductive datatype:

  r ::= 0 | 1 | c | cs | r 1 + r 2 | r 1 • r 2 | r*

where 0 stands for the regular expression that does not match any string, 1 for the regular expression that matches only the empty string and c for matching a character literal. We use + and • for alternative and sequence regular expressions, respectively. We add here to the usual regular expressions also a regular expression for character sets, written cs where cs is a set of characters. Such character sets can of course be represented by using alternatives (and 0 if the set is empty) and therefore do not add anything new in terms of recognised languages. We include them here because later on they will show that a certain generality in our definitions is required once such simple regular expressions are added.
The language of a regular expression is defined as usual by the recursive function L with the seven clauses:

  (1) L(0) ≡ {}
  (2) L(1) ≡ {[]}
  (3) L(c) ≡ {[c]}
  (4) L(r 1 • r 2) ≡ L(r 1) @ L(r 2)
  (5) L(r 1 + r 2) ≡ L(r 1) ∪ L(r 2)
  (6) L(r*) ≡ (L(r))*
  (7) L(cs) ≡ {[c] | c ∈ cs}

In clause (4) we use the operation @ for the concatenation of two languages (it is also list-append for strings). We use the star-notation for regular expressions and for languages (in clause (6) above). The star for languages is defined inductively by two clauses: (i) the empty string is in the star of a language and (ii) if s 1 is in a language and s 2 in the star of this language, then also s 1 @ s 2 is in the star of this language. It will also be convenient to use the following notion of a semantic derivative (or left quotient) of a language, defined as

  Der c A ≡ {s | c :: s ∈ A}

For semantic derivatives we have the following equations (for example mechanically proved in [11]):

  Der c {} = {}
  Der c {[]} = {}
  Der c {[d]} = if c = d then {[]} else {}
  Der c (A ∪ B) = Der c A ∪ Der c B                                          (1)
  Der c (A @ B) = (Der c A) @ B ∪ (if [] ∈ A then Der c B else {})
  Der c (A*) = (Der c A) @ A*

Brzozowski's derivatives of regular expressions [4] can be easily defined by two recursive functions: the first is from regular expressions to booleans (implementing a test when a regular expression can match the empty string), and the second takes a regular expression and a character to a (derivative) regular expression:

  nullable (0) ≡ False
  nullable (1) ≡ True
  nullable (c) ≡ False
  nullable (cs) ≡ False
  nullable (r 1 + r 2) ≡ nullable r 1 ∨ nullable r 2
  nullable (r 1 • r 2) ≡ nullable r 1 ∧ nullable r 2
  nullable (r*) ≡ True

  0\c ≡ 0
  1\c ≡ 0
  d\c ≡ if c = d then 1 else 0
  cs\c ≡ if c ∈ cs then 1 else 0
  (r 1 + r 2)\c ≡ (r 1\c) + (r 2\c)
  (r 1 • r 2)\c ≡ if nullable r 1 then (r 1\c) • r 2 + (r 2\c) else (r 1\c) • r 2
  (r*)\c ≡ (r\c) • r*

We may extend this definition to give derivatives w.r.t. strings:

  r\[] ≡ r
  r\(c :: s) ≡ (r\c)\s

Given the equations in (1), it is a relatively easy exercise in mechanical reasoning to establish that

Proposition 1  (1) nullable r if and only if [] ∈ L(r), and (2) L(r\c) = Der c (L(r)).
With this in place it is also very routine to prove that the regular expression matcher defined as

  match r s ≡ nullable (r\s)

gives a positive answer if and only if s ∈ L(r). Consequently, this regular expression matching algorithm satisfies the usual specification for regular expression matching. While the matcher above calculates a provably correct YES/NO answer for whether a regular expression matches a string or not, the novel idea of Sulzmann and Lu [21] is to append another phase to this algorithm in order to calculate a lexical value. We will explain the details next.
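Brzozowski's matcher is indeed only a few lines in any functional language. The following is a minimal Python sketch of nullable, the derivative and the matcher (our own tuple encoding of regular expressions, not the Isabelle/HOL formalisation used in this paper; character sets are omitted):

```python
# Regular expressions as tuples: ("zero",), ("one",), ("chr", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r).

def nullable(r):
    """Test whether r can match the empty string."""
    tag = r[0]
    if tag == "zero": return False
    if tag == "one": return True
    if tag == "chr": return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    if tag == "seq": return nullable(r[1]) and nullable(r[2])
    if tag == "star": return True

def der(c, r):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ("zero", "one"): return ("zero",)
    if tag == "chr": return ("one",) if r[1] == c else ("zero",)
    if tag == "alt": return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        d = ("seq", der(c, r[1]), r[2])
        return ("alt", d, der(c, r[2])) if nullable(r[1]) else d
    if tag == "star": return ("seq", der(c, r[1]), r)

def ders(s, r):
    """Derivative w.r.t. a string: fold der over the characters."""
    for c in s:
        r = der(c, r)
    return r

def matcher(r, s):
    """r matches s iff the derivative w.r.t. s is nullable."""
    return nullable(ders(s, r))
```

For instance, for the regular expression (x + y + xy)* from the Introduction, matcher returns True on the string xy.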

POSIX Regular Expression Matching
There have been many previous works that use values for encoding how a regular expression matches a string. The clever idea by Sulzmann and Lu [21] is to define a function on values that mirrors (but inverts) the construction of the derivative on regular expressions. Values are defined as the inductive datatype

  v ::= Empty | Char c | Seq v 1 v 2 | Left v | Right v | Stars vs

where we use vs to stand for a list of values. (This is similar to the approach taken by Frisch and Cardelli for GREEDY matching [8], and Sulzmann and Lu for POSIX matching [21].) The string underlying a value can be calculated by the flat function, written |_| and defined as:

  |Empty| ≡ []
  |Char c| ≡ [c]
  |Left v| ≡ |v|
  |Right v| ≡ |v|
  |Seq v 1 v 2| ≡ |v 1| @ |v 2|
  |Stars vs| ≡ |vs|

We will sometimes refer to the underlying string of a value as the flattened value. We also overload our notation and use |vs| for flattening a list of values and concatenating the resulting strings.
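To illustrate, here is a small Python sketch of the values datatype and the flat function (again our own tuple encoding, not the paper's Isabelle/HOL code):

```python
# Values as tuples: ("Empty",), ("Char", c), ("Seq", v1, v2),
# ("Left", v), ("Right", v), ("Stars", [v1, ..., vn]).

def flat(v):
    """The string underlying a value."""
    tag = v[0]
    if tag == "Empty": return ""
    if tag == "Char": return v[1]
    if tag == "Seq": return flat(v[1]) + flat(v[2])
    if tag in ("Left", "Right"): return flat(v[1])
    if tag == "Stars": return "".join(flat(u) for u in v[1])
```

For example, the value Stars [Right (Right (Seq (Char x) (Char y)))] from the Introduction flattens to the string xy.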
Sulzmann and Lu follow Nielsen and Henglein and define inductively an inhabitation relation that associates values to regular expressions (see [14,21]). We define this relation as follows:

  ⊢ Empty : 1
  ⊢ Char c : c
  c ∈ cs  ⟹  ⊢ Char c : cs
  ⊢ v 1 : r 1  ⟹  ⊢ Left v 1 : r 1 + r 2
  ⊢ v 2 : r 2  ⟹  ⊢ Right v 2 : r 1 + r 2
  ⊢ v 1 : r 1 and ⊢ v 2 : r 2  ⟹  ⊢ Seq v 1 v 2 : r 1 • r 2
  ∀v ∈ vs. ⊢ v : r and |v| ≠ []  ⟹  ⊢ Stars vs : r*

where in the clause for Stars we use the notation v ∈ vs for indicating that v is a member in the list vs. We require in this rule that every value in vs flattens to a non-empty string. The idea is that Stars-values satisfy the informal Star Rule (see Introduction) where the * does not match the empty string unless this is the only match for the repetition. Note also that no values are associated with the regular expression 0 (since it does not match any string), and that the only value associated with the regular expression 1 is Empty. We use the Char-value for both single character regular expressions and character sets. It is routine to establish how values "inhabiting" a regular expression correspond to the language of a regular expression, namely

Proposition 2  L(r) = {|v| | ⊢ v : r}

Given a regular expression r and a string s, we define the set of all Lexical Values inhabited by r with the underlying string being s:

LV r s ≡ {v | ⊢ v : r ∧ |v| = s}
The main property of LV r s is that it is always finite.

Proposition 3 finite (LV r s)
This finiteness property does not hold in general if we remove the side-condition about |v| ≠ [] in the Stars-rule above. For example using Sulzmann and Lu's less restrictive definition, LV (1*) [] would contain infinitely many values, but according to our more restricted definition only a single value, namely Stars []. If a regular expression r matches a string s, then generally the set LV r s is not just a singleton set. In case of POSIX matching the problem is to calculate the unique lexical value that satisfies the (informal) POSIX rules from the Introduction. Graphically the POSIX value calculation algorithm by Sulzmann and Lu can be illustrated by the picture in Fig. 1 where the path from the left to the right involving derivatives/nullable is the first phase of the algorithm (calculating successive Brzozowski's derivatives) and mkeps/inj, the path from right to left, the second phase. This picture shows the steps required when a regular expression, say r 1, matches the string [a, b, c]. We first build the three derivatives (according to a, b and c). We then use nullable to find out whether the resulting derivative regular expression r 4 can match the empty string. If yes, we call the function mkeps that produces a value v 4 for how r 4 can match the empty string (taking into account the POSIX constraints in case there are several ways). This function is defined by the clauses:

  mkeps (1) ≡ Empty
  mkeps (r 1 + r 2) ≡ if nullable r 1 then Left (mkeps r 1) else Right (mkeps r 2)
  mkeps (r 1 • r 2) ≡ Seq (mkeps r 1) (mkeps r 2)
  mkeps (r*) ≡ Stars []

Note that this function needs only to be partially defined, namely only for regular expressions that are nullable. In case the nullable test fails, the string [a, b, c] cannot be matched by r 1 and the null value None is returned by the algorithm. Note also how this function makes some subtle choices leading to a POSIX value: for example if an alternative regular expression, say r 1 + r 2, can match the empty string and furthermore r 1 can match the empty string, then we return a Left-value. The Right-value will only be returned if r 1 cannot match the empty string.
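The Left-preference of mkeps can be seen in a short Python sketch (the same assumed tuple encoding as in the earlier sketches; mkeps is only ever called on nullable regular expressions):

```python
def nullable(r):
    """Test whether r can match the empty string."""
    tag = r[0]
    return (tag in ("one", "star")
            or (tag == "alt" and (nullable(r[1]) or nullable(r[2])))
            or (tag == "seq" and nullable(r[1]) and nullable(r[2])))

def mkeps(r):
    """A value witnessing how a nullable r matches the empty string,
    preferring Left over Right as POSIX requires."""
    tag = r[0]
    if tag == "one": return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star": return ("Stars", [])
```

For example, for 1 • 1 + 1 the function returns Left (Seq Empty Empty), whereas for a + 1 the left alternative cannot match the empty string and the result is Right Empty.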
The most interesting idea from Sulzmann and Lu [21] is the construction of a value for how r 1 can match the string [a, b, c] from the value for how the last derivative, r 4 in Fig. 1, can match the empty string. Sulzmann and Lu achieve this by stepwise "injecting back" the characters into the values, thus inverting the operation of building derivatives, but on the level of values. The corresponding function, called inj, takes three arguments: a regular expression, a character and a value. For example in the first (or right-most) inj-step in Fig. 1 these are the regular expression r 3, the character c from the last derivative step and v 4, which is the value corresponding to the derivative regular expression r 4. The result is the new value v 3. The final result of the algorithm is the value v 1. The inj function is defined by recursion on regular expressions and by analysing the shape of values (corresponding to the derivative regular expressions):

  (1) inj d c Empty ≡ Char c
  (2) inj cs c Empty ≡ Char c
  (3) inj (r 1 + r 2) c (Left v 1) ≡ Left (inj r 1 c v 1)
  (4) inj (r 1 + r 2) c (Right v 2) ≡ Right (inj r 2 c v 2)
  (5) inj (r 1 • r 2) c (Seq v 1 v 2) ≡ Seq (inj r 1 c v 1) v 2
  (6) inj (r 1 • r 2) c (Left (Seq v 1 v 2)) ≡ Seq (inj r 1 c v 1) v 2
  (7) inj (r 1 • r 2) c (Right v 2) ≡ Seq (mkeps r 1) (inj r 2 c v 2)
  (8) inj (r*) c (Seq v (Stars vs)) ≡ Stars (inj r c v :: vs)

Fig. 1  The two phases of the algorithm by Sulzmann and Lu [21], matching the string [a, b, c]. The first phase (the arrows from left to right) is Brzozowski's matcher building successive derivatives. If the last regular expression is nullable, then the functions of the second phase are called (the top-down and right-to-left arrows): first mkeps calculates a value v 4 witnessing how the empty string has been recognised by r 4. After that the function inj "injects back" the characters of the string into the values.

To better understand what is going on in this definition it might be instructive to look first at the three sequence cases (clauses (5)-(7)). In each case we need to construct an "injected value" for r 1 • r 2. Because of the 'shape' of the regular expression, this must be a value of the form Seq _ _. Recall the clause of the derivative-function for sequence regular expressions:

  (r 1 • r 2)\c ≡ if nullable r 1 then (r 1\c) • r 2 + (r 2\c) else (r 1\c) • r 2

Consider first the else-branch where the derivative is (r 1\c) • r 2. The corresponding value must therefore be of the form Seq v 1 v 2, which matches the left-hand side in clause (5) of inj. In the if-branch the derivative is an alternative, namely (r 1\c) • r 2 + (r 2\c). This means we either have to consider a Left- or Right-value. In case of the Left-value we know further that it must be a value for a sequence regular expression. Therefore the pattern we match against in clause (6) is Left (Seq v 1 v 2), while in clause (7) it is just Right v 2. One more interesting point is in the right-hand side of clause (7): since in this case the regular expression r 1 does not "contribute" to matching the string, that is, it only matches the empty string, we need to call mkeps in order to construct a value for how r 1 can match this empty string. A similar argument applies for why we can expect in the left-hand side of clause (8) that the value is of the form Seq v (Stars vs): the derivative of a star is (r\c) • r*. Finally, the reason why we can ignore the first argument in clause (1) of inj is that it will only ever be called in cases where c = d, but the usual linearity restrictions in patterns do not allow us to build this constraint explicitly into our function definition.
Similarly, the clause in (2) will only be called in cases where c ∈ cs holds. Notable in this clause, however, is the fact that we cannot ignore the second argument of the injection function (the character that is injected into the value), because otherwise there is no way to determine which character from the character set should be injected. The idea of the inj-function to "inject" a character, say c, into a value can be made precise by the first part of the following lemma, which shows that the underlying string of an injected value has a prepended character c; the second part shows that the underlying string of an mkeps-value is always the empty string (given the regular expression is nullable, since otherwise mkeps might not be defined).

Lemma 4  (1) If ⊢ v : r\c then |inj r c v| = c :: |v|.  (2) If nullable r then |mkeps r| = [].

Proof Both properties are by routine inductions: the first one can, for example, be proved by induction over the definition of derivatives; the second by an induction on r. There are no interesting cases.
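A Python sketch of the injection function may help; the clause numbers in the comments refer to the clause numbering used in the text above (the tuple encoding is our own assumption, with character sets written ("cset", set_of_chars)):

```python
def nullable(r):
    tag = r[0]
    return (tag in ("one", "star")
            or (tag == "alt" and (nullable(r[1]) or nullable(r[2])))
            or (tag == "seq" and nullable(r[1]) and nullable(r[2])))

def mkeps(r):
    """POSIX value for how a nullable r matches the empty string."""
    tag = r[0]
    if tag == "one": return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star": return ("Stars", [])

def inj(r, c, v):
    """Inject character c back into v (a value for r\\c), giving a value for r."""
    tag = r[0]
    if tag in ("chr", "cset"):                       # clauses (1) and (2)
        return ("Char", c)
    if tag == "alt":
        if v[0] == "Left":                           # clause (3)
            return ("Left", inj(r[1], c, v[1]))
        return ("Right", inj(r[2], c, v[1]))         # clause (4)
    if tag == "seq":
        if v[0] == "Seq":                            # clause (5)
            return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left":                           # clause (6): v = Left (Seq v1 v2)
            return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))   # clause (7)
    if tag == "star":                                # clause (8): v = Seq v' (Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])
```

For example, injecting a into the value Seq Empty (Stars []) for the derivative of a* yields Stars [Char a].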
Having defined the mkeps and inj functions we can extend Brzozowski's matcher so that a value is constructed (assuming the regular expression matches the string). The clauses of the Sulzmann and Lu lexer are

  lexer r [] ≡ if nullable r then Some (mkeps r) else None
  lexer r (c :: s) ≡ case lexer (r\c) s of None ⇒ None | Some v ⇒ Some (inj r c v)

If the regular expression does not match the string, None is returned. If the regular expression does match the string, then Some value is returned. One important virtue of this algorithm is that it can be implemented with ease in any functional programming language and also in Isabelle/HOL. In the remaining part of this section we prove that this algorithm is correct.
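Putting the pieces together, the whole lexer fits in well under a hundred lines of Python (a sketch under our assumed tuple encoding, not the authors' Isabelle/HOL formalisation; character sets are omitted, None plays the role of the null value, and a successful match returns the value directly instead of wrapping it in Some):

```python
def nullable(r):
    tag = r[0]
    return (tag in ("one", "star")
            or (tag == "alt" and (nullable(r[1]) or nullable(r[2])))
            or (tag == "seq" and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    """Brzozowski derivative of r w.r.t. character c."""
    tag = r[0]
    if tag in ("zero", "one"): return ("zero",)
    if tag == "chr": return ("one",) if r[1] == c else ("zero",)
    if tag == "alt": return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        d = ("seq", der(c, r[1]), r[2])
        return ("alt", d, der(c, r[2])) if nullable(r[1]) else d
    if tag == "star": return ("seq", der(c, r[1]), r)

def mkeps(r):
    tag = r[0]
    if tag == "one": return ("Empty",)
    if tag == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if tag == "seq": return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star": return ("Stars", [])

def inj(r, c, v):
    tag = r[0]
    if tag == "chr": return ("Char", c)
    if tag == "alt":
        return (("Left", inj(r[1], c, v[1])) if v[0] == "Left"
                else ("Right", inj(r[2], c, v[1])))
    if tag == "seq":
        if v[0] == "Seq": return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left": return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == "star":
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    """None if r does not match s; otherwise the POSIX value for the match."""
    if s == "":
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)
```

On the string aa and the regular expression a* • a*, for example, this sketch returns the POSIX value in which the first star consumes both characters, in accordance with the Longest Match Rule.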
The well-known idea of POSIX matching is informally defined by some rules such as the Longest Match and Priority Rules (see Introduction); as correctly argued in [21], this needs formal specification. Sulzmann and Lu define an "ordering relation" between values and argue that there is a maximum value, as given by the derivative-based algorithm. In contrast, we shall introduce a simple inductive definition that specifies directly what a POSIX value is, incorporating the POSIX-specific choices into the side-conditions of our rules. Our definition is inspired by the matching relation given by Vansummeren [24]. The relation we define is ternary and written as (s, r) → v, relating strings, regular expressions and values; the inductive rules are given in Fig. 2. We can prove that given a string s and regular expression r, the POSIX value v is uniquely determined by (s, r) → v.
Theorem 5  (1) If (s, r) → v then s ∈ L(r) and |v| = s.  (2) If (s, r) → v 1 and (s, r) → v 2 then v 1 = v 2.

Proof Both parts are by induction on the definition of (s, r) → v. The second part follows by a case analysis of (s, r) → v and the first part.
We claim that our (s, r) → v relation captures the idea behind the four informal POSIX rules shown in the Introduction: Consider for example the rules P+L and P+R where the POSIX value for a string and an alternative regular expression, that is (s, r 1 + r 2), is specified: it is always a Left-value, except when the string to be matched is not in the language of r 1; only then is it a Right-value (see the side-condition in P+R). Also interesting is the rule for sequence regular expressions (PS). The first two premises state that v 1 and v 2 are the POSIX values for (s 1, r 1) and (s 2, r 2) respectively. Consider now the third premise and note that the POSIX value of this rule should match the string s 1 @ s 2. According to the Longest Match Rule, we want s 1 to be the longest initial split of s 1 @ s 2 such that s 2 is still recognised by r 2. Let us assume, contrary to the third premise, that there exist an s 3 and s 4 such that s 2 can be split up into a non-empty string s 3 and a possibly empty string s 4. Moreover the longer string s 1 @ s 3 can be matched by r 1 and the shorter s 4 can still be matched by r 2. In this case s 1 would not be the longest initial split of s 1 @ s 2 and therefore Seq v 1 v 2 cannot be a POSIX value for (s 1 @ s 2, r 1 • r 2). The main point is that our side-condition ensures the Longest Match Rule is satisfied.
A similar condition is imposed on the POSIX value in the P*-rule. Also there we want that s 1 is the longest initial split of s 1 @ s 2 and furthermore that the corresponding value v cannot be flattened to the empty string. In effect, we require that in each "iteration" of the star, some non-empty substring needs to be "chipped" away; only in case of the empty string do we accept Stars [] as the POSIX value. Indeed we can show that our POSIX values are lexical values, which excludes those Stars that contain subvalues flattening to the empty string:

Lemma 6  If (s, r) → v then v ∈ LV r s.
Next is the lemma that shows the function mkeps calculates the POSIX value for the empty string and a nullable regular expression.

Lemma 7  If nullable r then ([], r) → mkeps r.

The central lemma shows that inj preserves POSIX values:

Lemma 8  If (s, r\c) → v then (c :: s, r) → inj r c v.

Proof By induction on r. We explain two cases.
• Case r = r 1 + r 2. There are two subcases, namely (a) v = Left v′ and (s, r 1\c) → v′; and (b) v = Right v′, s ∉ L(r 1\c) and (s, r 2\c) → v′. In (a) we know (s, r 1\c) → v′, from which we can infer (c :: s, r 1) → inj r 1 c v′ by induction hypothesis and hence (c :: s, r 1 + r 2) → inj (r 1 + r 2) c (Left v′) as needed. Similarly in subcase (b) where, however, in addition we have to use Proposition 1(2) in order to infer c :: s ∉ L(r 1) from s ∉ L(r 1\c).

• Case r = r 1 • r 2. There are three subcases. For (a) we know (s 1, r 1\c) → v 1 and (s 2, r 2) → v 2 as well as the side-condition

  ¬(∃ s 3 s 4. s 3 ≠ [] ∧ s 3 @ s 4 = s 2 ∧ s 1 @ s 3 ∈ L(r 1\c) ∧ s 4 ∈ L(r 2))

From the latter we can infer by Proposition 1(2):

  ¬(∃ s 3 s 4. s 3 ≠ [] ∧ s 3 @ s 4 = s 2 ∧ (c :: s 1) @ s 3 ∈ L(r 1) ∧ s 4 ∈ L(r 2))

We can use the induction hypothesis for r 1 to obtain (c :: s 1, r 1) → inj r 1 c v 1. Putting this all together allows us to infer (c :: s 1 @ s 2, r 1 • r 2) → Seq (inj r 1 c v 1) v 2. For (b) we know (s, r 2\c) → v 1 and s ∉ L((r 1\c) • r 2). From the former we have (c :: s, r 2) → inj r 2 c v 1 by induction hypothesis for r 2. From the latter we can infer

  ¬(∃ s 3 s 4. s 3 ≠ [] ∧ s 3 @ s 4 = c :: s ∧ s 3 ∈ L(r 1) ∧ s 4 ∈ L(r 2))

By Lemma 7 we know ([], r 1) → mkeps r 1 holds. Putting this all together, we can conclude with (c :: s, r 1 • r 2) → Seq (mkeps r 1) (inj r 2 c v 1), as required.
Finally suppose r = r 1*. This case is very similar to the sequence case, except that we need to also ensure that |inj r 1 c v 1| ≠ []. This follows from (c :: s 1, r 1) → inj r 1 c v 1 (which in turn follows from (s 1, r 1\c) → v 1 and the induction hypothesis).
With Lemma 8 in place, it is completely routine to establish that the Sulzmann and Lu lexer satisfies our specification (returning the null value None iff the string is not in the language of the regular expression, and returning a unique POSIX value iff the string is in the language):

Theorem 9  (1) s ∉ L(r) if and only if lexer r s = None.  (2) s ∈ L(r) if and only if there exists a value v such that lexer r s = Some v and (s, r) → v.

Proof By induction on s using Lemmas 7 and 8.
In (2) we further know by Theorem 5 that the value returned by the lexer must be unique. A simple corollary of our two theorems therefore is:

Corollary 10
(1) lexer r s = None if and only if ¬(∃v. (s, r) → v)
(2) lexer r s = Some v if and only if (s, r) → v

This concludes our correctness proof. Note that we have not changed the algorithm of Sulzmann and Lu, but introduced our own specification for what a correct result, a POSIX value, should be. In the next section we show that our specification coincides with another one given by Okui and Suzuki using a different technique.

Ordering of Values According to Okui and Suzuki
While in the previous section we have defined POSIX values directly in terms of a ternary relation (see inference rules in Fig. 2), Sulzmann and Lu took a different approach in [21]: they introduced an ordering for values and identified POSIX values as the maximal elements. An extended version of [21] is available at the website of its first author; this includes more details of their proofs, but these are evidently not in final form yet. Unfortunately, we were not able to verify claims that their ordering has properties such as being transitive or having maximal elements.
Okui and Suzuki [16,17] described another ordering of values, which they use to establish the correctness of their automata-based algorithm for POSIX matching. Their ordering resembles some aspects of the one given by Sulzmann and Lu, but overall is quite different. To begin with, Okui and Suzuki identify POSIX values as minimal, rather than maximal, elements in their ordering. A more substantial difference is that the ordering by Okui and Suzuki uses positions in order to identify and compare subvalues. Positions are lists of natural numbers; the set Pos v of positions inside a value v is defined as

  Pos (Empty) ≡ {[]}
  Pos (Char c) ≡ {[]}
  Pos (Left v) ≡ {[]} ∪ {0 :: ps | ps ∈ Pos v}
  Pos (Right v) ≡ {[]} ∪ {1 :: ps | ps ∈ Pos v}
  Pos (Seq v 1 v 2) ≡ {[]} ∪ {0 :: ps | ps ∈ Pos v 1} ∪ {1 :: ps | ps ∈ Pos v 2}
  Pos (Stars vs) ≡ {[]} ∪ {n :: ps | n < len vs ∧ ps ∈ Pos (vs ! n)}

whereby len in the last clause stands for the length of a list. Clearly for every position inside a value there exists a subvalue at that position.
To help understand the ordering of Okui and Suzuki, consider two values v and w that both match the string xyz; that means if we flatten these values at their respective root position, we obtain xyz in both cases. However, suppose that at position [0], v matches xy whereas w matches only the shorter x. Then according to the Longest Match Rule, we should prefer v rather than w as POSIX value for the string xyz (and corresponding regular expression). In order to formalise this idea, Okui and Suzuki introduce a measure for subvalues at position p, called the norm of v at position p. We can define this measure in Isabelle as an integer as follows:

  ∥v∥ p ≡ if p ∈ Pos v then len |v| p else −1

where |v| p is the flattened subvalue of v at position p; we take its length, provided the position is inside v, and if not, then the norm is −1. The default for outside positions is crucial for the POSIX requirement of preferring a Left-value over a Right-value (if they can match the same string; see the Priority Rule from the Introduction). For this consider

  v = Left (Char x)    and    w = Right (Char x)

Both values match x. At position [0] the norm of v is 1 (the subvalue matches x), but the norm of w is −1 (the position is outside w according to how we defined the 'inside' positions of Left- and Right-values). Of course at position [1], the norms of v and w are reversed, but the point is that subvalues will be analysed according to lexicographically ordered positions. According to this ordering, the position [0] takes precedence over [1] and thus v will be preferred over w. With the norm and lexicographic order in place, we can state the key definition of Okui and Suzuki [16]: a value v 1 is smaller at position p than v 2, written v 1 ≺ p v 2, if and only if (i) the norm at position p is greater in v 1 (that is, the string |v 1| p is longer than |v 2| p) and (ii) at all positions that are inside v 1 or v 2 and that are lexicographically smaller than p, the two values have the same norm:

  v 1 ≺ p v 2 ≡ ∥v 1∥ p > ∥v 2∥ p ∧ (∀ q ∈ Pos v 1 ∪ Pos v 2. q ≺ lex p ⟶ ∥v 1∥ q = ∥v 2∥ q)

The position p in this definition acts as the first distinct position of v 1 and v 2, where both values match strings of different length [16]. Since at p the values v 1 and v 2 match different strings, the ordering is irreflexive. Derived from the definition above are the following two orderings:

  v 1 ≺ v 2 ≡ ∃ p. v 1 ≺ p v 2
  v 1 ⪯ v 2 ≡ v 1 ≺ v 2 ∨ v 1 = v 2

While we encountered a number of obstacles in establishing properties like transitivity for the ordering of Sulzmann and Lu (and which we failed to overcome), it is relatively straightforward to establish this property for the orderings ≺ and ⪯ by Okui and Suzuki.
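The norm and the position ordering can be rendered as a short Python sketch (our own rendering of the definitions above, not the paper's Isabelle code; Python's lexicographic comparison of tuples stands in for the ≺ lex ordering on positions):

```python
def flat(v):
    """The string underlying a value (tuple encoding as before)."""
    tag = v[0]
    if tag == "Empty": return ""
    if tag == "Char": return v[1]
    if tag == "Seq": return flat(v[1]) + flat(v[2])
    if tag in ("Left", "Right"): return flat(v[1])
    if tag == "Stars": return "".join(flat(u) for u in v[1])

def pos(v):
    """The set of positions inside a value, as tuples of naturals."""
    tag = v[0]
    if tag in ("Empty", "Char"): return {()}
    if tag == "Left": return {()} | {(0,) + p for p in pos(v[1])}
    if tag == "Right": return {()} | {(1,) + p for p in pos(v[1])}
    if tag == "Seq":
        return ({()} | {(0,) + p for p in pos(v[1])}
                     | {(1,) + p for p in pos(v[2])})
    if tag == "Stars":
        return {()} | {(n,) + p for n, u in enumerate(v[1]) for p in pos(u)}

def at(v, p):
    """Subvalue of v at position p (assumes p is inside v)."""
    if p == (): return v
    if v[0] in ("Left", "Right"): return at(v[1], p[1:])
    if v[0] == "Seq": return at(v[1] if p[0] == 0 else v[2], p[1:])
    if v[0] == "Stars": return at(v[1][p[0]], p[1:])

def pflat_len(v, p):
    """Norm of v at p: length of the flattened subvalue, or -1 outside v."""
    return len(flat(at(v, p))) if p in pos(v) else -1

def smaller_at(v1, v2, p):
    """v1 is smaller at p than v2: bigger norm at p, equal norms at all
    positions lexicographically smaller than p."""
    return (pflat_len(v1, p) > pflat_len(v2, p) and
            all(pflat_len(v1, q) == pflat_len(v2, q)
                for q in pos(v1) | pos(v2) if q < p))
```

With v = Left (Char x) and w = Right (Char x), the sketch confirms the Priority-Rule example above: v is smaller than w at position [0].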

Lemma 11 (Transitivity)  If v 1 ≺ v 2 and v 2 ≺ v 3 then v 1 ≺ v 3.

Proof From the assumption we obtain two positions p and q, where the values v 1 and v 2 (respectively v 2 and v 3) are 'distinct'. Since ≺ lex is trichotomous, we need to consider three cases, namely p = q, p ≺ lex q and q ≺ lex p. Let us look at the first case: at p we have ∥v 1∥ p > ∥v 2∥ p > ∥v 3∥ p, and at all positions lexicographically smaller than p the norms of v 1 and v 3 agree (both agree with the norms of v 2); hence v 1 ≺ p v 3. The proof for ⪯ is similar and omitted. It is also straightforward to show that ≺ and ⪯ are partial orders, and ≺ is well-founded over the lexical values of a given regular expression and given string. Okui and Suzuki furthermore show that they are linear orderings for lexical values [16], but we have not formalised this in Isabelle; it is not essential for our results. What we are going to show below is that for a given r and s, the orderings have a unique minimal element on the set LV r s, which is the POSIX value we defined in the previous section. We start with two properties that show how the length of a flattened value relates to the ≺-ordering.
Proposition 12  (1) If len |v 1| > len |v 2| then v 1 ≺ [] v 2.  (2) If |v 2| is a strict prefix of |v 1| then v 1 ≺ v 2.

Both properties follow from the definition of the ordering. Note that (2) entails that if the underlying string of a value, say v 2, is a strict prefix of another flattened value, say v 1, then v 1 must be smaller than v 2. For our proofs it will be useful to have the following properties; in each case the underlying strings of the compared values are the same:

Proposition 13
  (1) Left v ≺ Right w
  (2) Left v ⪯ Left w if and only if v ⪯ w
  (3) Right v ⪯ Right w if and only if v ⪯ w
  (4) If v 2 ⪯ w 2 then Seq v v 2 ⪯ Seq v w 2
  (5) If v 1 ≺ w 1 then Seq v 1 v 2 ⪯ Seq w 1 w 2
  (6) If Stars vs ⪯ Stars ws then Stars (v :: vs) ⪯ Stars (v :: ws)
  (7) If v ≺ w then Stars (v :: vs) ⪯ Stars (w :: ws)

One might prefer that statements (4) and (5) (respectively (6) and (7)) are combined into a single iff-statement (like the ones for Left and Right). Unfortunately this cannot be done easily: such a single statement would require an additional assumption about the two values Seq v 1 v 2 and Seq w 1 w 2 being inhabited by the same regular expression. The complexity of the proofs involved seems to not justify such a 'cleaner' single statement. The statements given are just the properties that allow us to establish our theorems without any difficulty. The proofs for Proposition 13 are routine.
Next we establish how Okui and Suzuki's orderings relate to our definition of POSIX values. Given a POSIX value v 1 for r and s, any other lexical value v 2 ∈ LV r s is greater than or equal to v 1:

Theorem 14  If (s, r) → v 1 and v 2 ∈ LV r s then v 1 ⪯ v 2.

Proof By induction on our POSIX rules. By Theorem 5 and the definition of LV, it is clear that v 1 and v 2 have the same underlying string s. The three base cases are straightforward: for example for v 1 = Empty, we have that v 2 ∈ LV 1 [] must also be of the form v 2 = Empty. Therefore we have v 1 ⪯ v 2. The inductive cases for r being of the form r 1 + r 2 and r 1 • r 2 are as follows:

• Case P+L with (s, r 1 + r 2) → Left w 1: In this case the value v 2 is either of the form Left w 2 or Right w 2. In the latter case we can immediately conclude with v 1 ⪯ v 2, since a Left-value with the same underlying string s is always smaller than a Right-value by Proposition 13(1). In the former case we have w 2 ∈ LV r 1 s and can use the induction hypothesis to infer w 1 ⪯ w 2. Because w 1 and w 2 have the same underlying string s, we can conclude with Left w 1 ⪯ Left w 2 using Proposition 13(2).

• Case P+R with (s, r 1 + r 2) → Right w 1: This case is similar to the previous case, except that we additionally know s ∉ L(r 1). This is needed when v 2 is of the form Left w 2. Since |v 2| = |w 2| = s and ⊢ w 2 : r 1, we can derive a contradiction to s ∉ L(r 1) using Proposition 2. So also in this case v 1 ⪯ v 2.
• Case PS with (s 1 @ s 2, r 1 • r 2) → Seq w 1 w 2: We can assume v 2 = Seq u 1 u 2 with u 1 : r 1 and u 2 : r 2. We have s 1 @ s 2 = |u 1| @ |u 2|. By the side-condition of the PS-rule we know that either s 1 = |u 1| and s 2 = |u 2|, or |u 1| is a strict prefix of s 1. In the latter case we can infer w 1 ≺ u 1 by Proposition 12(2) and from this v 1 ≼ v 2 by Proposition 13(5) (as noted above, v 1 and v 2 must have the same underlying string). In the former case we know u 1 ∈ LV r 1 s 1 and u 2 ∈ LV r 2 s 2. With this we can use the induction hypotheses to infer w 1 ≼ u 1 and w 2 ≼ u 2. By Proposition 13(4,5) we can again infer v 1 ≼ v 2.
The case for P * is similar to the PS-case and omitted.
This theorem shows that our POSIX value v 1 for a regular expression r and string s is in fact a minimal element of the values in LV r s. By Proposition 12(2) we also know that any value in LV r s′, where s′ is a strict prefix of s, cannot be smaller than v 1. The next theorem shows the opposite, namely that any minimal element in LV r s must be a POSIX value. This can be established by induction on r, but the proof can be drastically simplified by using the fact from the previous section about the existence of a POSIX value whenever a string s ∈ L(r).
Theorem 15 If v 1 ∈ LV r s and v 1 ≼ v 2 for all v 2 ∈ LV r s, then (s, r) → v 1.

Proof If v 1 ∈ LV r s then s ∈ L(r) by Proposition 2. Hence by Theorem 9(2) there exists a POSIX value v P with (s, r) → v P, and by Lemma 6 we also have v P ∈ LV r s. By Theorem 14 we therefore have v P ≼ v 1. If v P = v 1 then we are done. Otherwise we have v P ≺ v 1, which however contradicts the second assumption, namely that v 1 is the smallest element in LV r s. So we are done in this case too.
From this we can also show that if LV r s is non-empty (or equivalently s ∈ L(r)), then it has a unique minimal element. To sum up, we have shown that the (unique) minimal elements of the ordering by Okui and Suzuki are exactly the POSIX values we defined inductively in Sect. 3. This provides an independent confirmation that our ternary relation formalises the informal POSIX rules.

Optimisations
Derivatives as calculated by Brzozowski's method are usually more complex regular expressions than the initial one; the result is that derivative-based matching and lexing algorithms are often abysmally slow. However, various optimisations are possible, such as the simplifications of 0 + r, r + 0, 1 • r and r • 1 to r. These simplifications can speed up the algorithms considerably, as noted in [21]. One of the advantages of having a simple specification and correctness proof is that the latter can be refined to prove the correctness of such simplification steps. While the simplification of regular expressions according to rules like those in (3) is well understood, there is an obstacle for the POSIX value calculation algorithm by Sulzmann and Lu: if we build a derivative regular expression and then simplify it, we will calculate a POSIX value for the simplified derivative regular expression, not for the original (unsimplified) derivative regular expression. Sulzmann and Lu [21] overcome this obstacle by calculating not just a simplified regular expression, but also a rectification function that "repairs" the incorrect value.
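To illustrate the effect of these simplifications, here is a small Python sketch (our own tuple encoding, not the paper's Isabelle/HOL definitions) of a Brzozowski matcher that applies the rules 0 + r, r + 0, 1 • r and r • 1 to r after every derivative step; the added 0-absorption rule for sequences goes beyond the four rules in (3) and is marked as such.

```python
def nullable(r):
    tag = r[0]
    if tag in ('0', 'chr'): return False
    if tag in ('1', 'star'): return True
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def deriv(r, c):
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'chr': return ('1',) if r[1] == c else ('0',)
    if tag == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if tag == 'star': return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])               # seq
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def simp(r):
    tag = r[0]
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('0',): return r2                  # 0 + r  ~>  r
        if r2 == ('0',): return r1                  # r + 0  ~>  r
        return ('alt', r1, r2)
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('1',): return r2                  # 1 . r  ~>  r
        if r2 == ('1',): return r1                  # r . 1  ~>  r
        if ('0',) in (r1, r2): return ('0',)        # extra rule, not in (3)
        return ('seq', r1, r2)
    return r                                        # no simplification under stars

def matches(r, s):
    for c in s:
        r = simp(deriv(r, c))
    return nullable(r)
```

For example, matching (a • b)* keeps the simplified derivatives small, while the unsimplified derivatives grow with every step.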
The rectification functions can be (slightly clumsily) implemented in Isabelle/HOL as follows, using some auxiliary functions: The functions simp Alt and simp Seq encode the simplification rules in (3) and compose the rectification functions (simplifications can occur deep inside a regular expression). The main simplification function is then as follows, where id stands for the identity function. The function simp returns a simplified regular expression and a corresponding rectification function. Note that we do not simplify under stars: doing so seems to slow down the algorithm rather than speed it up. The optimised lexer is then given by the clauses: In the second clause we first calculate the derivative r\c and then simplify the result. This gives us a simplified derivative r s and a rectification function f r. The lexer is then recursively called with the simplified derivative, but before we inject the character c into the value v, we need to rectify v (that is, construct f r v). Before we can establish the correctness of lexer +, we need to show that simplification preserves the language and that simplification preserves our POSIX relation once the value is rectified (recall that simp generates a (regular expression, rectification function) pair):

Lemma 17 (1) L(fst (simp r)) = L(r). (2) If (s, fst (simp r)) → v then (s, r) → snd (simp r) v.

Proof Both statements are by induction on r. There is no interesting case for the first statement. For the second statement, of interest are the cases r = r 1 + r 2 and r = r 1 • r 2. In each case we have to analyse four subcases depending on whether fst (simp r 1) and fst (simp r 2) equal 0 (respectively 1). For example for r = r 1 + r 2, consider the subcase fst (simp r 1) = 0 and fst (simp r 2) ≠ 0. By assumption we know (s, fst (simp (r 1 + r 2))) → v. From this we can infer (s, fst (simp r 2)) → v and by IH also (*) (s, r 2) → snd (simp r 2) v.
Given fst (simp r 1) = 0 we know L(fst (simp r 1)) = ∅. By the first statement, L(r 1) is the empty set, meaning (**) s ∉ L(r 1). Taking (*) and (**) together gives by the P+R-rule (s, r 1 + r 2) → Right (snd (simp r 2) v). In turn this gives (s, r 1 + r 2) → snd (simp (r 1 + r 2)) v, as we needed to show. The other cases are similar.
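The idea of returning a (simplified regular expression, rectification function) pair can be sketched in Python as follows. This is our own illustrative encoding, not the Isabelle/HOL text: it covers only the alternative and sequence cases, with values written as tuples ('Empty',), ('Char', c), ('Left', v), ('Right', v) and ('Seq', v1, v2).

```python
ZERO, ONE = ('0',), ('1',)

def simp(r):
    """Return (simplified regex, rectification function). The rectification
    function maps a value for the simplified regex back to a value for r."""
    tag = r[0]
    if tag == 'alt':
        (r1, f1), (r2, f2) = simp(r[1]), simp(r[2])
        if r1 == ZERO:                              # 0 + r2  ~>  r2
            return r2, lambda v: ('Right', f2(v))
        if r2 == ZERO:                              # r1 + 0  ~>  r1
            return r1, lambda v: ('Left', f1(v))
        return (('alt', r1, r2),
                lambda v: ('Left', f1(v[1])) if v[0] == 'Left'
                          else ('Right', f2(v[1])))
    if tag == 'seq':
        (r1, f1), (r2, f2) = simp(r[1]), simp(r[2])
        if r1 == ONE:                               # 1 . r2  ~>  r2
            return r2, lambda v: ('Seq', f1(('Empty',)), f2(v))
        if r2 == ONE:                               # r1 . 1  ~>  r1
            return r1, lambda v: ('Seq', f1(v), f2(('Empty',)))
        return (('seq', r1, r2),
                lambda v: ('Seq', f1(v[1]), f2(v[2])))
    return r, (lambda v: v)                         # identity elsewhere
```

For example, simplifying 0 + a yields a, and the rectification function re-wraps a value Char a for a into the Right-value expected for the unsimplified alternative.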
We can now prove relatively straightforwardly that the optimised lexer produces the expected result:

Theorem 18 lexer + r s = lexer r s

Proof By induction on s, generalising over r. The case [] is trivial. For the cons-case suppose the string is of the form c :: s. By induction hypothesis we know lexer + r s = lexer r s holds for all r (in particular for r being the derivative r\c). Let r s be the simplified derivative regular expression, that is fst (simp (r\c)), and f r be the rectification function, that is snd (simp (r\c)). We distinguish the cases whether (*) s ∈ L(r\c) or not. In the first case we have by Theorem 9(2) a value v so that lexer (r\c) s = Some v and (s, r\c) → v hold. By Lemma 17(1) we can also infer from (*) that s ∈ L(r s) holds. Hence we know by Theorem 9(2) that there exists a v′ with lexer r s s = Some v′ and (s, r s) → v′. From the latter we know by Lemma 17(2) that (s, r\c) → f r v′ holds. By the uniqueness of the POSIX relation (Theorem 5) we can infer that v is equal to f r v′, that is, the rectification function applied to v′ produces the original v. Now the case follows from the definitions of lexer and lexer +.

C. Urban
In the second case, where s ∉ L(r\c), we have that lexer (r\c) s = None by Theorem 9(1). We also know by Lemma 17(1) that s ∉ L(r s). Hence lexer r s s = None by Theorem 9(1), and by IH then also lexer + r s s = None. With this we can conclude in this case too.
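The whole pipeline of lexer + (derivative, simplification with rectification, recursive call, rectify, inject) can be sketched as a runnable Python program. This is our own tuple encoding of the Sulzmann and Lu construction, not the Isabelle text; the names lexer_plus, deriv, mkeps, inj and simp mirror the paper's functions.

```python
# regexes: ('0',) ('1',) ('chr',c) ('alt',r1,r2) ('seq',r1,r2) ('star',r)
# values:  ('Empty',) ('Char',c) ('Left',v) ('Right',v) ('Seq',v1,v2) ('Stars',vs)
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    t = r[0]
    if t in ('0', 'chr'): return False
    if t in ('1', 'star'): return True
    if t == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])              # seq

def deriv(r, c):
    t = r[0]
    if t in ('0', '1'): return ZERO
    if t == 'chr': return ONE if r[1] == c else ZERO
    if t == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star': return ('seq', deriv(r[1], c), r)
    d = ('seq', deriv(r[1], c), r[2])                     # seq
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def mkeps(r):                                             # POSIX value for []
    t = r[0]
    if t == '1': return ('Empty',)
    if t == 'star': return ('Stars', [])
    if t == 'seq': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))

def inj(r, c, v):                                         # undo one derivative step
    t = r[0]
    if t == 'chr': return ('Char', c)
    if t == 'alt':
        return ('Left', inj(r[1], c, v[1])) if v[0] == 'Left' \
               else ('Right', inj(r[2], c, v[1]))
    if t == 'star':                                       # v = Seq v1 (Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])
    if v[0] == 'Seq': return ('Seq', inj(r[1], c, v[1]), v[2])
    if v[0] == 'Left': return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
    return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))       # v = Right v2

def simp(r):                                              # (simplified re, rectifier)
    t = r[0]
    if t == 'alt':
        (r1, f1), (r2, f2) = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2, lambda v: ('Right', f2(v))
        if r2 == ZERO: return r1, lambda v: ('Left', f1(v))
        return ('alt', r1, r2), (lambda v: ('Left', f1(v[1]))
                                 if v[0] == 'Left' else ('Right', f2(v[1])))
    if t == 'seq':
        (r1, f1), (r2, f2) = simp(r[1]), simp(r[2])
        if r1 == ONE: return r2, lambda v: ('Seq', f1(('Empty',)), f2(v))
        if r2 == ONE: return r1, lambda v: ('Seq', f1(v), f2(('Empty',)))
        return ('seq', r1, r2), lambda v: ('Seq', f1(v[1]), f2(v[2]))
    return r, (lambda v: v)                               # incl. no simp under stars

def lexer_plus(r, s):
    if s == "": return mkeps(r) if nullable(r) else None
    rs, fr = simp(deriv(r, s[0]))
    v = lexer_plus(rs, s[1:])
    return None if v is None else inj(r, s[0], fr(v))
```

On the classic example (a + ab) • (b + 1) with input ab, the sketch returns the POSIX value that chooses the longer left match ab, in line with the Longest Match Rule.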

Extensions
A strong point in favour of Sulzmann and Lu's algorithm is that it can be extended in various ways. If we are interested in tokenising a string, then we need to not just split up the string into tokens, but also "classify" the tokens (for example whether they are keywords or identifiers and so on). This can be done with only minor modifications to the algorithm by introducing record regular expressions and record values (for example [22]): where l is a label, say a string, r a regular expression and v a value. All functions can be smoothly extended to these regular expressions and values. For example, (l : r) is nullable iff r is, and so on. The purpose of the record regular expression is to mark certain parts of a regular expression and then record in the calculated value which parts of the string were matched by this part. The label can then serve as a classification of the tokens. For this recall the regular expression (r key + r id) * for keywords and identifiers from the Introduction. With the record regular expression we can form ((key : r key) + (id : r id)) * and then traverse the calculated value and collect the underlying strings in record values. With this we obtain finite sequences of pairs of labels and strings, for example (l 1 : s 1), ..., (l n : s n), from which tokens with classifications (keyword-token, identifier-token and so on) can be extracted.
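To show how record values support tokenisation, here is a small Python sketch (our own encoding, with record values tagged 'Rec'; it assumes a POSIX value has already been calculated). The function env traverses a value and collects the label/string pairs from all Rec-subvalues.

```python
def flat(v):
    """The underlying string of a value."""
    t = v[0]
    if t == 'Empty': return ""
    if t == 'Char': return v[1]
    if t in ('Left', 'Right'): return flat(v[1])
    if t == 'Seq': return flat(v[1]) + flat(v[2])
    if t == 'Stars': return "".join(flat(w) for w in v[1])
    return flat(v[2])                                  # ('Rec', label, v')

def env(v):
    """Collect (label, matched string) pairs from all Rec-subvalues."""
    t = v[0]
    if t == 'Rec': return [(v[1], flat(v[2]))] + env(v[2])
    if t in ('Left', 'Right'): return env(v[1])
    if t == 'Seq': return env(v[1]) + env(v[2])
    if t == 'Stars': return [p for w in v[1] for p in env(w)]
    return []

# a value that could result from lexing "ifx" with ((key : r_key) + (id : r_id))*
v = ('Stars', [('Left',  ('Rec', 'key', ('Seq', ('Char', 'i'), ('Char', 'f')))),
               ('Right', ('Rec', 'id',  ('Char', 'x')))])
```

Here env(v) yields the classified token sequence (key : if), (id : x).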
In the context of POSIX matching, it is also interesting to study additional constructors for bounded repetitions of regular expressions. For this let us extend the results from the previous sections to the following four additional regular expression constructors: r {n} (exactly-n-times), r {..n} (at-most-n-times), r {n..} (at-least-n-times) and r {n..m} (between-nm-times). We will call them bounded regular expressions. They can be used to specify how many times a regular expression should match. With the help of the power operator for sets of strings (definition omitted), the languages recognised by these regular expressions can be defined in Isabelle as follows: This definition implies that in the last clause r {n..m} matches no string in case m < n, because then the interval {n..m} is empty. While the languages recognised by these regular expressions are straightforward, some care is needed for how to define the corresponding lexical values. First, with a slight abuse of language, we will (re)use values of the form Stars vs for values inhabited in bounded regular expressions. Second, we need to introduce inductive rules for extending our inhabitation relation shown in (2), from which we then derived our notion of lexical values. Given the rule for r *, the rule for r {..n} just requires additionally that the length of the list of values must be smaller than or equal to n, that is: Like in the r *-rule, we require with the left premise that some non-empty part of the string is 'chipped' away by every value in vs, that is, the corresponding values do not flatten to the empty string.
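The power-operator definition of these languages can be made concrete with a short Python sketch (our own representation of languages as finite sets of strings, not the Isabelle definitions):

```python
def lang_pow(L, k):
    """L^k: all concatenations of k strings drawn from the language L."""
    result = {""}
    for _ in range(k):
        result = {u + w for u in result for w in L}
    return result

def between_nm(L, n, m):
    """L(r{n..m}) = union of L^k for n <= k <= m; empty if m < n."""
    return set().union(*(lang_pow(L, k) for k in range(n, m + 1)))
```

In particular between_nm(L, n, m) with m < n takes the union over an empty interval and therefore yields the empty language, as the last clause of the definition requires.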
In the rule for r {n} (that is exactly-n-times r) we will require that the length of the list of values equals n. But enforcing in this case that every one of these n values 'chips' away some part of the string would be too strong. Therefore matters are a bit more complicated in the rule for r {n}. According to the informal POSIX rules we have to allow an "initial segment" of values that chip away parts of the string; but if this segment is too short to satisfy the exactly-n-times constraint, it can be followed by a segment in which every value flattens to the empty string. One way of expressing this constraint in Isabelle is by the rule: Here vs 1 is the initial segment with non-empty flattened values, whereas vs 2 is the segment where all values flatten to the empty string. This idea gets even more complicated for the r {n..} regular expression. The reason is that we need to distinguish the case where we use fewer repetitions than n. In this case we need to "fill" the end with values that match the empty string in order to obtain at least n repetitions. But in case we need more than n repetitions, then all values should match a non-empty string. This leads to two inhabitation rules for r {n..}, both with the conclusion Stars vs : r {n..}. Note that these two rules "collapse" in case n = 0 to just the single rule given for r * in the definition shown in (2). We have similar rules for the between-nm-times operator (omitted). These rules ensure that our definition of the sets of lexical values LV r s is still finite and also fits with the ordering given by Okui and Suzuki (which requires minimal values over the sets LV r s).
Fortunately, the other definitions extend more smoothly to bounded repetitions. For example the rules for derivatives are: For mkeps we need to generate the shortest list of values we can "get away with" given the boundedness constraints. This means, for example, that in the case of r {..n} we can return the empty list, as for stars. In the other cases we have to generate a list of exactly n copies of the mkeps-value, because n is the smallest number of repetitions required. The first rule deals with the case when an empty string needs to be recognised; the second when the string is non-empty. In the latter case the "initial segment" must match non-empty strings only. The idea behind this formulation is to avoid situations where an earlier value matches the empty string while it is actually possible to "nibble away" some part of the string. The rules for the other bounded regular expressions are similar; we omit them here. With these definitions in place, our proofs from the previous sections extend to bounded repetitions. The main point is that there are no surprises.
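The counting behaviour of the derivative rules for bounded repetitions can be sketched as a Python matcher (our own tuple encoding with constructor tags 'upto', 'exact', 'atleast' and 'between'; the clauses decrement the bounds on each step, in the spirit of the rules described above):

```python
ZERO, ONE = ('0',), ('1',)

def nullable(r):
    t = r[0]
    if t in ('0', 'chr'): return False
    if t == '1': return True
    if t == 'alt': return nullable(r[1]) or nullable(r[2])
    if t == 'seq': return nullable(r[1]) and nullable(r[2])
    if t == 'upto': return True                               # r{..n}
    if t in ('exact', 'atleast'): return r[2] == 0 or nullable(r[1])
    n, m = r[2], r[3]                                         # r{n..m}
    return n <= m and (n == 0 or nullable(r[1]))

def deriv(r, c):
    t = r[0]
    if t in ('0', '1'): return ZERO
    if t == 'chr': return ONE if r[1] == c else ZERO
    if t == 'alt': return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'seq':
        d = ('seq', deriv(r[1], c), r[2])
        return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d
    if t == 'upto':                                           # one repetition used up
        return ZERO if r[2] == 0 else ('seq', deriv(r[1], c), ('upto', r[1], r[2] - 1))
    if t == 'exact':
        return ZERO if r[2] == 0 else ('seq', deriv(r[1], c), ('exact', r[1], r[2] - 1))
    if t == 'atleast':                                        # r{0..} behaves like r*
        return ('seq', deriv(r[1], c), ('atleast', r[1], max(r[2] - 1, 0)))
    n, m = r[2], r[3]                                         # r{n..m}
    if m == 0: return ZERO
    return ('seq', deriv(r[1], c), ('between', r[1], max(n - 1, 0), m - 1))

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

For instance a{2..3} matches exactly aa and aaa, and a{3..2} matches nothing, as its language is empty.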
What is good about our re-use of the Stars-constructor for the values of bounded regular expressions is that we did not need to make any changes to the ordering definitions by Okui and Suzuki. It still holds that our POSIX values are the minimal elements of the lexical value sets, and vice versa. In this way we again obtain independent assurance that our definitions correctly capture the idea behind POSIX matching.
Unfortunately, in our formal proofs in Isabelle/HOL we need to give the definitions and proofs all over again in a separate theory, since there is no way of making Isabelle accept proofs for the basic regular expressions (defined as an inductive datatype) and then augmenting the datatype with new constructors. This would be a really "cool" feature for Isabelle, but we have no idea how this could be achieved elegantly.

all derivatives of this regular expression stay below the size of 8 if they are simplified after each step. This is important because functions like nullable and derivative need to traverse regular expressions; if the size of derivatives is too large, then these functions will be slow, abysmally slow that is. There is also work by the same authors on Verbatim++, which is an improvement of the Verbatim lexer (using for example memoization) [7]. However, this work has a different focus than ours: their work uses derivatives in order to generate DFAs which are then used for lexing. While this might make the process of lexing faster for the "basic" regular expressions, classic DFAs have problems with bounded regular expressions. For them one has to connect many copies of DFAs, which increases their size and thus slows down the lexing process. As has been shown, derivatives can easily accommodate the bounded regular expressions without the need to make copies.
Most recently, the work by Moseley et al [24] has been included in the .NET7 regular expression library. They impressively extend Brzozowski derivatives to various anchors (like start-of-line or end-of-string) and lookarounds (like what is coming before or after a matched string). The latter have also been studied by Miyazaki and Minamide [25]. Moseley et al already mention a difference between their work and the work described here, namely that properties like L(r 1 • r 2) = L(r 1) @ L(r 2) do not hold anymore when anchors are added. It remains to be seen how our work can be adapted to such a setting. Another difference between their work and ours is that POSIX lexing is an inherently asymmetric problem, in the sense that it generates longest submatches (recall the Longest Match Rule from the Introduction). This is important for their matching algorithm, where they define a reverse operator for regular expressions, written r r, such that the following property holds: L(r r) = {rev(s) | s ∈ L(r)}. This means the language of the reversed regular expression is the set of reversed strings of L(r). This property is useful for finding substring matches as it allows Moseley et al to first find the end-location where a substring matches a regular expression and then use _ r in order to find the beginning of the matched substring. The problem with POSIX lexing is that one cannot use the POSIX value for r r and a string rev(s) in order to generate the POSIX value for r and s. We leave a full investigation of what we can adopt from their work to future work.
Our formalisation is available from the Archive of Formal Proofs [3] under http://www.isa-afp.org/entries/Posix-Lexing.shtml.
the x is matched by the left-most alternative in this tree and the y by the right-left alternative. This suggests recording this matching as Stars [Left (Char x), Right (Left (Char y))], where Stars, Left, Right and Char are constructors for values. Stars records how many iterations were used; Left, respectively Right, which alternative was used. The value for matching xy in a single 'iteration', i.e. the POSIX value, would look as follows: Stars [Right (Right (Seq (Char x) (Char y)))]

Fig. 2 Our inductive definition of POSIX values

Lemma 7 If nullable r then ([], r) → mkeps r.

Proof By routine induction on r.

The central lemma for our POSIX relation is that the inj-function preserves POSIX values.

Lemma 8 If (s, r\c) → v then (c :: s, r) → inj r c v.
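The function mkeps, which builds the POSIX value for the empty string from a nullable regular expression, can be sketched in Python (our own tuple encoding, not the Isabelle definition); note the preference for Left when both alternatives are nullable, mirroring the Priority Rule.

```python
def nullable(r):
    t = r[0]
    if t in ('0', 'chr'): return False
    if t in ('1', 'star'): return True
    if t == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])              # seq

def mkeps(r):
    """The POSIX value witnessing that the nullable r matches []."""
    t = r[0]
    if t == '1': return ('Empty',)
    if t == 'star': return ('Stars', [])                  # zero iterations
    if t == 'seq': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    # alternatives: prefer Left whenever the left alternative is nullable
    return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
```

For example, for 1 + a* the value is Left Empty, not Right (Stars []), because the left alternative has priority.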

Positions are lists of natural numbers. This allows them to quite naturally formalise the Longest Match and Priority rules of the informal POSIX standard. Consider for example the value v

v def= Stars [Seq (Char x) (Char y), Char z]

At position [0,1] of this value is the subvalue Char y and at position [1] the subvalue Char z. At the 'root' position, or empty list [], is the whole value v. Positions such as [0,1,0] or [2] are outside of v. If it exists, the subvalue of v at a position p, written v p, can be recursively defined; in the last clause of this definition, (Stars vs) (n :: ps) def= vs [n] ps, we use Isabelle's notation vs [n] for the nth element in a list. The set of positions inside a value v, written Pos v, is given by

Pos (Empty) def= {[]}
Pos (Char c) def= {[]}
Pos (Left v) def= {[]} ∪ {0 :: ps | ps ∈ Pos v}
Pos (Right v) def= {[]} ∪ {1 :: ps | ps ∈ Pos v}
Pos (Seq v 1 v 2) def= {[]} ∪ {0 :: ps | ps ∈ Pos v 1} ∪ {1 :: ps | ps ∈ Pos v 2}
Pos (Stars vs) def= {[]} ∪ (⋃ n < len vs. {n :: ps | ps ∈ Pos vs [n]})

All deviations we introduced are harmless.
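The definitions of Pos and the subvalue-at-a-position operation can be sketched in Python (our own encoding: values as tuples, positions as tuples of naturals):

```python
def pos(v):
    """The set of positions inside a value, following the Pos clauses."""
    t = v[0]
    ps = {()}                                             # the root position
    if t in ('Left', 'Right'):
        i = 0 if t == 'Left' else 1
        ps |= {(i,) + p for p in pos(v[1])}
    elif t == 'Seq':
        ps |= {(0,) + p for p in pos(v[1])} | {(1,) + p for p in pos(v[2])}
    elif t == 'Stars':
        for n, w in enumerate(v[1]):
            ps |= {(n,) + p for p in pos(w)}
    return ps

def at(v, p):
    """Subvalue of v at position p; None when p lies outside v."""
    if p == (): return v
    n, rest, t = p[0], p[1:], v[0]
    if t == 'Left' and n == 0: return at(v[1], rest)
    if t == 'Right' and n == 1: return at(v[1], rest)
    if t == 'Seq' and n in (0, 1): return at(v[n + 1], rest)
    if t == 'Stars' and n < len(v[1]): return at(v[1][n], rest)
    return None
```

Running this on the example value v = Stars [Seq (Char x) (Char y), Char z] reproduces the positions discussed above: [0,1] points at Char y, [1] at Char z, and [0,1,0] and [2] are outside of v.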