1 Introduction

Regular expressions (regexes) are an effective tool for the pattern matching problem as they can concisely describe pattern structures. Regexes are widely used in software applications such as search engines, text processing, programming languages, and compilers due to their compact representations. Although most developers find regexes powerful and flexible, they also find them difficult to learn for reasons related to readability, validity, and reliability [7, 16].

There have been several interesting approaches to automatically grading student submissions in automata-related courses in the online education environment. Alur et al. [2] propose a technique for automatically grading students’ DFA constructions in automata courses while generating high-level hints that help students understand how to correct their wrong submissions. For instance, they introduce the DFA edit difference to quantify the difference between the correct DFA and a student’s DFA, and MOSEL (an MSO-equivalent declarative logic) to capture the case where the student’s submission corresponds to a different MOSEL formula. Later, D’Antoni et al. [6] utilize the DFA edit difference in order to generate natural language feedback explaining how to correct the submitted DFA. They also conduct an online survey to collect students’ feedback about the quality, usability, and effectiveness of their grading system.

Kakkar [10] studies a similar problem, namely, the problem of grading regexes instead of DFAs. Inspired by the DFA edit difference [2], Kakkar proposes a new criterion called ‘Regex Edit Distance’, which is based on the string edit-distance between students’ regexes and correct ones. However, both works share the limitation that ‘optimal’ answers for the problems must be supplied by TAs, since partial grades are computed by comparing students’ submissions against those answers. Recently, D’Antoni et al. [5] propose Automata Tutor v3 (abbreviated to AT v3 hereafter), which is the latest version of the previous work [2]. In AT v3, they include automated grading and feedback generation for a variety of new automata problems, including problems that ask to create regexes, context-free grammars, pushdown automata, and even Turing machines for a given description (e.g., a natural language description, an automaton, or a grammar that belongs to a different class). However, they also rely on the string edit-distance for grading regexes, similarly to the work of [10]. Note that AT v3 provides counterexamples for incorrect regexes, such as strings that should (or should not) be accepted, as feedback to students.

In this paper, we introduce an automated grading framework for regular expressions that gives reasonable grades and helpful feedback. The overall structure of our regex grading scheme is illustrated in Fig. 1. As the goal of a regex construction problem is to construct a regex from a natural language description, the TA first creates the problem by providing the natural language description together with a logic formula that specifies the target regular language. Students then submit regexes corresponding to the given description. Finally, we use three algorithms to generate convincing partial grades and feedback by comparing the answer logic formula with the submission.

We aim to overcome several remaining limitations that have not been resolved by the earlier approaches. First, we claim that it is not appropriate to grade a student’s regex just by calculating the string edit-distance with the ‘solution regex’. There could be infinitely many regexes that describe the same language. Even when we consider the set of most compact regexes describing the regular language in question, there can be multiple regexes, since a minimal regex for a given regular language is not guaranteed to be unique. Moreover, the string edit-distance cannot take the structural similarity into account, while we can obtain hierarchical information from the tree form of the regex. Second, we should consider not only the syntactic discrepancies but also the semantic discrepancies arising from a misinterpretation of the problem. To compare logical differences in real time, we need a logic whose formulas can be matched against regexes and converted into DFAs in polynomial time; however, there is no existing compact logic that satisfies these requirements. Lastly, there is a lack of rich feedback that helps students study regexes. More detailed feedback, such as suggesting the shortest form of the regex, the logical differences between the answer and the submission, and an organized presentation of corner cases, would be more helpful than simple symbol-correction feedback.

Fig. 1. Overview of our automated regex grading framework

In order to resolve the above-mentioned issues, we propose a 3-step regex grading scheme that considers both syntactic and semantic discrepancies between submitted regexes and answer logic formulas (natural language descriptions). More specifically, first, to consider the syntactic discrepancy, instead of comparing a student’s regex with the solution regex, we compare the possible transforms of the student’s regex with the language of the solution. To this end, we apply tree-level edits to the parse tree of the regex to detect the possible syntactic mistakes made by the student. As shown in Fig. 1, after one tree-edit that adds the star operator to student A’s submission \(b+ab^*a\), the edited regex is equivalent to the TA’s logic \((b+ab^*a)^*\). Second, we take into account the possibility that a student simply misinterprets the specification of the language. For instance, we may consider that a submitted regex deserves a partial grade if the language expressed by the submission corresponds to a specification that is very similar to the given specification. Therefore, we consider the semantic discrepancy by applying logic-level edits to the logic formula for the specification and searching for a similar specification that exactly corresponds to the student’s regex. In this way, by considering the ‘similarity’ to the student’s regex, we can give a partial grade. For example, after one logic-edit that changes the parameter from ‘a’ to ‘b’ in the TA’s logic, the edited logic \(\mathrm{num\_div}(b,2,0)\) is equivalent to student B’s submission \((a+ba^*b)^*\). Finally, we take some corner cases into account, for example, when the language of a submitted regex misses only a small portion of the target language, such as the empty string or a language over a single symbol (\(a^*\) or \(b^*\) when \(\Sigma = \{a, b\}\)). For instance, we can find that \((b^*ab^*ab^*)^*\) cannot generate strings that contain no a’s and at least one b, although it generates the empty string. Moreover, we generate productive feedback for students using the byproducts of each partial grading algorithm so that they can understand what is wrong with the current submission and how to correct it into a correct regex.

The rest of the paper is organized as follows. Section 2 gives some definitions and notations. We introduce a set of declarative logic formulas for describing regular languages in Section 3 and our regex grading scheme in Section 4. The experimental results are provided in Section 5, and Section 6 concludes the paper.

2 Preliminaries

The size of a finite set S is denoted by |S|. Let \(\Sigma \) denote a finite alphabet and \(\Sigma ^*\) denote the set of all finite strings over \(\Sigma \). For \(m \in \mathbb {N}\), \(\Sigma ^{\le m}\) is the set of strings of length at most m over \(\Sigma \). A language over \(\Sigma \) is a subset of \(\Sigma ^*\). Given a set X, \(2^X\) denotes the power set of X. The symbol \(\lambda \) denotes the empty string. We define \(\textrm{mod}(m,n)\) to be \(\{k\mid k \mod m = n, k \in \mathbb {N}\}\). We also define \(\textrm{ind}(w,x) = \{k\mid w[k : k+|x|] = x , k \in \mathbb {N}\}\), where w[i : j] for \(i \le j\) denotes the substring of w obtained by concatenating the characters of w from index i to \(j-1\), to be the set of indices at which x occurs in w. Note that indices start from 1.
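To make the definitions of \(\textrm{mod}\) and \(\textrm{ind}\) concrete, the following Python sketch implements them under the 1-based indexing convention above; the helper names and the finite bound used to truncate \(\textrm{mod}\) are ours for illustration only.

```python
def mod_set(m: int, n: int, bound: int) -> set[int]:
    """Finite prefix of mod(m, n) = {k in N : k mod m = n}, restricted to k <= bound."""
    return {k for k in range(bound + 1) if k % m == n}

def ind(w: str, x: str) -> set[int]:
    """Set of 1-based indices k such that w[k : k+|x|] = x, i.e. occurrences of x in w."""
    return {k + 1 for k in range(len(w) - len(x) + 1) if w[k:k + len(x)] == x}

# Example: 'aba' occurs in 'ababa' at 1-based positions 1 and 3.
assert ind("ababa", "aba") == {1, 3}
assert mod_set(2, 0, 10) == {0, 2, 4, 6, 8, 10}
```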

A regular expression (regex) over \(\Sigma \) is \(a \in \Sigma \) or the empty string \(\lambda \), or is obtained by applying the following rules finitely many times. For regexes \(R_1\) and \(R_2\), the union \(R_1 + R_2\), the concatenation \(R_1 \cdot R_2\), and the Kleene-star \(R_1^*\) are also regexes.

Now we introduce a formal logic to be used to formally describe languages. Let \(w = w_1 w_2 \cdots w_n\) be a word over \(\Sigma \). For any \(i \in [1,n]\) and a symbol \(a \in \Sigma \), we say that a letter predicate a is true at i in w if \(w_i = a\). For example, the logic formula \(a(x) \wedge \exists y(y > x \wedge b(y))\) means that ‘there is a symbol a at the position x and a symbol b at a position later than x’. It is readily seen that the formula describes the language denoted by the regex \(a (a+b)^* b (a+b)^*\). It is well-known that regular languages are expressible in monadic second-order (MSO) logic [4].

Given a regex R, we define the parse tree T(R) to be the rooted tree representing the hierarchical structure of R. Each leaf is labeled by a symbol in \(\Sigma \cup \{\lambda \}\) and each internal node is labeled by an n-ary operation such as \(\cdot \) (concatenation) and \(+\) (union), or the unary operation \(*\) (Kleene-star). We define the regex tree edit-distance \(\mathrm{ed_{rt}}(R, R')\) of two regexes R and \(R'\) to be the tree edit-distance between the parse trees of R and \(R'\). Note that the tree edit-distance between T(R) and \(T(R')\) is defined as the minimum number of edit-operations required to transform the tree T(R) into \(T(R')\), where an edit-operation for the regex tree edit-distance is a substitution of an operation symbol or a character from \(\Sigma \) by a different operation symbol or character, an insertion of a node, or a deletion of a node. It should be mentioned that we perform unordered matching between the children of nodes labeled by the union \(+\) operator, as the order of elements inside the union operator does not matter.
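As an illustration of the parse trees over which the regex tree edit-distance is defined, the following sketch gives a minimal tree representation (the Node class is ours, not the paper’s implementation) and builds the tree of \((b+ab^*a)^*\) from Fig. 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of a regex parse tree: a symbol/lambda leaf or an operator ('+', '.', '*')."""
    label: str
    children: list = field(default_factory=list)

def size(t: Node) -> int:
    """Number of nodes in the parse tree."""
    return 1 + sum(size(c) for c in t.children)

# Parse tree of (b + a b* a)*: a star over the union of 'b' and the concatenation a . b* . a.
tree = Node('*', [
    Node('+', [
        Node('b'),
        Node('.', [Node('a'), Node('*', [Node('b')]), Node('a')]),
    ]),
])
# When computing the regex tree edit-distance, the children of a '+' node are
# matched without regard to order, as described above.
```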

3 Simple Declarative Logic for Regular Languages

Since MSO logic formulas offer a relatively higher-level specification of regular languages than finite-state automata recognizing the languages, they can be used for describing regular languages in a human-readable format. Moreover, we can always compile an MSO logic formula for a regular language into a corresponding minimal DFA [12] and therefore, a regex as well.

Table 1. A list of regex problems from famous automata textbooks.

As the transformation from MSO to DFA may require the size of the alphabet to grow exponentially in the number of nested quantifiers [8], we restrict our attention to logic formulas that can describe all regular languages considered in famous automata textbooks, without covering the whole class of regular languages, while being convertible into a corresponding DFA in polynomial time. Table 2 shows the list of declarative logic formulas considered in this paper. Recall that MOSEL [2], an extension of MSO logic with some syntactic sugar that allows describing regular languages more concisely, was introduced for a similar reason. However, we claim that our logic formulas correspond to NL descriptions at a much higher level and allow us to perform language equivalence tests in practical runtime.

Analogously to the parse tree of a regex, we define the parse tree \(T(\phi )\) for a given logic formula \(\phi \). Here each leaf is labeled by an atomic formula and each internal node is labeled by the unary logical connective \(\lnot \) (negation) or an n-ary logical connective such as \(\wedge \) (conjunction) or \(\vee \) (disjunction). Similarly to the regex tree edit-distance, we define the logic tree edit-distance \(\mathrm{ed_{lt}}(\phi , \tilde{\phi })\) of two logic formulas \(\phi \) and \(\tilde{\phi }\) as the unordered tree edit-distance between their parse trees. For the logic tree edit-distance, we allow the substitution of an atomic logic formula and the substitution between the two logical connectives conjunction and disjunction, as well as the insertion and deletion of negation. The substitution of an atomic logic formula modifies a single parameter, namely a string parameter x or y, a non-negative integer m or n, or a comparison operator \(\square \in \{ >, =, <\}\). While the edit cost of substituting a logical connective equals 1, we assign the string edit-distance for the substitution of a string parameter, the numerical difference for an integer, and the value 1 for the substitution of a comparison operator.
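The per-edit costs described above can be summarized by the following sketch; the tuple encoding of parameters and the function names are ours, but the costs follow the text: string edit-distance for string parameters, numerical difference for integers, and unit cost for comparison operators and logical connectives.

```python
def string_edit_distance(x: str, y: str) -> int:
    """Standard Levenshtein distance between two string parameters."""
    dp = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        prev, dp[0] = dp[0], i
        for j, cy in enumerate(y, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cx != cy))
    return dp[-1]

def substitution_cost(old, new) -> int:
    """Cost of substituting one parameter (or connective) of an atomic formula."""
    if isinstance(old, str) and old in {'>', '=', '<'} and new in {'>', '=', '<'}:
        return 1                                  # comparison operators: unit cost
    if isinstance(old, int) and isinstance(new, int):
        return abs(old - new)                     # integer parameters: numerical difference
    if isinstance(old, str) and isinstance(new, str):
        return string_edit_distance(old, new)     # string parameters: string edit-distance
    return 1                                      # logical connectives and everything else

# Example: pos(aba, 3) -> pos(aba, 2) costs 1; num(abab, >, 0) -> num(ab, =, 0) costs 2 + 1 = 3.
assert substitution_cost(3, 2) == 1
assert substitution_cost('abab', 'ab') + substitution_cost('>', '=') == 3
```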

We provide a list of regex problems and solutions collected from famous automata textbooks in Table 1. For each problem, we provide a natural language description for a regular language in question, a solution regular expression given in the textbook, and the corresponding logic formula found by us. We denote \(a+\lambda \) by \(a^?\) for brevity.

Table 2. A list of declarative logic formulas used to describe regular languages that appear in famous automata textbooks, where \(m, n \in \mathbb {N}\), \(a,b\in \Sigma \), \(x, y \in \Sigma ^*\), and \(\square \in \{ >, =, <\}\). In the set notation, we broadcast \(+ n\) and \(- n\) for some integer n to each element of the given set.

4 Regex Grading Algorithm

In this section, we explain our automated regex grading algorithm by considering both syntactic and semantic properties.

Table 3. Examples of incorrect regexes for ‘Even number of a’s’, which has a possible solution \((b+ab^*a)^*\).

4.1 Grading of Regexes

Let us assume that the exact logic formulas for the regular languages asked in questions are already known, as teachers can always specify the regular languages with the logic formulas provided in Table 2. We aim at grading the submitted regex in terms of two types of correctness, syntactic and logical, together with a set of counterexamples for corner cases, as follows:

Syntactic grading Recall that previous approaches to computing the syntactic similarity or dissimilarity between two regexes rely on string edit-distance between two regexes. However, the string edit-distance between two regexes does not take the structural similarity into account. We instead use the tree edit-distance between two parse trees of regexes as the tree edit-distance better reflects the structural similarity of regexes. One of the advantages of using the tree edit-distance is that we can also easily identify semantically equivalent regexes when they are viewed as parse trees rather than as strings.

Then, we define the syntactic grade of R based on the minimum tree edit-distance between R and an unknown regex \(\tilde{R}\) such that \(L(\tilde{R}) = L(\phi )\). Formally speaking, the syntactic grade of R is defined as follows:

$$\begin{aligned} G_\textrm{syn} = G_\textrm{full} -w_\textrm{syn}(R) \cdot \min \{ \mathrm{ed_{rt}}(R, \tilde{R}) \mid L(\tilde{R}) = L(\phi ) \}, \end{aligned}$$
(1)

where \(G_\textrm{full}\) denotes the full grade (10 in our implementation). The function \(w_\textrm{syn}\) scales the deducted points based on the length of the submitted regex R: if R is very long and requires only a single edit, we may consider R syntactically similar enough to a solution.

Let us explain the detailed procedure for computing \(G_\textrm{syn}\). We first parse the regex R as a binary tree and construct the set \(S_{R,n} = \{ \tilde{R} \mid \mathrm{ed_{rt}}(R, \tilde{R}) \le n \}\) of regexes within tree edit-distance n of R (\(n=2\) in our experiments). Note that we use the tree edit-distance instead of the string edit-distance used in AT v3 and RegED, as the tree edit-distance better captures the syntactic difference between two regexes. For instance, the tree edit-distance between \(a +b\) and \((b+a)^*\) is one while the string edit-distance is five.

To run the above procedure more efficiently, we increment the value of n from zero by one at each iteration until we find such an \(\tilde{R}\). We also check whether the current regex has already been examined in a previous iteration by comparing parse trees of regexes, so that our implementation avoids redundant regex equivalence tests.
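A minimal sketch of this iterative-deepening computation of \(G_\textrm{syn}\) is given below; the callables is_solution, neighbors, and canonical, as well as the already-evaluated weight w_syn, are placeholders for the language-equivalence test against \(L(\phi )\), the one-edit tree neighborhood, the unordered canonicalization of union children, and the length-based scaling function described above.

```python
def syntactic_grade(R_tree, is_solution, neighbors, canonical, w_syn,
                    full_grade=10, max_edits=2):
    """Iterative-deepening search for the syntactic grade G_syn (sketch).

    R_tree       -- parse tree of the student's regex R
    is_solution  -- callable: parse tree -> True iff its language equals L(phi)
    neighbors    -- callable: parse tree -> all trees at one tree edit from it
    canonical    -- callable: parse tree -> hashable key (children of '+' sorted)
    w_syn        -- per-edit deduction, i.e. w_syn(R) already evaluated
    """
    seen = set()
    frontier = [R_tree]                 # trees reachable with exactly n tree edits
    for n in range(max_edits + 1):
        fresh = []
        for t in frontier:
            key = canonical(t)
            if key in seen:             # skip redundant equivalence tests
                continue
            seen.add(key)
            fresh.append(t)
            if is_solution(t):
                return full_grade - w_syn * n
        # expand the frontier by one more tree edit
        frontier = [t2 for t in fresh for t2 in neighbors(t)]
    return 0                            # no partial syntactic grade within max_edits edits
```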

Logical grading Given a problem ‘A regex for strings where the string aba appears at the 3rd position.’, a student may submit an incorrect solution \((a+b)aba(a+b)^*\) by misreading the number ‘3’ as ‘2’. Because the most plausible answer is \((a+b)(a+b)aba(a+b)^*\), the student’s submission is likely to receive no partial grade under the syntactic grading, which could be a harsh decision for such an elementary mistake. However, if we compare the submission and the problem semantically, there is hope for a partial grade, as they turn out to be very similar in terms of the corresponding logic formulas \(\textrm{pos}(aba, 2)\) and \(\textrm{pos}(aba, 3)\).

The main challenge in logical grading is to find a logic formula that corresponds to the submitted regex such that we can effectively quantify the amount of semantic discrepancy between the submitted regex and the problem. Given a regex, it requires a considerable amount of computation for finding a logic formula described as a logical combination of formulas provided in Table 2, assuming that the only feasible approach is an exhaustive tree search. Even worse, it is not always possible to find such a corresponding logic as the provided set of logic formulas cannot cover the entire class of regular languages. In order to save computation time, we instead utilize the solution logic formula by applying tree-level edits to the parse tree of the solution logic formula at most n times (again, \(n=2\) in our implementation) and checking whether the edited formula is language-equivalent to the submitted regex.
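To illustrate the parameter-level portion of these logic edits, the sketch below enumerates atomic formulas at edit cost one from a given formula by perturbing its parameters; the tuple encoding and the alphabet \(\{a, b\}\) are assumptions for illustration, and structural edits on logical connectives are omitted.

```python
SIGMA = "ab"
COMPARATORS = [">", "=", "<"]

def one_edit_variants(formula):
    """Atomic formulas at logic tree edit cost exactly 1 from `formula`.

    A formula is a tuple like ('pos', 'aba', 3) or ('num', 'a', '>', 1).
    Integer parameters are shifted by 1, comparison operators are swapped,
    and string parameters undergo a single character edit.
    """
    name, *args = formula
    for i, arg in enumerate(args):
        if isinstance(arg, int):
            candidates = [arg - 1, arg + 1] if arg > 0 else [arg + 1]
        elif arg in COMPARATORS:
            candidates = [c for c in COMPARATORS if c != arg]
        else:  # string parameter: one deletion, substitution, or insertion
            candidates = set()
            for j in range(len(arg)):
                candidates.add(arg[:j] + arg[j + 1:])                        # delete
                candidates.update(arg[:j] + c + arg[j + 1:] for c in SIGMA)  # substitute
            candidates.update(arg[:j] + c + arg[j:]
                              for j in range(len(arg) + 1) for c in SIGMA)   # insert
            candidates.discard(arg)
        for new in candidates:
            yield (name, *args[:i], new, *args[i + 1:])

# Example: among the one-edit variants of pos(aba, 3) is pos(aba, 2),
# which matches the misread-position submission discussed above.
assert ('pos', 'aba', 2) in set(one_edit_variants(('pos', 'aba', 3)))
```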

If we manage to find a logic formula \(\tilde{\phi }\) that corresponds to the submitted regex, the logical grade of R is computed as follows:

$$\begin{aligned} G_\textrm{log} = G_\textrm{full} - w_\textrm{log}(\phi ) \cdot \min \{ \mathrm{ed_{lt}}(\phi , \tilde{\phi }) \mid L(\tilde{\phi }) = L(R) \}. \end{aligned}$$
(2)

Corner case grading In some cases, the submitted regex may describe a language very similar to the language in question although the regex is syntactically different (e.g., the tree edit-distance is larger than n). For instance, let us consider a problem with the following description: “Strings with an even number of a’s.” provided in Table 3. The language described by the regex \((b^*ab^*ab^*)^*\) is quite similar to the described language except for strings consisting only of b’s. In order to check whether the submitted regex deserves a corner case partial grade, we construct two DFAs for the following languages: \(L(R) \cap \overline{L(\phi )}\) and \(\overline{L(R)} \cap L(\phi )\). The language \(L(R) \cap \overline{L(\phi )}\) is the set of strings that are described by R but not by \(\phi \) (false positive examples). On the contrary, \(\overline{L(R)} \cap L(\phi )\) captures the set of strings that are described by \(\phi \) but not by R (false negative examples). We enumerate strings from both DFAs in lexicographical order using the enumDFA function of the FAdo library and display them to users so that they can understand, via counterexamples, why their submissions are incorrect.

We also assign a corner case grade \(G_\textrm{cor} = \frac{4}{5} \times G_\textrm{full}\) if the false positive and false negative sets satisfy one of the following conditions (a minimal sketch of this check is given after the list):

  1. There is only \(\lambda \) in either the false positive or the false negative set.

  2. There are fewer than m false positive and false negative strings.

  3. \(L(R) \cup L(a^*) = L(\phi )\) or \(L(R) \cup L(b^*) = L(\phi )\).
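Under the assumption that the false positive and false negative sets have already been enumerated (up to some length bound) from the two product DFAs above, a minimal sketch of the corner-case decision is as follows; the threshold m and the finite approximation of condition 3 are our own illustrative choices.

```python
def corner_case_grade(fp_strings, fn_strings, full_grade=10, m=3):
    """Corner-case check over enumerated counterexamples (sketch).

    fp_strings -- strings enumerated from a DFA for the intersection of L(R)
                  and the complement of L(phi)  (false positives)
    fn_strings -- strings enumerated from a DFA for the intersection of the
                  complement of L(R) and L(phi)  (false negatives)
    The enumeration itself is done with FAdo in the paper; here we only sketch
    the decision logic, and the threshold m is a tunable parameter.
    """
    fp, fn = set(fp_strings), set(fn_strings)
    only_empty_string = (fp == {""} and not fn) or (fn == {""} and not fp)
    few_counterexamples = len(fp) + len(fn) < m
    # Missing exactly the all-a (or all-b) strings, e.g. (b*ab*ab*)* for 'even number of a's':
    misses_all_a = (not fp) and bool(fn) and all(set(w) <= {"a"} for w in fn)
    misses_all_b = (not fp) and bool(fn) and all(set(w) <= {"b"} for w in fn)
    if only_empty_string or few_counterexamples or misses_all_a or misses_all_b:
        return 4 * full_grade // 5          # G_cor = (4/5) * G_full
    return 0

# Example: (b*ab*ab*)* misses exactly the nonempty all-b strings of 'even number of a's'.
assert corner_case_grade(fp_strings=[], fn_strings=["b", "bb", "bbb"]) == 8
```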

4.2 State Complexity of Logic Formula’s DFAs

It is easy to see that all atomic logic formulas presented in Table 2 can be represented by DFAs of size linear in the lengths of their string parameters. In the following, \(m, n \in \mathbb {N}\), \(a,b\in \Sigma \), \(x, y \in \Sigma ^*\), and \(\square \in \{ >, =, <\}\).

Proposition 1

For each atomic logic formula \(\phi \) in Table 2, we can construct a DFA recognizing \(L(\phi )\) with a polynomial number of states in |x| and |y|.

Fig. 2. An NFA for \(\mathrm{pos\_rev}(a, n)\).

While most of the formulas in Table 2 can be represented as DFAs of size linear in the numerical parameters m and n as well, there are two exceptions: ‘\(\mathrm{pos\_rev}(x,n)\)’ and ‘\(\mathrm{pos\_every\_rev}(x, m, n)\)’.

Proposition 2

For each atomic logic formula \(\phi \) in Table 2 except \(\mathrm{pos\_rev}(x, n)\) and \(\mathrm{pos\_every\_rev}(x, m, n)\), we can construct a DFA recognizing \(L(\phi )\) with a polynomial number of states in m and n.

Unlike the other formulas, the state complexity of \(\mathrm{pos\_rev}(x, n)\) and \(\mathrm{pos\_every\_rev}(x, m, n)\) is exponential in n in the worst case.

Lemma 1

The state complexity of \(\mathrm{pos\_rev}(x, n)\) is exponential in n.

Proof

Since the NFA construction for \(\mathrm{pos\_rev}(x, n)\) requires \(|x| + n + 1\) states, we have a simple upper bound \(2^{|x| + n + 1}\) which is exponential in n for the state complexity of \(\mathrm{pos\_rev}(x, n)\).

The simplest example where the lower bound is also exponential in n is when x is a string of length one such as a or b. See Fig. 2 for an NFA accepting the regular language described by \(\mathrm{pos\_rev}(a, n)\). Since the initial state \(q_0\) has a self-loop labeled by \(\Sigma \), it is easy to see that the upper bound of the state complexity is \(2^n\) as \(q_0\) is always in the state set in the subset construction.

Now we will show that the upper bound \(2^n\) can be reached by describing how we can reach any subset of states from \(2^{\{q_1, q_2, \ldots , q_{n+1}\}}\). Let us consider a state set \(P = \{ q_{s_1}, q_{s_2}, \ldots , q_{s_k} \}\), where \(s_i < s_j\) for \(1 \le i < j \le k \le n+1\). Then, we can reach P by reading the following string:

$$ ab^{s_k - s_{k-1}-1}ab^{s_{k-1} - s_{k-2} - 1} \cdots ab^{s_1 -1}. $$

Since it is easy to see that all subsets in \(2^{\{q_1, q_2, \ldots , q_{n+1}\}}\), viewed as states of the subset-construction DFA, are pairwise distinguishable, we conclude that the state complexity of \(\mathrm{pos\_rev}(a, n)\) is \(2^n\).

Now the following state complexity is obvious from the above observation.

Proposition 3

The state complexity of \(\mathrm{pos\_every\_rev}(x, m, n)\) is exponential in n.

4.3 Heuristics for Faster Computation

In order to avoid this exponential blow-up in the size of DFAs, we employ the following two heuristics for faster computation of grades.

Regex reverse trick Interestingly, we can avoid this exponential blow-up caused by \(\mathrm{pos\_rev}(x, n)\) by reversing the given regex and the logic formula at the same time. We can trivially reverse the regex while maintaining the length and construct polynomial-sized DFAs for all reversed logic formulas except \(\textrm{pos}(x, n)\). For instance, suppose that we are given a regex R and a declarative logic formula \(\phi \) as follows:

$$\begin{aligned} R&= a(a+b)b^*b \text { and }\\ \phi&= \mathrm{pos\_rev}(b, n) \wedge \textrm{len}(>, 3) \wedge \textrm{num}(a, >, 1). \end{aligned}$$

In order to avoid the exponential blow-up by \(\mathrm{pos\_rev}(x, n)\), we reverse R and \(\phi \) as follows:

$$\begin{aligned} R'&= bb^*(a+b)a \text { and }\\ \phi '&= \textrm{pos}(b, n) \wedge \textrm{len}(>, 3) \wedge \textrm{num}(a, >, 1). \end{aligned}$$

Note that logic formulas such as \(\textrm{len}(\square , n)\) and \(\textrm{len}(x, \square , n)\) are reversal-invariant.
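A minimal sketch of the regex half of the reverse trick is given below, using a (label, children) tuple encoding of parse trees; on the logic side, the reversal simply swaps \(\textrm{pos}\) with \(\mathrm{pos\_rev}\) as in the example above.

```python
def reverse_regex(t):
    """Reverse the language of a regex parse tree: reverse the order of the
    children of every concatenation node; union, star, and leaves are unchanged.
    Nodes are (label, children) pairs with label in {'+', '.', '*'} or a symbol/lambda leaf."""
    label, children = t
    if label == '.':
        return ('.', [reverse_regex(c) for c in reversed(children)])
    return (label, [reverse_regex(c) for c in children])

# a(a+b)b*b  -->  bb*(a+b)a
R = ('.', [('a', []), ('+', [('a', []), ('b', [])]), ('*', [('b', [])]), ('b', [])])
print(reverse_regex(R))
# ('.', [('b', []), ('*', [('b', [])]), ('+', [('a', []), ('b', [])]), ('a', [])])
```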

Concise Normal Form Recall that we construct a set of regexes from a submitted regex R by applying parse-tree-level edits when computing the syntactic grade. The main computational bottleneck comes from the repetitive regex equivalence tests, as there are too many regexes in the set. In order to reduce the size of the constructed set, we employ the concise normal form [11] of regexes, which has been shown to be useful for reducing the number of redundant regexes. For instance, we inductively apply substitution rules to subregexes, such as \(R^*R \rightarrow RR^*\), \(R^*R^* \rightarrow R^*\), \(R + R^* \rightarrow R^*\), and \((R^*)^* \rightarrow R^*\), for concise regex representation and pruning of redundant regexes.
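The following sketch applies a few of these rewrite rules bottom-up on the tuple-encoded parse trees used above; it is only a partial illustration of the concise normal form of [11], not the full rule set.

```python
def simplify(t):
    """Bottom-up application of a few concise-normal-form rewrite rules:
    (R*)* -> R*,  R*R* -> R*,  R*R -> RR*  (partial rule set for illustration)."""
    label, children = t
    children = [simplify(c) for c in children]
    if label == '*' and children[0][0] == '*':            # (R*)* -> R*
        return children[0]
    if label == '.':
        out = []
        for c in children:
            if out and out[-1][0] == '*' and c[0] == '*' and out[-1][1] == c[1]:
                continue                                   # R*R* -> R*
            if out and out[-1][0] == '*' and out[-1][1] == [c]:
                out[-1], c = c, out[-1]                    # R*R -> RR*
            out.append(c)
        return out[0] if len(out) == 1 else ('.', out)
    return (label, children)

# (a*)* a* a  simplifies to  a a*
t = ('.', [('*', [('*', [('a', [])])]), ('*', [('a', [])]), ('a', [])])
print(simplify(t))   # ('.', [('a', []), ('*', [('a', [])])])
```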

4.4 Description of Regex Grading Algorithm

Algorithm 1 precisely describes the whole procedure for computing the final grade of a student’s regex R for a problem corresponding to a declarative logic formula \(\phi \). First, we preprocess the student’s regex R and the declarative logic formula using the concise normal form and the reverse trick for faster computation, and convert them into DFAs for partial grading. If the submission is equivalent to the solution, the algorithm gives the full grade of 10 points; otherwise, it gives the highest of the three partial grades.

Algorithm 1. The overall regex grading procedure.
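The control flow of Algorithm 1 can be summarized by the following Python sketch; the grading callables and the equivalence test are placeholders for the modules of Sections 4.1–4.3, and the inputs are assumed to be already preprocessed with the concise normal form and the reverse trick.

```python
def grade(R, phi, graders, is_equivalent, full_grade=10):
    """Overall grading flow (sketch of Algorithm 1).

    R, phi        -- the (preprocessed) submitted regex and solution logic formula
    graders       -- callables (R, phi) -> partial grade, e.g. the syntactic,
                     logical, and corner-case modules of Section 4.1
    is_equivalent -- callable (R, phi) -> True iff L(R) = L(phi), e.g. via
                     DFA conversion and a DFA equivalence test
    """
    if is_equivalent(R, phi):          # correct submission: full marks
        return full_grade
    # incorrect submission: the best of the three partial grades
    return max((g(R, phi) for g in graders), default=0)
```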

4.5 Converting Regex to NL Description

Many researchers have studied the problem of translating an NL description into a corresponding regex [13, 15, 17]. Here we examine a dual problem, namely, the problem of converting a regex into an NL description (Regex2NL) to help regex learners easily understand the language accepted by a given regex. Consider \((b+ab^*a)^*\) as an example again. Instead of merely translating the semantics of regex operators and symbols, our goal is to generate an ‘easy-to-understand’ NL description such as ‘even number of a’s’, which corresponds to a logic formula defined in Table 1.

Our approach involves two steps, where we first find a logic formula corresponding to the regex and then translate the logic formula into an NL description by rules. It is worth noting again that there are regexes that cannot be effectively described by our logic. Therefore, it is not always possible to find a corresponding logic from a given regex even if we enumerate all logic formulas. Even if there exists a corresponding logic for the given regex, it takes too much time (more than one minute in general) for practical use in most cases. Hence we propose to use a deep learning-based approach that can predict a logic formula from a given regex with reasonably high accuracy in practical runtime (less than one second).

First, we train the Regex2Logic model, which translates a regex into a logic formula, using a sequence-to-sequence neural network with an attention mechanism [3]. For training, we construct a dataset of 13,437 regex-logic pairs, where each logic formula is defined using our simple declarative logic formulas. The pairs are collected by time-consuming enumerations of regexes and logic formulas as well as by regex templates, and are split into training, validation, and test sets in the ratio of 8:1:1. We explain each collection process in more detail as follows:

  1. Regex enumeration: enumerate regexes from the simplest one to more complex ones by increasing the depth of their parse trees and searching for corresponding logic formulas, until pre-defined thresholds on the complexity of the logic formulas (two for the depth, three for the length of argument strings and for integers) are reached.

  2. Logic formula enumeration: enumerate atomic logic formulas by varying the arguments, such as strings of length up to n and integers from 1 to n, and find a corresponding regex by exhaustively enumerating regexes.

  3. Regex template: use regex templates for which we can easily match corresponding logic formulas. For instance, regexes with no operator such as aba correspond to the logic \(\mathrm{single\_word}(aba)\).

Table 4 shows the statistics of our dataset, especially in terms of the distribution of logic formulas used. The conjunction or disjunction of the same logic formulas is counted as a conjunction or disjunction.

Table 4. Statistics of the constructed regex-logic pair dataset used to train our Regex2NL model. \(\phi , \phi _1,\) and \(\phi _2\) denote atomic logic formulas found by enumerations of regexes and logic formulas or regex templates.

In order to construct a set of regex-logic pairs, we can manually define a regex in a generalized form for each logic formula with arbitrary arguments. We rely on the following list of regex templates for generating various regexes by changing arguments of the templates:

  • \(\textrm{pos}(x, n) : \sigma ^{(n-1)}x\sigma ^*\)

  • \(\mathrm{pos\_rev}(x, n) : \sigma ^* x^R\sigma ^{(n-1)}\)

  • \(\textrm{len}(=, n) : \sigma ^n\)

  • \(\textrm{len}(<, n) : (\sigma +\lambda )^{n-1}\)

  • \(\textrm{len}(<, n) : \sigma ^?+\sigma ^2+\sigma ^3+\cdots +\sigma ^{n-1}\)

  • \(\textrm{len}(>, n) : \sigma ^{n+1}\sigma ^*\)

  • \(\mathrm{len\_div}(x, m, n) : \sigma ^{n}(\sigma ^m)^*\)

  • \(\mathrm{len\_div}(x, m, n) : (\sigma ^m)^*\sigma ^{n}\)

By applying enumerated strings and integers as arguments, we can collect many regex-logic pairs. Once we discover the initial set of regex-logic pairs, we augment the data by combining the regexes and logic formulas with a regex operator \(+\) and a logical connective \(\vee \), respectively.
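As a small illustration of the template-based collection step, the sketch below instantiates two of the templates above with enumerated arguments; the textual encoding of the logic formulas and the two chosen templates are our own illustrative choices, and the paper’s dataset additionally uses the full template list and the \(+\)/\(\vee \) augmentation.

```python
from itertools import product

SIGMA = ["a", "b"]
SIGMA_RE = "(a+b)"          # sigma as a regex over {a, b}

def strings_up_to(k):
    """All nonempty strings over SIGMA of length at most k."""
    return ["".join(p) for n in range(1, k + 1) for p in product(SIGMA, repeat=n)]

def pos_template(x, n):
    """Template for pos(x, n): sigma^(n-1) x sigma^*  (see the list above)."""
    return SIGMA_RE * (n - 1) + x + SIGMA_RE + "*"

def len_eq_template(n):
    """Template for len(=, n): sigma^n."""
    return SIGMA_RE * n

def generate_pairs(max_len=3, max_int=3):
    """Instantiate a couple of the templates with enumerated arguments,
    yielding (regex, logic-formula) training pairs."""
    for x, n in product(strings_up_to(max_len), range(1, max_int + 1)):
        yield pos_template(x, n), f"pos({x}, {n})"
    for n in range(1, max_int + 1):
        yield len_eq_template(n), f"len(=, {n})"

# e.g. ('(a+b)(a+b)aba(a+b)*', 'pos(aba, 3)') is among the generated pairs.
pairs = list(generate_pairs())
assert ("(a+b)(a+b)aba(a+b)*", "pos(aba, 3)") in pairs
```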

Note that our Regex2NL achieves about 92.3% prediction accuracy on the test set. Among 167 incorrect regex submissions from students, our logical grading module finds 21 logic formulas that are within logic tree edit-distance two from the solution logic formula. Among the remaining 146 regexes, our model predicts 39 logic formulas that actually correspond to the given regexes. In total, the logical grading module and the Regex2Logic model allow us to provide natural language descriptions for 35.9% of the incorrect submissions. Given that most regexes do not have corresponding logic formulas definable by our set of simple declarative logic formulas, as already discussed, we believe that providing ‘easy-to-understand’ NL descriptions for 35.9% of the submissions is already quite useful.

Then, we transform the logic formula predicted by Regex2Logic into a natural language description using heuristic templates. Such templates are easy to write because our logic formulas are already close to natural language. The entire Regex2NL framework can be used not only for feedback on incorrect submissions but also for generating random regex problems: we first generate a random regex by enumeration or from a regex template, then translate it into a natural language description, obtaining a regex-NL pair that can serve as a new problem.

4.6 Feedback Generation

There are natural types of feedback such as binary feedback (correct/wrong), an example, and a natural language-based conceptual hint. Binary feedback is the simplest yet necessary feedback that should be provided to students who submitted regexes. We can also simply generate a counterexample if the submitted regex is not correct. We focus on generating a natural language-based conceptual hint that describes the discrepancy between the desired solution and the submitted solution in an easily understandable manner.

When the submitted regex is not correct, there can be two cases as follows. First, the submitted regex should be slightly revised in order to accept the desired language. In this case, the most desirable feedback may be the way to revise the submitted regex. Second, the submitted regex accepts a semantically different language than the desired language as the student may have misinterpreted the question. Then, we may need to inform the student about the semantic discrepancy between the language described by the submitted regex and the desired regular language in an easily understandable manner.

For the first case, we provide the regex edit sequence between the submitted regex R and a regex \(R'\) that is syntactically closest (with the smallest regex edit-distance) to R while accepting the regular language specified in the problem. For the second case, we suggest the logic edit sequence between the logic formula \(\phi \) corresponding to R and the logic formula \(\tilde{\phi }\) specified in the problem. If the problem asks for a regular language “strings containing the substring abab at least once”, which corresponds to \(\textrm{num}(abab, >, 0)\), and the submitted regex captures a regular language corresponding to \(\textrm{num}(ab, =, 0)\), then we provide the following feedback: “Consider substring abab instead of ab and operator > instead of \(=\).”
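For the second case, a minimal sketch of turning a parameter-wise difference between two atomic formulas into such a hint is given below; the tuple encoding and the exact phrasing are ours, not the paper’s implementation.

```python
def edit_feedback(submitted, solution):
    """Turn a parameter-wise diff of two atomic formulas into a natural language
    hint. Formulas are tuples like ('num', 'abab', '>', 0)."""
    hints = []
    for old, new in zip(submitted[1:], solution[1:]):
        if old == new:
            continue
        if str(new) in {">", "=", "<"}:
            kind = "operator"
        elif isinstance(new, int):
            kind = "number"
        else:
            kind = "substring"
        hints.append(f"{kind} {new} instead of {old}")
    return ("Consider " + " and ".join(hints) + ".") if hints else "The formulas look equivalent."

print(edit_feedback(("num", "ab", "=", 0), ("num", "abab", ">", 0)))
# Consider substring abab instead of ab and operator > instead of =.
```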

4.7 Converting Logic Formulas to NL Descriptions

Table 5. Natural language descriptions of our declarative logic formulas.

Table 5 shows the NL descriptions for each atomic logic formula used in the rule-based translation of logic formulas into NL descriptions. When a logic formula is formed by combining two or more atomic formulas \(\phi _1\) and \(\phi _2\) using logical connectives, we simply combine the corresponding NL descriptions. For example, let \(\textrm{NL}(\phi )\) be the NL description of an atomic logic formula \(\phi \) following the rules in Table 5. Then, \(\textrm{NL}(\phi _1 \wedge \phi _2)\) is defined as ‘The set of strings that satisfy the following conditions: “\(\textrm{NL}(\phi _1)\)” and “\(\textrm{NL}(\phi _2)\)”’.
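A sketch of this combination rule is given below, with a toy stand-in for the Table 5 lookup; the representation, function names, and example descriptions are ours, not verbatim from Table 5.

```python
def nl(formula, atomic_nl):
    """Rule-based NL rendering of a (possibly compound) logic formula.

    formula   -- either ('atom', phi) or ('and'/'or', [sub-formulas])
    atomic_nl -- callable mapping an atomic formula to its Table-5-style description
    """
    kind, body = formula
    if kind == "atom":
        return atomic_nl(body)
    connective = " and " if kind == "and" else " or "
    parts = connective.join(f"'{nl(sub, atomic_nl)}'" for sub in body)
    return "The set of strings that satisfy the following conditions: " + parts

# Example with a toy lookup table:
table5 = {("num", "a", "=", 0): "there is no a",
          ("num", "b", ">", 0): "there is at least one b"}
phi = ("and", [("atom", ("num", "a", "=", 0)), ("atom", ("num", "b", ">", 0))])
print(nl(phi, table5.get))
# The set of strings that satisfy the following conditions: 'there is no a' and 'there is at least one b'
```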

Using this, we can also present a regex in a more concise form even when the submitted regex is correct. Let us consider the problem ‘all runs of a’s have lengths that are multiples of three’. Note that the regex \((aaa+b)^*\) is a possible solution. If a student submits \((aaa+b^*)^* + b^*\), then the system should give the full grade since the submitted regex recognizes the desired regular language. While assigning the full grade to the submission, our algorithm also provides \((aaa+b)^*\) to the student by computing the concise normal form [11] of the submission so that the student can see that there is a better solution (in terms of syntactic conciseness).

5 Experiments

We recruited 20 undergraduate students who were taking or had taken an automata course at the time of our research, and ran our automatic grading algorithm on their regex submissions for ten exercises selected from famous automata textbooks [9, 14, 18]. In order to compare the results of automated grading with previous approaches, including RegED [10] and AT v3 [5], we implemented those algorithms ourselves in Python 3 and used them for comparison; we could not use the existing implementations directly because they do not support adjusting the maximum number of allowed edits and not all of them are available as tools. We utilized the Python 3 port of the FAdo [1] package, an open-source library for the symbolic manipulation of automata and other models of computation. We also restricted the number of edits allowed for partial grades to two in our algorithm and AT v3, and to one in RegED, since RegED applies edits to both solutions and submissions.

Table 6. Performance comparisons of the proposed grading algorithm with baseline algorithms proposed in previous works [5, 10].
Table 7. Grading and feedback examples generated by our regex grading algorithm for problems in Table 1. We denote \(a+b\) by \(\sigma \) for brevity.

5.1 Main Results

Table 6 shows the experimental results in terms of the statistics of grading results. We present the ratio of submissions that received partial grades from the considered grading algorithms in the ‘Partial Total’ column. The ‘Partial \(G_\textrm{syn}\)’ column shows the ratio of regexes that received a partial ‘syntactic grade’ from AT v3, RegED, and our syntactic grading algorithm over all regexes. Since AT v3 and RegED only consider syntactic grading, the values in this column are also the ratios of regexes that received any partial grade for those systems. On the other hand, the ‘Partial \(G_\textrm{log}\)’ column shows the ratio of regexes that received a partial ‘logical grade’ from our algorithm over all regexes. It can be seen that AT v3 and RegED fail to assign partial grades to some regexes as they only consider syntactic differences with solution regexes, not the logic formulas behind the problem descriptions. Note that a higher ratio of partial grades does not always mean that the grades are ‘well-deserved’; what matters is whether each partial grade is convincing. We explain in the following section why RegED gives more partial grades than ours and why giving more partial grades is not necessarily a good choice.

To put it briefly, RegED gives partial grades to more regexes (45.3%) than AT v3 (30.2%) and even ours (40.7%). Table 7 shows several examples of the grades and feedback examples for students’ submissions to the five problems in Table 1.

5.2 Validity of Grading Results

In order to verify that our algorithm indeed assigns partial grades to submissions that are ‘well-deserved’, we provide several reasons.

First, we can find logical partial grades while AT v3 and RegED cannot. We demonstrate two examples of this case. For the problem with the description ‘even number of a’s’, our algorithm assigns a partial grade to the submission \((a+ba^*b)^*\) while a possible solution is \((b+ab^*a)^*\). Our logical grading module gives a partial grade, as it is possible that the student made a simple mistake of confusing a with b. For the problem ‘contains at most three a’s’, our algorithm assigns a partial grade to \(b^*(a+\lambda )b^*(a+\lambda )b^*(a+\lambda )b^*(a+\lambda )b^*\) while one of the possible solutions is \(b^*(a+\lambda )b^*(a+\lambda )b^*(a+\lambda )b^*\). This is again possible thanks to our logical grading module, as the student could have confused the numbers.

Second, our syntactic grading gives partial grades via tree edits in cases where the other approaches cannot. For example, our syntactic grading gives a partial grade to \((b^*a^*)abab(b^*a^*)\) for the problem ‘contains the substring abab’ as we may insert two star operators for the occurrences of \((b^*a^*)\). However, RegED and AT v3 will not assign a partial grade if they are provided \((a +b)^*abab(a+b)^*\) and \((b+a)^*abab(b+a)^*\) as possible solutions, whereas our algorithm uses the logic formula as the solution. This is because RegED utilizes only one solution regex for comparison with the submitted regex and allows edits on both the solution and the submitted regex. RegED performs one edit on the solution regex and one on the submitted regex to improve speed, but if the solution regex is not given in an ideal form, as in the above example, RegED cannot grade properly. To overcome this, all possible variants of the solution regex would have to be considered for editing and comparison, which is significantly time-consuming. In contrast, our regex grading can compare against every possible candidate without additional cost, as it uses the logic formula as the solution and permits edits only on the submitted regex.

Third, the string edit used by RegED tends to cover far more candidates than our tree edit. For instance, a single string edit can change \(a+b+c\) into \(a^*b+c\) or \(aab+c\). Although this may depend on the TA’s point of view, we believe that edits should be applied more strictly, respecting the tree structure that is an inherent property of regexes. Since string edits are more permissive than tree edits, they cover changes that are unlikely to be the intended ones, which suggests that assigning more partial grades is not always the right direction.

5.3 Comparison with TA Partial Grade

Table 8. Evaluation for the similarity with TA partial grades.

Table 8 demonstrates how well the grading results of the algorithms align with the human TAs’ grading results. We asked five human TAs to grade 167 incorrect regex submissions by students. First, we calculate the precision, recall, and F1 score for each algorithm and each TA, where precision is the percentage of partial grades given by the algorithm that match the TA’s and recall is the percentage of the TA’s partial grades that the algorithm agrees with. We then average the scores over the five TAs. Since correct submissions should always receive full marks, we only consider incorrect submissions and check whether or not the human TAs gave them partial grades. In other words, we assume that human TAs always make the right decisions in terms of giving partial grades to incorrect submissions and regard the cases where partial grades are given as positive cases. The ‘Precision’ column indicates how carefully an algorithm selects submissions that deserve partial grades, and the ‘Recall’ column indicates how rarely it misses such cases.
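For concreteness, the per-TA agreement scores can be computed as in the following sketch (the scores are then averaged over the five TAs as described); the sets of submission ids are hypothetical inputs, and TA decisions are treated as ground truth.

```python
def precision_recall_f1(algo_partial, ta_partial):
    """Agreement between an algorithm and one TA over incorrect submissions.

    algo_partial, ta_partial -- sets of submission ids that received a partial
    grade from the algorithm and from the TA, respectively."""
    tp = len(algo_partial & ta_partial)
    precision = tp / len(algo_partial) if algo_partial else 0.0
    recall = tp / len(ta_partial) if ta_partial else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the algorithm grants partial credit to 4 submissions, 3 of which
# the TA also granted, out of 5 TA-granted submissions in total.
print(precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6}))   # (0.75, 0.6, 0.666...)
```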

Overall, our grading algorithm shows the best performance in terms of the F1 score, which is the harmonic mean of precision and recall. RegED is placed second with a tiny gap behind our algorithm, and AT v3 follows.

Intuitively, it is natural that the recall is highest for RegED as RegED covers more regexes than the other compared algorithms. The high precision of our logical grading module shows that the submissions it selects for partial grades are quite precise, even compared with the other modules used in our algorithm. However, the logical grading fails to capture many of the regexes that received partial grades from the TAs. On the other hand, the syntactic grading captures many more regexes that received partial grades from the TAs than the other modules in our algorithm. This also shows that human TAs tend to give partial grades to submissions with syntactic mistakes rather than to submissions with logical mistakes.

5.4 Effectiveness of the Regex Reverse Trick

We demonstrate the effectiveness of the reverse trick in terms of runtime reduction in Fig. 3. There is no noticeable difference for short regexes; however, the gap grows rapidly (note the logarithmic scale) as the length of the regex increases.

Fig. 3. Runtime comparison with and without the reverse trick. \(s_n\) and \(c_n\) indicate problems corresponding to logic formulas \(\mathrm{pos\_rev}(a, n)\) and \(\mathrm{pos\_rev}(a, n) \wedge \textrm{num}(bba, >, 0)\), respectively.

5.5 User Study

In Fig. 4, we provide a screenshot of a web page for the online ‘Regex Trainer’ in which our regex grading algorithm is employed. On the online Regex Trainer page, the system displays each regex construction problem in turn to a student. When the student submits an answer to the problem, the system shows the grade with feedback and then displays the next problem.

Fig. 4. A screenshot of the web page of the online ‘Regex Trainer’, which uses our automatic grading module.

We conducted a user study by asking five questions to nine students who had tested the usability and usefulness of our regex grading algorithm. The result is shown in Table 9. Each student was asked to answer each question on a Likert scale from 1 (strongly disagree) to 5 (strongly agree). The results show that the average scores for the five questions all lie in the range [3.7, 4.4], which implies that the students in general found our grading system easy to use and useful for studying regexes.

Table 9. Student survey result. Nine students gave their judgments for the following five questions on a Likert scale from 1 to 5.

5.6 Limitations

In the following, we list the limitations of our study. First, the proposed set of logic formulas cannot express the entire class of regular languages. In future work, we may extend the set by adding useful logic formulas that are suitable for potential regex construction problems. Second, there could be other types of students’ mistakes to catch. We currently propose three partial grades that capture syntactic, logical, and corner-case mistakes; identifying new categories of mistakes could provide richer and more detailed feedback for students. Moreover, our grading algorithm is likely to take too much time when the submitted regex is unnecessarily long, since the number of regexes that should be examined grows exponentially in that case.

6 Conclusions

Due to the transition from face-to-face teaching to online distance learning, the importance of developing automated grading systems has become more evident. We have presented an efficient and powerful automated grading algorithm for regexes in undergraduate automata and formal language courses. Our algorithm takes students’ regex submissions and assigns appropriate grades with productive feedback by considering the syntactic and semantic alignment between the submitted regexes and the problem definition. Moreover, by employing several heuristics such as the reverse trick and intermediate regex simplification, we reduce the runtime of the repetitive regex equivalence tests required for grading.