Matching Regular Expressions on Uncertain Data

In this paper we study regular expression matching in cases in which the identity of the symbols received is subject to uncertainty. We develop a model of symbol emission and use a modification of the shortest path algorithm to find optimal matches on the Cartesian graph of an expression, provided that the input is a finite list. In the case of infinite streams, we show that the problem is in general undecidable but, if each symbol is received with probability 0 infinitely often, then with probability 1 the problem is decidable.


Introduction
Regular expressions are a useful and compact formalism to express regular languages, and are frequently used in text-based applications such as text retrieval, query languages, or computational genetics. Approximate string matching is one of the classical problems in this area [1]. Given a text of length n, a pattern of length m and a number k of errors allowed, we want to find all the sub-strings in the text that match the pattern with at most k errors. If the text is not known in advance (viz., if the algorithm must work on-line, without pre-processing the text), then dynamic programming can provide a solution of complexity O(mn) [18,26], while improved algorithms can run in O(kn) [10,31,32].
Regular expressions can be used as pattern detectors in more general situations, such as activity detection [5]. In this context, the approximation problem takes a new form: the problem is not just matching despite the absence of expected symbols or the presence of spurious ones. The problem is that, in many applications, the identity of the symbols received is uncertain, and known only probabilistically. That is, at each input position, rather than having a symbol drawn from an alphabet Σ, we have a probability distribution on Σ. The problem, in this case, is to find the most likely sequence of symbols that matches the expression.
In this paper, we present algorithms to solve this problem, and we study their properties, both for matching sub-strings of finite strings and of infinite streams.
Our matching model is in some measure related to Markov models used for sequence alignment, a technique quite common in bioinformatics [16]. In particular, our model bears some resemblance to Profile Hidden Markov Models (PHMM: Markov models with states representing symbol insertion and symbol deletion) for multiple alignments of sequences [8,27]. In both PHMM and our algorithms, matching can be seen as traversing a maximal path with additive logarithmic weights. PHMM have been developed to align sequences with gaps and insertions; it should in principle be possible to extend them to matching regular expressions, but the derivation of a PHMM from an expression appears to be quite complex.
Weighted automata [7] have also been used for problems related to ours. As a matter of fact, the Cartesian graph, which we use in this paper, can be seen as an equivalent formalism and as an implementation of matching using weighted automata. Graphs provide a more direct implementation and a simple instrument for studying the properties of the methods.
Early work on infinite streams has generally focused on the recognition of the whole infinite sequence (ω-word): an ω-word is accepted if the automaton can read it while going through a sequence of states in which some final state occurs infinitely often (Büchi acceptance, [28,29]), an approach that has been extended to infinite trees [21,22]. The problem that we are considering here is different in that we are trying to match finite sub-words of an infinite word. This problem, without dealing with uncertainty, was considered in [25].
Matching with uncertain symbols, the problem that we are considering here, is gaining prominence in fields in which uncertainty in the data is the norm due to the imprecision of detection algorithms. The detection of complex audio or video events is an example. Some attempts at the definition of high-level languages for video events were made in the 1990s using temporal logic [6], Petri Nets [11] or semi-orders [2]; they had little impact at the time due to the relative immaturity of detection techniques and to the paucity of video data sets available.
With the progress of detection techniques and the availability of more data to train sophisticated classifiers, things have begun to change, and researchers "have started working on complex video event detection from videos taken from unconstrained environment (sic), where the video events consist of a long sequence of actions and interactions that lasts tens of seconds to several minutes" [17]. These new possibilities open up opportunities for video event detection but also new semantic problems [12,19,20].
In this new scenario, researchers have begun to explore complex event languages. Francois et al. [9] define complex events from simple ones using an event algebra with operations such as sequence, iteration, and alternation. In [15] and [23] stochastic context-free grammars are used, while in [13] event models are defined using case frames. As in other cases, these systems assume that different events are separated (no event is part of another one) and that their length is known, thus eschewing the length bias and the decidability problems that figure prominently in this paper.
In our model, we consider the alphabet symbols as elementary events that the system can recognize (we assume that there is a finite number of them) and whose detection is subject to uncertainty, so that the uncertainty of event detection translates to an uncertainty over which symbol is present in input. We assume that the information that we have can be represented as a stochastic observation process ν, where ν[k](a) is the probability that the alphabet symbol a is the kth symbol of the input sequence.
Within this general framework, we consider the following problems:

Finite estimation: we consider a finite sequence of uncertain input symbols (that is, a finite stochastic process on Σ), called the observation. Assuming that at least one sub-string of the sequence matches the expression, which is the most likely matching sub-string given the observation?

Finite matching: given a finite number of observations, what is the probability that at least one sub-string matches the expression?

Infinite estimation: we show that, in general, estimation is undecidable on infinite streams. However, if for each symbol the probability of observing it is zero infinitely often, then with probability one estimation can be decided in finite time.
The paper is organized as follows. In Sect. 2 we review a few facts about regular expressions in order to establish the language and the basic facts that we shall use in the rest of the paper. In Sect. 3 we present a matching algorithm based on the Cartesian graph; although this algorithm is equivalent to standard NFA algorithms, it provides a more convenient formalism to discuss the extension to uncertain data. In Sect. 4 we present our model of uncertainty, modeling it as the emission of an unobservable string on a noisy channel. Section 5 presents the algorithm for finite estimation, while in Sect. 6 we present the algorithm for finite matching. Section 7 proves the properties of matching algorithms on infinite streams, while Sect. 8 draws some conclusions.

Some facts about regular expressions
We present here a brief review of some relevant facts about regular expressions, limited to what we shall use in the remainder of the paper. The interested reader may find more detailed information in the many papers and texts on the subject [3,14].
Let Σ be a finite set of symbols, which we call the alphabet. We shall denote with Σ* the set of finite sequences of symbols of Σ, including the empty string ε. A word, or string, on Σ is an element a_0 ⋯ a_{L−1} ∈ Σ*. We indicate with |ω| the number of symbols of the string ω. String concatenation will be indicated by juxtaposition of symbols. Ranges of ω will be indicated using pairs of indices in square brackets, that is, ω[i:j] = a_i ⋯ a_{j−1}. Syntactically, the regular expressions that we use in this paper are standard, built from the symbols a ∈ Σ and from ε and η using concatenation, alternation, and the Kleene star. The symbol ε represents the expression that only generates the empty string, while the symbol η is the expression that doesn't generate any string. Given an expression φ, its length |φ| is the number of symbols it contains. Our semantics is derived from the standard semantics for ω ⊨ φ [14]. The language generated by φ, L(φ), is defined as L(φ) = {ω | ω ∈ Σ* ∧ ω ⊨ φ}. Note that L(ε) = {ε} and L(η) = ∅. Two expressions are equivalent if they generate the same language. The recognition problem for regular expressions can be defined as follows: given an expression φ on an alphabet Σ and a string ω ∈ Σ*, is it the case that ω ∈ L(φ) (or, equivalently, that ω ⊨ φ)? If the answer is yes, we say that φ recognizes ω.
One important aspect of regular expressions is their connection with finite state automata.

Definition 1 A (nondeterministic) finite state automaton (NFA) is a 5-tuple A = (Q, Σ, q_0, F, δ), where Q is a finite set of states, Σ is the input alphabet, q_0 ∈ Q is the initial state, F ⊆ Q is the set of final states, and δ ⊆ Q × (Σ ∪ {ε}) × Q is the state transition relation.

In the following, we shall mostly restrict our attention to a class of NFA that we call simple. An NFA is simple if it doesn't have multiple transitions between pairs of states, except possibly for the presence of ε-transitions; that is, we never have two arcs labeled with different symbols of Σ between the same pair of states. Formally: A is simple if for all q, q′ ∈ Q and all a, a′ ∈ Σ, a, a′ ≠ ε, if δ(q, a, q′) and δ(q, a′, q′) then a = a′.
It is easy to transform an NFA into simple form: for each multiple arc from q to q′ and for each symbol a on that arc, one creates a new state q_a connected to q′ with an ε-transition, and connects q to q_a with an arc labeled a. That is, if δ contains a subset δ″ which violates the condition, this subset is eliminated from δ and replaced with δ′ = {δ(q, a_1, q_{a_1}), …, δ(q, a_k, q_{a_k}), δ(q_{a_1}, ε, q′), …, δ(q_{a_k}, ε, q′)}.
It is easy to see that the NFA with transitions (δ \ δ″) ∪ δ′ is simple and equivalent to the original one. Note that the most common algorithms for building an NFA from an expression φ, such as Thompson's [30], create simple automata.
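The transformation above can be sketched in a few lines. The representation (a set of (state, symbol, state) triples, with the empty string standing for ε) and the helper name make_simple are our own conventions, not the paper's:

```python
from collections import defaultdict

EPS = ""  # our convention: the empty string stands for an epsilon-label

def make_simple(delta):
    """Return an equivalent transition relation with no multiple non-epsilon
    arcs between the same pair of states: every symbol a on an offending arc
    q -> q2 is re-routed as q --a--> q_a --eps--> q2 through a fresh state
    q_a (here named by the triple itself)."""
    arcs = defaultdict(list)            # (q, q2) -> non-eps symbols on that arc
    for (q, a, q2) in delta:
        if a != EPS:
            arcs[(q, q2)].append(a)
    new_delta = set(delta)
    for (q, q2), symbols in arcs.items():
        if len(symbols) > 1:            # simplicity is violated on this arc
            for a in symbols:
                qa = (q, a, q2)         # fresh intermediate state
                new_delta.discard((q, a, q2))
                new_delta.add((q, a, qa))
                new_delta.add((qa, EPS, q2))
    return new_delta
```

For k = 2, the two parallel arcs from q to q′ become four transitions, matching the replacement set δ′ above.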

Matching as path finding
The matching algorithm that we use in this paper is a modification of a method known as the Cartesian graph (also known as the DB-Graph) [24]. Let A = (Q, Σ, q_0, F, δ) be the (nondeterministic) automaton that recognizes a regular expression φ, and let ω = a_0 ⋯ a_{L−1} be a finite string of length L. We build the Cartesian graph C(φ, ω) = (V, E) as follows: (i) V is the set of pairs (q, k) with q ∈ Q and k ∈ [0, …, L]; (ii) E contains an edge from (q, k) to (q′, k+1) whenever δ(q, a_k, q′), and an edge from (q, k) to (q′, k) whenever there is an ε-transition between q and q′, that is, δ(q, ε, q′).
In order to simplify the representation, in the figures we shall indicate the vertex (q_i, k) as q_i^k. Recognition using the graph is based on the following result: ω ⊨ φ if and only if there is a path in C(φ, ω) from (q_0, 0) to a vertex (q, L) with q ∈ F.

Proof ω ⊨ φ iff the automaton has an accepting run, that is, a sequence of states q_0 q_1 ⋯ q_n such that δ(q_{i−1}, a_{i−1}, q_i) and q_n ∈ F. It is immediate to see from the definition of the graph that such a run exists iff there is a path (q_0, 0) → ⋯ → (q_n, L) in C(φ, ω) (with L ≤ n, since ε-transitions do not consume input symbols).
In many cases we shall be interested in determining whether there is a sub-string ω[i:j] of ω that matches φ. To this end, it is easy to verify the following result: ω[i:j] ⊨ φ if and only if there is a path in C(φ, ω) from (q_0, i) to a vertex (q, j) with q ∈ F.

Example I: Consider an expression φ recognized by an NFA with states q_0, …, q_4 (q_4 final), and the string ω = abaababb. The graph C(φ, ω) is shown in the figure. The double edges show a path from q_0^0 to q_4^8 corresponding to the accepting run q_0 q_3 q_2 q_0 q_3 q_2 q_0 q_1 q_4 q_4, which shows that the string matches the expression. Note that the sub-string ω[3:7] = abab also matches the expression, corresponding to the path q_0^3 → q_3^4 → q_2^5 → q_0^6 → q_1^6 → q_4^7.
Example II: Consider the same expression and the string ω = aba; the graph is shown in the figure. The graph has no path from q_0^0 to q_4^3, indicating that the string doesn't match the expression. However, there is a path from q_0^0 to q_4^2, indicating that the sub-string ω[0:2] = ab does match the expression.
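The sub-string criterion can be checked directly by breadth-first search on C(φ, ω). The sketch below uses our own encoding (δ as a set of triples, ε as the empty string) and returns all end positions j such that ω[i:j] is accepted starting from position i:

```python
from collections import deque

def cartesian_match(delta, q0, F, omega, i=0):
    """BFS on the Cartesian graph C(phi, omega): vertices are pairs (q, k).
    Returns the sorted list of positions j such that some (qf, j), qf in F,
    is reachable from (q0, i), i.e. such that omega[i:j] is accepted."""
    EPS = ""
    seen = {(q0, i)}
    queue = deque(seen)
    while queue:
        q, k = queue.popleft()
        for (p, a, p2) in delta:
            if p != q:
                continue
            if a == EPS:                          # eps-edge: (q,k) -> (p2,k)
                nxt = (p2, k)
            elif k < len(omega) and a == omega[k]:
                nxt = (p2, k + 1)                 # forward edge: consume omega[k]
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(j for (q, j) in seen if q in F)
```

For instance, with a (hypothetical) two-state NFA for the expression ab* (δ(0, a, 1), δ(1, b, 1), F = {1}), the string abb matches at every prefix end position 1, 2, 3, while ba has no matching prefix from position 0.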

The uncertainty model
We consider the probabilistic model of string production and detection shown schematically in Fig. 1.
The module M emits a string ω = a_0 ⋯ a_{L−1} ∈ Σ*. In many cases of practical interest, the elements ω[k] are not emitted independently. Rather, the fact that ω[k] = a_k skews the probability distribution of ω[k+1]. Correspondingly, we assume that M is a Markov chain with transition probabilities τ(a|b), a, b ∈ Σ. In this case, τ(a_i|a_{i−1}) is the conditional probability distribution of the ith element of ω. In order to simplify the equations that follow, we formally define τ(a_0|a_{−1}) = τ(a_0), the a priori probability that the first symbol of the chain is a_0.

The channel N introduces some noise so that, when the symbol ω[k] = a is emitted, we observe a probability distribution ν[k] over Σ. This process is fed to the recognition algorithm, which determines the most likely interpretation of ν that matches the expression. The values ν[k](a) are the observations on which we base the estimation, and constitute, together with the transition probabilities τ(a|b), the input of the problem.

Suppose that a string ω is produced by the module M, that the transition probabilities τ are known a priori, and that the stochastic process ν is observed. The string ω is, of course, unobservable. We are interested in two problems: finite estimation (assuming that there is at least one sub-string ω[i:j] of ω such that ω[i:j] ⊨ φ, which is the most likely matching sub-string?) and finite matching (can we determine, with a prescribed confidence, whether there is at least one such sub-string?). The solution of the second problem can be based on the solution of the first, to which we now turn.
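As an illustration, the emission model of Fig. 1 can be simulated as follows. The uniform-spread channel and all parameter values are our own toy assumptions; the paper only requires that each ν[k] be a distribution over Σ:

```python
import random

def emit(tau0, tau, L, rng):
    """Sample a string of length L from the Markov source M:
    tau0[a] = tau(a) is the a-priori distribution of the first symbol,
    tau[b][a] = tau(a|b) the transition probability of the chain."""
    word = [rng.choices(list(tau0), weights=list(tau0.values()))[0]]
    for _ in range(L - 1):
        row = tau[word[-1]]
        word.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(word)

def observe(word, c, alphabet):
    """Noisy channel N (toy model): the emitted symbol keeps probability c
    and the remaining mass is spread uniformly over the other symbols,
    giving the observation process nu, with nu[k][a] = nu[k](a)."""
    spread = (1 - c) / (len(alphabet) - 1)
    return [{a: (c if a == w else spread) for a in alphabet} for w in word]
```

The recognition algorithms below see only the output of observe (the process ν), never the word itself.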

Finite estimation
Given that the module M emits a string ω of length L, in this section we are interested in finding the most likely substring ω[ i : j ] that matches φ.
When we match substrings, we are trying to match φ with strings of different lengths, and this entails that we must compensate for a bias towards shorter strings. The a posteriori probability of a string ω is given by the product of the probabilities of its constituent symbols. These probabilities, in general, will be composed of two terms: a probability that ω[k] was in the string given the observations ν[k], and the probability τ(ω[k]|ω[k−1]) that the symbol ω[k] was generated. Both these terms have values in [0, 1], and so does their product. This means that the a posteriori probability of ω[i:j] is the product of (j − i) terms smaller than one. That is, ceteris paribus, a shorter string, being the product of a smaller number of terms, will have a higher probability and will therefore be chosen.
We avoid this bias by considering the information carried by a string. If we have no a priori information on the string that is produced, its being revealed to us would carry an information ι(ω) = − log P(ω). If we have observed the process ν, we already possess some information about the string, and its being revealed to us would give us an information ι_ν(ω) = − log P(ω|ν) ≤ ι(ω). The information that the process ν gives us about the string ω is the difference of these two values:

I(ω, ν) = ι(ω) − ι_ν(ω) = log [P(ω|ν) / P(ω)].   (12)

Given ν, we search for the string ω that maximizes I(ω, ν). The term P(ω) at the denominator (which comes from considering the a priori information ι(ω)) avoids the bias toward shorter strings. Given two strings ω_1 and ω_2 with the same a posteriori probability and ω_2 longer than ω_1, ω_2 will be selected, since P(ω_2) < P(ω_1). The rationale here is that longer strings are less likely to be emitted by chance, so if we have equal evidence to support the hypothesis that either ω_1 or ω_2 was emitted, it is reasonable to select ω_2.
In order to compute I(ω, ν), we begin by computing P(ω|ν). Let ω = a_0 ⋯ a_{L−1}. Then

P(ω|ν) = P(ω[L−1] | ω[0:L−1], ν) P(ω[0:L−1] | ν) = P(ω[L−1] | ω[0:L−1], ν[L−1]) P(ω[0:L−1] | ν[0:L−1]).   (13)

The last equality reflects the fact that ω[L−1] only depends on the observations at time L−1. We make the hypothesis that the conditional probability of a_k occurring in position t, conditioned on ν[t], depends only on the observation on a_k at step t, that is, P(ω[t] = a_k | ν[t]) = ν[t](a_k); this is tantamount to considering that our observations are complete: the value ν[t](a) gives us all the information available on a. With this hypothesis we have

P(ω[L−1] | ω[0:L−1], ν[L−1]) = ν[L−1](a_{L−1}) τ(a_{L−1}|a_{L−2}) / P(a_{L−1}).   (*)

The equality (*) depends on two properties: first, the measure on ω[L−1] does not depend on the previous values of ω and, second, on the Markov property of the source. Putting this result in (13) and working out the recursion, we have

P(ω|ν) = [∏_{k=0}^{L−1} ν[k](a_k) τ(a_k|a_{k−1})] / P_ν(ω),   (19)

where P_ν(ω) = ∏_{k=0}^{L−1} P(a_k) is the a priori probability of observing ω. Substituting (19) in (12), the criterion that we want to maximize is

I(ω, ν) = Σ_{k=0}^{L−1} [log ν[k](a_k) + log τ(a_k|a_{k−1})] − log P_ν(ω) − log P(ω).

The a priori probabilities P_ν(ω) and P(ω) will be estimated assuming that no a priori information is available, that is, assuming a uniform distribution P_ν(ω) = P(ω) = |Σ|^{−L}, leading to

I(ω, ν) = Σ_{k=0}^{L−1} [log ν[k](a_k) + log τ(a_k|a_{k−1})] + L log |Σ|².

We are interested not only in detecting maximum-length strings, but in detecting substrings ω[i:j] as well. To this end, we define the partial information difference

L(ω[i:j]) = Σ_{k=i}^{j−1} [log ν[k](a_k) + log τ(a_k|a_{k−1})] + (j−i) log |Σ|².

The last term highlights the effect of considering prior information: (j − i) log |Σ|² is the bias that, all else being equal, favors the detection of longer strings.
Our problem can therefore be expressed as finding the sub-string

ω[î:ĵ] = arg max { L(ω[i:j]) | ω[i:j] ⊨ φ }.

Two simplified cases are of importance in applications. The first is when the generation of the symbols has no temporal dependence, in which case τ(a_k|a_{k−1}) = τ(a_k) and

L(ω[i:j]) = Σ_{k=i}^{j−1} [log ν[k](a_k) + log τ(a_k)] + (j−i) log |Σ|²,   (25)

and the second is when the symbols are generated with uniform a priori probability, in which case τ(a_k) = 1/|Σ| and

L(ω[i:j]) = Σ_{k=i}^{j−1} log (|Σ| ν[k](a_k)).   (26)

Finding the string that maximizes L is the basis on which we define several forms of matching.
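On toy instances, the maximization of (26) can be checked by brute force: enumerate every candidate string and offset, keep those matching the expression (here delegated to Python's re module as a stand-in for the NFA), and score each with L, using base-2 logarithms. This exhaustive sketch is exponential and serves only as a reference; all names are ours:

```python
import itertools
import math
import re

def best_match(nu, alphabet, pattern):
    """Brute-force finite estimation under the uniform, independent model:
    among all non-empty strings w and offsets i such that w matches the
    (Python-re) pattern, maximize L = sum_k log2(|Sigma| * nu[i+k](w[k])).
    Returns the triple (L, i, w)."""
    best = (-math.inf, None, None)
    n = len(nu)
    for i in range(n):
        for m in range(1, n - i + 1):
            for w in itertools.product(alphabet, repeat=m):
                w = "".join(w)
                if not re.fullmatch(pattern, w):
                    continue             # w is not in L(phi)
                L = sum(math.log2(len(alphabet) * nu[i + k][w[k]])
                        for k in range(m))
                if L > best[0]:
                    best = (L, i, w)
    return best
```

Such an oracle is useful mainly to validate the optimized graph-based algorithm of the next section on tiny inputs.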

Definition 3 Given the string ω and the expression φ, we say that the sub-string ω[i:j] β-matches φ if ω[i:j] ⊨ φ and L(ω[i:j]) = β.

Matching is defined as an optimality criterion over β-matchings. We use two such criteria: the first (weakly optimal) restricts optimality to continuations of a string, while the second (strongly optimal) extends it to all matching sub-strings.

Definition 4 Given the string ω and the expression φ, ω weakly-optimally matches φ.

Definition 5 Given the string ω and the expression φ, ω strongly-optimally matches φ.

The following property is obvious from the definitions.

Fig. 2 The function mstr that builds the string ω[π] associated to a path π in the Cartesian graph C; if π is not a path in C, the function returns the empty string.

Matching method
We match the expression to uncertain data using a modification of the Cartesian graph. Let A = (Q, Σ, q_0, F, δ) be the NFA that recognizes the expression φ, ν the observed process of length n and, for i = 0, …, n−1 and a ∈ Σ, let ν[i](a) be given. We shall assume that A is simple. The modified Cartesian graph G is a weighted graph with the same vertices as before and with: (ii.a) an edge from (q, k) to (q′, k+1) for each a ∈ Σ with δ(q, a, q′) and ν[k](a) > 0; (ii.b) an edge from (q, k) to (q′, k) whenever δ(q, ε, q′). For such an edge, we set σ[(u, v)] = a ∈ Σ if ii.a applies, and σ[(u, v)] = ε if ii.b applies; that is, given an edge e, σ[e] is the symbol that causes e to be crossed.
In order to use the graph to find L-matches, we need a way to associate possible strings (viz., strings with non-zero probability) to paths in the graph. Given the path π = [π_0, …, π_n] with π_k = (s_k, h), h ≤ k, we build the string ω[π] by applying the function mstr of Fig. 2.

Lemma 2 If the NFA is simple then for each path π , mstr(C, π) is unique.
This lemma is a consequence of the fact that, if the NFA is simple, for each edge there is only one a ∈ Σ that causes it to be traversed, that is, σ[e] is a well-defined function.
Proof Let C be the Cartesian graph (without uncertainty) generated by ω on φ. Let ((q, t), (q′, t+1)) be an edge of C caused by a_t ∈ Σ. By Lemma 3, ν[t](a_t) > 0, so the edge will also be an edge of G; that is, C is a subgraph of G.
Finding the optimal match to the expression is akin to finding the shortest path on a weighted graph, with some modifications. In a typical shortest path algorithm, each edge (u, v) has a weight w(u, v) and, given a path π = [u_0, u_1, …, u_n], the weight of the path is the sum of the weights of its edges, that is, w[π] = Σ_{i=0}^{n−1} w(u_i, u_{i+1}). Moreover, each vertex u has an associated distance value d[u]. When the vertex is analyzed, its in-edges (v_1, u), …, (v_n, u) are analyzed, and the distance value of u is updated as d[u] = min_i (d[v_i] + w(v_i, u)).

In our case, rather than with a weight w, we mark each edge with a pair (a, ν[·](a)), where a ∈ Σ is the symbol that causes that edge to be crossed, and ν[·](a) is the probability that the symbol emitted at that particular step was a. We look for the most probable path, that is, we are trying to maximize, rather than minimize, a suitable (additive) function of the weights of the graph. We also have, with respect to the standard algorithm, a complication due to the conditional probabilities of the Markov chain: in the general case, the estimation that we have to maximize for the node u depends not only on the in-neighbors v_1, …, v_n, but also on the labels of the edges of the optimal paths that enter the nodes v_1, …, v_n. To clarify this point, given a node u, let L[u] be the estimate of the criterion that we are maximizing for the paths through u. Suppose we are evaluating a path entering u whose last two edges are labeled b (entering v) and a (from v to u). The estimation L[u] for this path is given by L[u] = L[v] + log ν[·](a) + log τ(a|b) + log |Σ|². So, in order to update the estimate L[u], we need to look one edge further back than we would for a normal path-finding algorithm. In general, given the edges entering u, the update is L[u] = max_i (L[v_i] + log ν[·](a_i) + log τ(a_i|b_i) + log |Σ|²), where b_i is the label of the last edge of the optimal path entering v_i. This complication is not present if the symbols are generated independently, that is, if we are optimizing (25). It would not be hard to adapt one of the standard shortest path algorithms to work for this case, but it is more efficient to take advantage of the structure of the Cartesian graph.
In the graph, we have only two kinds of edges: forward edges (q, k) → (q′, k+1) and ε-edges, corresponding to the ε-transitions, (q, k) → (q′, k); ε-edges can be traversed at any time without changing the value of the objective function.
The algorithm is composed of two parts: the first is the top-level function match that receives the NFA for the expression and the observations ν, builds the graph C(φ, ν) with the edges marked by the probability of having received the corresponding symbol, and manages the traversal of forward edges. The second is a local relaxation function that checks whether the criterion estimation of some nodes can be improved by traversing some ε-edges. This function also checks whether it is convenient to start a new path: if the state q_0 has a negative estimation, then its estimation is set to 0 and a new path is started. Table 1 shows the symbols used in the algorithm. The function relax is shown in Fig. 3. The auxiliary function L (Fig. 4) receives a pair of states q and q′ and a time t, and determines the value L[q, t] resulting from arriving at (q, t) from (q′, t−1). If there is no edge between the two states, the function returns −∞.
The main function of the algorithm is shown in Fig. 5. The algorithm proceeds timewise from the first symbol received to the last. The main loop adjusts the objective taking into account the edges from the previous time step, and then calls the function relax to take into account -edges and possible re-initializations of the path.
The algorithm returns the node that maximizes L[q, k]. The loops of steps 9 and 10 go through the nodes of the graph and, for each node (q, k), steps 11 and 13 choose the predecessor (q′, k−1) that maximizes L[q, k] among all symbols read at step k−1. If there is a node (q′, k) that provides a better L[q, k], that is, if the objective at step k is maximized by not reading any symbol and doing instead an ε-transition, then that option will be discovered at step 9 of the function relax and the value L[q, k] will be updated at step 10.

Fig. 5 The main matching algorithm. The main loop of lines 10-14 proceeds time-wise, updating at each step the objective estimation for all the states after symbol i. Once the "best" predecessor of a state has been found (line 11), the value of the objective for that state as well as its predecessor are updated (lines 12-13). At the end, w contains the final state that represents the end of the most likely path. The path can be reconstructed by following the predecessor pointers back to an initial state.

Fig. 6 The function that creates a path ending at a given state. The predicate initial is true if the parameter q is the initial state of the automaton.
Finally, step 17 of match will return the final state with the highest value of the objective function, that is, the state where the optimal accepting path ends.
At the end, q f contains the final state that represents the end of the most likely path. A simple recursive function (Fig. 6) can then be used to return the optimal path.
As an implementation note, observe that we have presented here a fairly naïve implementation of the algorithm, one that explicitly generates the whole graph. In a more optimized version, one can generate at step k only the states (q, k) with finite cost, and keep track only of the open paths for each node. This implementation is akin to the standard implementation of an NFA, with the additional complication that, if the optimal string is to be reported (as opposed to requiring a simple yes/no answer), one must keep track of the open paths.
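Under the uniform, independent model (26), the whole match/relax scheme fits in a short dynamic program over the Cartesian graph. The sketch below follows the description above (ε-edges relaxed to a fixed point, the initial state reset to 0 when its value is negative so that new paths can start at any position); the data representation and names are our assumptions, and the predecessor bookkeeping needed to reconstruct the path is omitted for brevity:

```python
import math

def match(delta, q0, F, nu, sigma_size):
    """Sketch of match/relax for the uniform, independent criterion:
    L[q][k] is the best objective over paths ending at vertex (q, k); a
    forward edge consuming symbol a at step k weighs log2(sigma_size *
    nu[k](a)); eps-edges weigh 0; L[q0][k] is reset to 0 whenever negative,
    starting a new candidate path there.  Returns (best L, state, end)."""
    EPS = ""
    states = ({q for (q, a, p) in delta} | {p for (q, a, p) in delta}
              | {q0} | set(F))
    n = len(nu)
    L = {q: [-math.inf] * (n + 1) for q in states}
    L[q0][0] = 0.0

    def relax(k):                          # eps-edges + path restarts
        if L[q0][k] < 0:
            L[q0][k] = 0.0                 # start a new path at (q0, k)
        changed = True
        while changed:                     # propagate eps to a fixed point
            changed = False
            for (q, a, p) in delta:
                if a == EPS and L[q][k] > L[p][k]:
                    L[p][k] = L[q][k]
                    changed = True

    relax(0)
    for k in range(n):                     # forward edges, time-wise
        for (q, a, p) in delta:
            if a != EPS and nu[k].get(a, 0) > 0:
                cand = L[q][k] + math.log2(sigma_size * nu[k][a])
                if cand > L[p][k + 1]:
                    L[p][k + 1] = cand
        relax(k + 1)
    # the best final vertex is the end of the most likely matching sub-string
    return max((L[q][k], q, k) for q in F for k in range(n + 1))
```

As in the text, returning the final vertex with the highest objective gives the end of the optimal accepting path; following predecessor pointers (not kept here) would yield the path itself.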
The following property is an easy consequence of the maximization of L.

Theorem 3 Let φ be an expression, N the associated NFA, and ν a series of L observations of a string. If ω_s = mstr(C, path(match(N, ν, L))), then ω_s ⊨ φ.

Example II Consider, once more, the regular expression of Example I. We detect eight symbols from the alphabet Σ = {a, b}, with probabilities as in Table 2.
We assume that the symbols are independent and equiprobable, so we can apply (26). Before the first iteration, the first column of the graph has been initialized as shown; note that the state q_1 has value 0, as the ε-edge that joins it to q_0 has been relaxed. During the second iteration, we consider the edges to the second column: the value of each node is updated by adding the weight of each incoming edge to the value of its predecessor and taking the maximum. If the start state has a negative value, or if some value may be increased by traversing an ε-edge, this is done in the function relax. The state q_0 had a value of −∞, so it has been reset to 0, making it possible to start a new path. Continuing until all the symbols have been processed, we arrive at the graph of Fig. 7. The state at the bottom-right of the figure, with a value L = 3.56, is the final state where the optimal path ends. The optimal path is indicated by double arrows. Note that it does not extend to the whole input: it begins at step 3, and corresponds to the input ababb. The reason for this is the very low probability of the symbol a in step 3, a symbol that would be necessary to continue the sequence ab that began in the first step. At the third step, the state that precedes q_0 and that would have to transition to q_0 to permit the continuation (state q_2) has a value 0.68; given that a has a probability of only 0.01, this gives

L[q_0, 3] = L[q_2, 2] + log (|Σ| ν(a)) = 0.68 + log (2 · 0.01) = −4.96.   (30)

This negative value causes (q_0, 3) to be reset to 0 in the function relax and a new path to be started.

Probability of misdetection
In this section we are interested in studying some illustrative examples of detection error. With reference to Fig. 1, we assume that the module M emits a string ω = a_0 ⋯ a_{L−1} and that an initial sub-string of ω matches φ. We introduce some error in N, and we are interested in determining under which conditions the algorithm will misclassify, that is, under which conditions it will estimate some ω′ ≠ ω as the best match for the expression. Note that this can be seen as a constrained estimation problem: we estimate ω based not only on the probabilities ν(a), but also on the constraint that our estimate ω′ must be such that ω′ ⊨ φ.
Example III Consider the following situation: we have an alphabet Σ with a, b ∈ Σ, and the expression φ = a*. Assume that the module M of Fig. 1 emits the string a ⋯ ab, where the symbol a is repeated n times. The detection probabilities are assumed to be constant, independent of the position. We are interested in analyzing the following two scenarios: (i) the symbol "b" in ω is correctly detected; consequently, the algorithm will detect that ω does not match φ but that ω[0:n] ⊨ φ (correct classification); or (ii) the symbol "b" is misinterpreted as an "a", in which case the algorithm will match the whole ω to φ (misclassification).
In the first case, the value of the objective function will be L; the second will give a value L′. The algorithm will produce the solution (ii) (viz., it will misclassify) if L′ − L > 0. If c is relatively high, then the probability of confusion is small; the algorithm will assume that the last symbol is a "b" and match (correctly) the shorter string. On the other hand, if c is small, then the uncertainty on the symbol that has actually been emitted is higher, and the hypothesis that the symbol is actually an "a" gives a higher value of the objective function, as it permits the identification of a longer string. Note that in this case the threshold at which misclassification occurs is independent of n, the length of the string.

Example IV
In this example, we consider a case of considerable interest in applications: spike noise (noise on a single symbol). Consider again the expression φ = a* and the string a^n a a^m. We call central a the symbol that comes between the two sequences a^n and a^m, and we are interested in determining the effect of spike noise in the central a on the detection of the string. Assume, for the sake of this example, that we are interested only in detecting initial sub-strings of ω. Suppose that the observations are the same for all symbols except the central a. In our scenario, most of the symbols are detected with low noise, in particular c > 1/|Σ|, while at the central a the noise spikes, that is, c′ < c. The scenarios in which we are interested are the following: (i) the central a is mistakenly interpreted as a different symbol, and the algorithm chooses a^n as the best initial matching string; (ii) the central a is correctly interpreted, and the algorithm identifies a^{n+m+1} as the best initial matching string.

Fig. 8 The relation (39): |Σ|c′ is represented as a function of |Σ|c for m = 1, 2, 5, 10, 20. The portion above each curve corresponds to the area in which the correct decision is made. Note that if the string that follows the spike (of length m) is short, the wrong interpretation will prevail for relatively small errors, but as m grows, matching becomes more robust, and the correct interpretation is maintained for larger errors (viz. smaller c′).
The value of the objective function in the first case is L, and in the second L′; the correct interpretation is chosen if L′ > L, a relation that we label (39). Note that the value of c′ for which misinterpretation occurs does not depend on n, that is, it does not depend on the part of the string before the noise spike, as this part contributes equally to both scenarios. It does, on the other hand, depend on m, that is, on the length of the portion of the string that follows the spike. The relation (39) is illustrated in Fig. 8. The condition c > 1/|Σ| translates to |Σ|c > 1, hence the lower limit of the abscissas.
For constant |Σ|, the limiting value of c′ decreases when c increases, as well as when m does. In other words, we can tolerate more noise in the central a if we have a smaller error on the other symbols or if the input string is longer: both cases provide more evidence that the whole string matched φ, thus offsetting the effects of uncertainty on the central a.

Fig. 9 Updating the estimated probability of reaching the state (q, t) through an ε-transition. The structure of the graph fragment that we are considering is shown in (a). In (b), the value p[q, t] is the estimated probability of reaching state (q, t) without considering the ε-transition, and p[q′, t] is the probability of reaching (q′, t), the source of the transition. In (c) the updated probabilities are shown.
Also, all else being equal, the threshold value for c′ behaves as c′ ∼ |Σ|^{−(m−1)}, that is, it decreases as |Σ| increases. This is due mostly to the characteristics of our setup: the probability of observing the correct symbol is held fixed at c so, as |Σ| increases, the probability of each incorrect one decreases as 1/(|Σ| − 1).

Remark 1
This example, its simplicity notwithstanding, is quite general. Each time we have an expression φ and strings ω, ω′ such that ω ⊨ φ and ωω′ ⊨ φ, and an error spike on a symbol of ω′, the considerations of this example apply with m = |ω′|.

Finite match
We now turn to the second problem introduced in Sect. 4: finite match. Given the expression φ and an (unknown) string ω = a_0 · · · a_{L−1}, information about which is only available through the stochastic process ν, we want to know the probability that ω ⊨ φ. We begin by considering matching the whole string only; we then extend the method to determine the probability that (at least) a sub-string of ω matches φ.
We begin by determining, using the Cartesian graph, the probability that, starting from a state (q_s, t_s), we arrive at a state (q, t), t ≥ t_s. The structure of the algorithm is similar to that of the algorithm match of Fig. 5 but, in this case, instead of computing the value L[q, t] for each state, we compute the probability p[q, t] of reaching it. We begin by setting p[q_s, t_s] = 1 and p[q′, t′] = 0 for (q′, t′) ≠ (q_s, t_s). We then operate iteratively in two steps: the first is a relaxation function that corrects the probability of reaching (q, t) from another state (q′, t) through an ε-edge, that is, the function operates the transformation shown in Fig. 9.
We assume that in the previous step we had already estimated the probability of arriving at (q, t) from states of type (q′, t − 1). The relaxation step updates the estimate by taking the ε-transition into account; this entails, coherently with our model, that the probability of executing an ε-transition is 1. The second step is a forward projection step, in which we estimate the probability of reaching states at t + 1 based on the probabilities at t. The projection operation is shown in Fig. 10: the probability of reaching (q, t + 1) is a weighted sum of the probabilities of reaching the abutting states (q_u, t), each weighted by the probability of observing in input the symbol that causes the transition (q_u, t) → (q, t + 1).

Fig. 10 Updating the estimated probability of reaching the state (q, t + 1) from the states (q_1, t), . . . , (q_n, t). The structure of the graph fragment that we are considering is shown in (a). In (b), the values p[q_u, t] have been estimated at a previous step. In (c) the probability of reaching (q, t + 1) is estimated (without considering the ε-transitions between states at t + 1) as the weighted sum of the probabilities of reaching (q_u, t), weighted by the probability of having observed the input that causes the transition (q_u, t) → (q, t + 1)

This procedure, alternating forward projections and relaxations, correctly determines, given the start state, the probability of reaching any other state, with one exception. If a portion of the graph contains a configuration in which a state with probability p_1 has two outgoing edges labeled with the same symbol a towards two states with probabilities p_2 and p_3, the second also reachable from the first through an ε-transition, then it is easy to see that p_2 = p_3 = p_1 · ν[k](a), while our recursion computes p_3 = 2 · p_1 · ν[k](a). This configuration, however, is never encountered, as the automata that we are considering are simple (see Definition 2). The algorithm takes in input a point t of the input string and the initial state q_0, and produces an array p with the probabilities of reaching the other states: that is, p[q, t′], t′ > t, is the probability of reaching (q, t′) starting from (q_0, t).

Fig. 11 The relaxation function for the probability determination algorithm. The set of states is topologically sorted using the graph induced by the ε-transitions, and each node propagates its probability value to its followers in that order

Fig. 12 The main function for determining the probability of matching. The initial node is (q_0, t_0), which is reached with probability 1 (set in line 7). The following loop (lines 9–16) goes one step at a time, updating at each step the probability that a state is reached through a non-ε symbol (loop of lines 10–14) or through an ε-transition (relax of line 15)

The function prelax, analogous to relax of Fig. 3, works on a topological ordering of the sub-graph of the NFA induced by the ε-transitions. The ε-transitions are acyclic, so the set of states of the NFA with the edges corresponding to the ε-transitions forms a DAG, and the topological ordering is well defined. The function eps_sort (not described here) returns the list of states topologically sorted (Fig. 11).
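The two alternating steps can be sketched in Python. This is our own rendering, not the paper's code: the NFA encoding (the `eps` and `delta` field names), the dictionary-based observation format, and the function names are assumptions made for illustration; the automaton is assumed simple, as required above.

```python
from collections import defaultdict

def match_prob(nfa, obs, q0, t0=0):
    """Estimate p[(q, t)]: the probability of reaching state (q, t) of
    the Cartesian graph starting from (q0, t0).

    nfa['eps']   : list of ε-edges (q_src, q_dst) in topological order
    nfa['delta'] : list of symbol edges (q_src, a, q_dst)
    obs[t][a]    : probability that symbol a is observed at position t
    """
    p = defaultdict(float)
    p[(q0, t0)] = 1.0

    def prelax(t):
        # relaxation: an ε-transition is executed with probability 1,
        # so the source's probability is added to the destination's
        for qs, qd in nfa['eps']:
            p[(qd, t)] += p[(qs, t)]

    prelax(t0)
    for t in range(t0, len(obs)):
        # forward projection: weight each edge by the probability of
        # having observed the symbol that triggers it
        for qs, a, qd in nfa['delta']:
            p[(qd, t + 1)] += p[(qs, t)] * obs[t].get(a, 0.0)
        prelax(t + 1)
    return p
```

For instance, for φ = a* encoded as a single state with an a-labeled self-loop and two observations giving a probability 0.9, the probability of reaching the state at t = 2 is 0.9 · 0.9 = 0.81.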
The main function, MatchProb, takes an initial node of the Cartesian graph and determines the probability of reaching all the other nodes that can be reached from the initial one (Fig. 12).
The probability that the whole string matches the expression is the probability that, starting at the first symbol, one reaches a final state at time L, L being the length of the string. To determine the probability that a sub-string matches, let p_k be the array computed by the algorithm started at symbol number k. Then p_k[q, t] = 0 for t < k and, for t ≥ k, p_k[q, t] is the probability that, starting from state q_0 at symbol number k and based on the observations, the unknown sub-string ω[k:t] will lead to state q. The probability that ω[k:t] ⊨ φ is therefore the sum of the p_k[q, t] over the final states q. From these values we obtain the probability that at least one of the sub-strings ω[k:t], t > k, matches the expression and, finally, the probability that at least one sub-string matches the expression.
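As a sanity check, the whole-string probability can also be computed by brute force: sum, over every string the observations could encode, the string's probability times a 0/1 match indicator. The sketch below is our own illustration (it delegates matching to Python's re engine rather than the paper's NFA, and is exponential in the string length, so it is only usable on tiny inputs):

```python
import itertools
import re

def whole_match_prob(pattern, obs, alphabet):
    """P(omega |= phi) by exhaustive enumeration: for each candidate
    string, multiply the per-position observation probabilities and
    add the product to the total if the string matches the pattern.
    The Cartesian-graph DP computes the same quantity efficiently."""
    total = 0.0
    for chars in itertools.product(alphabet, repeat=len(obs)):
        prob = 1.0
        for t, a in enumerate(chars):
            prob *= obs[t].get(a, 0.0)
        if re.fullmatch(pattern, ''.join(chars)):
            total += prob
    return total
```

With two observations assigning probability 0.9 to a and 0.1 to b, the pattern a* yields 0.81, in agreement with the dynamic-programming computation.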

Remark 2
The applicability of the probability approach is limited in the case of expressions that can be satisfied by short strings. In this case, even if the probability of seeing the right symbol is relatively low, the sheer number of possible short matching sub-strings makes the probability of at least one match quite high.
Example V Consider again the expression φ ≡ a* and the string ω = a^n, with a ∈ Σ and ν[k](a) = c for k = 0, . . . , L − 1.
If we take a specific one-symbol sub-string, say ω_k = a, we have P(ω_k ⊨ φ) = c. There are L such sub-strings at independent positions, so the probability that at least one of them matches φ is 1 − (1 − c)^L. For a specific two-symbol sub-string we have P(ω_k ω_{k+1} ⊨ φ) = c² and there are L − 1 such strings. In general, there are L − k + 1 sub-strings of k symbols, each matching φ with probability c^k, from which we obtain the probability that at least one k-symbol sub-string matches φ and, combining over k, the probability P_M that at least one sub-string matches the expression. Figure 13a shows the behavior of P_M as a function of c for various values of L. In order to have a better view of the speed of convergence of the function, in Fig. 13b we show the value log(1 − P_M), which converges to −∞ as P_M converges to 1.
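For φ = a* in particular, every matching sub-string contains a one-symbol match, so at least one sub-string matches exactly when at least one position carries an a; with independent positions this gives P_M = 1 − (1 − c)^L. A quick Monte Carlo simulation (our own illustrative code, under that independence assumption) confirms the closed form:

```python
import random

def p_match_mc(c, L, trials=200_000, seed=0):
    """Monte Carlo estimate of the probability that at least one
    sub-string of an L-symbol string matches a*, when each position
    independently carries an 'a' with probability c."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < c for _ in range(L))  # some position is 'a'
        for _ in range(trials)
    )
    return hits / trials
```

For c = 0.4 and L = 4 the estimate agrees with 1 − 0.6^4 ≈ 0.87, illustrating how quickly the match probability saturates for expressions satisfied by short strings.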
The probability of having at least one match is very close to 1 for n > 4 or c > 0.4; this constantly high probability limits the discriminating power of the probability test.

Remark 3
The problem highlighted in the previous example is present only for expressions that can be matched by short strings. In the example, most of the probability of a match is due to the probability of matching one-symbol strings: Fig. 14 shows (P − P_1)/P.
In most cases, the error that one would commit by replacing P with P_1 is less than 10%; this entails that the probability method is viable for expressions that do not match short strings, as the fast convergence to probability 1 would not occur in those cases.
If short matching strings are common, a viable solution for practical applications is to find the best matching sub-string ω[i:j] and use the value L_{i,j}(ω) as an indicator of the likelihood of matching. We shall not pursue this possibility in this paper.

Infinite streams
Many applications, especially on-line applications, require the detection of certain combinations of symbols in an infinite stream of data. Most of these applications are real-time and use terminology a bit different from ours: what we have called symbols are often elementary events detected in the stream, and our position in the string corresponds to the time of detection (in a discrete-time system).
In the case of infinite streams, we are not interested in finding the one sub-string that best matches the expression: in general there will be infinitely many strings in different parts of the stream, possibly partially overlapping, that match the expression. We are interested in catching them all. This multiplicity causes several problems for the definition of a proper semantics for collecting matching strings (many problems arise out of having to decide what to do when matching strings overlap) which, in turn, may cause decidability issues [25]. We shall not consider those issues here, as they are orthogonal to the problems caused by uncertainty: if we can solve the basic problem of deciding whether ω ⊨_w φ under uncertainty, then all the problems related to the definition of a proper semantics in a stream can be worked out using the theory in [25] (in which these problems were considered under the hypothesis of no uncertainty).
In the case of streams, we are not typically interested in strong semantics, which represents too strong a condition for practical applications. Given a (finite) portion ω of the stream such that ω ⊨_β φ, it is clearly undecidable whether, at some future time, there will be a portion ω′ such that ω′ ⊨_{β′} φ with β′ > β. Moreover, in streams we are interested in determining a collection of finite strings that match the expression, so the use of an absolute criterion such as the strong semantics (only one string can match the expression in the strong sense) is not very useful.
We shall therefore make use of the weak semantics throughout this section. Since the stream is infinite and we are interested in chunks of it, we shall assume, without loss of generality, that the strings we are testing start at the beginning of the relevant part of the stream, that is, all the strings that we test are sub-strings of type ω[ :k ].
The problem we are interested in is therefore the following: Stream-Weak: Given a string ω, an expression φ, and an infinite stream of observations ν, is it the case that ω ⊨_w φ?
As we mentioned, we assume that if |ω| = L, the recognition of ω is based on the first L observations of the stream ν.
Our first result is a simple and negative one.

Theorem 4 Stream-Weak is undecidable.
Proof Suppose the problem is decidable. Then there is an algorithm A such that, for each expression φ, observations ν, and string ω, A(φ, ν, ω) stops in finite time with "yes" if ω ⊨_w φ, and with "no" otherwise. Consider the expression φ ≡ a* and an alphabet Σ with |Σ| > 1 and a ∈ Σ. Suppose that the observations are such that L(a^L) = β and, for k > L, ν[k](a) = q < 1/|Σ|, so that L(a^N) < β for all N > L and therefore a^L ⊨_w φ. Since the algorithm is correct, it will stop after a finite number M of steps on "yes", having inspected only the first M observations. Note that L(a^M) = β′ < β. Consider now observations ν′ that coincide with ν on the first M elements but with ν′[k](a) > 1/|Σ| for k > M, so that for some N > M we have L(a^N) > β, and a^L does not weakly match φ under ν′. On the other hand, A works as in the previous case on the same data: it will only visit at most M elements of ν′, so it will stop on "yes", contradicting the hypothesis that it is correct.

Remark 4
Note that we have proven something stronger than undecidability: undecidability is related to Turing machines, while we have proven that with the available information no finite method can decide the problem, that is, we have proven the unrealizability of the problem [4,22]. In terms of the Cartesian graph, a^L corresponds to a path π_L, and decidability depends on the fact that, in order to check matching, we only have to extend the path up to a finite point: after that, the value of the objective in all paths that extend π_L is −∞.
The presence of zero-valued observations, even an infinite number of them, does not always guarantee decidability.

Example VI
For each a ∈ Σ, call [a] the list, in increasing order, of the indices k such that ν[k](a) = 0; [a]|n is the portion of the list [a] with indices greater than n, that is, the list of indices k > n such that ν[k](a) = 0. From these lists, we build a list Λ as follows:

1. Λ ← [ ]
2. while true do
3.   a ← uniform(Σ)
4.   k ← k_a, the first element of [a]
5.   Λ ← Λ · k
6.   for a′ in Σ do
7.     [a′] ← [a′]|k
8.   od
9. od

where uniform(Σ) is a function that picks an element of Σ at random with uniform distribution. The list Λ is a list of indices such that, for each k ∈ Λ, there is an a ∈ Σ with ν[k](a) = 0. The particularity of Λ is that we pick the indices in such a way that, for each k, the probability that ν[k](a) = 0 is uniform over Σ. The construction of the list is possible due to the hypothesis that each a has zero probability of observation infinitely often. Note that the list Λ is also infinite, so it can never be built completely and, consequently, the algorithm never stops. However, in the proof of the theorem we shall only use finite parts of Λ, so one can imagine a lazy evaluation of the algorithm that only computes the portions that we need for the proof. From the construction of Λ, it is easy to see that, for each ξ and each a, with probability 1 there is a k ∈ Λ, k > ξ, with ν[k](a) = 0.

Lemma 4 Let π be a path in a Cartesian graph, ω[π] ∈ Σ* the string that causes it to be followed, and let L(π) > −∞. Then, with probability 1, π is finite.
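The lazy evaluation mentioned above can be sketched as a Python generator. This is our own rendering of the construction: the per-symbol zero-index lists are passed in as increasing iterators, which here stand in for the infinite lists; names and signature are illustrative assumptions.

```python
import random

def build_lambda(zero_lists, alphabet, seed=0):
    """Lazily build the list Λ. zero_lists[a] is an increasing iterator
    over the indices k with ν[k](a) = 0. At each step, draw a symbol
    uniformly, emit the first remaining zero-index of that symbol, then
    trim every per-symbol list to indices greater than the one emitted."""
    rng = random.Random(seed)
    lists = {a: iter(zero_lists[a]) for a in alphabet}
    heads = {a: next(lists[a]) for a in alphabet}
    while True:
        a = rng.choice(alphabet)   # a ← uniform(Σ)
        k = heads[a]               # k ← k_a, first element of [a]
        yield k                    # Λ ← Λ · k
        for b in alphabet:         # [a′] ← [a′]|k for every a′
            while heads[b] <= k:
                heads[b] = next(lists[b])
```

Because every list is trimmed past the emitted index, the indices in Λ come out strictly increasing, and only the finite prefix actually requested is ever computed.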

Proof of Theorem 5
Let ω ⊨_β φ; it is ω ⊨_w φ if and only if there is no ω′ such that ωω′ ⊨_{β′} φ with β′ > β. Any such string ωω′ corresponds to a path π that, because of Lemma 4, with probability 1 is finite, so the hypothesis ω ⊨_w φ can be checked in finite time with respect to ω′. With probability 1, there is a finite number of such finite paths, so the hypothesis can be checked, with probability 1, in finite time.

Conclusions
In this paper we have considered the problem of detecting whether a string (or part of it) matches a regular expression when the symbols that we observe are subject to uncertainty. The main contributions of this paper are two: on the one hand, we consider the problem of matching the most likely sub-string of the input, a problem of considerable interest in applications, as the duration of the event that one wants to detect may be unknown, and different events of interest may have overlapping structures. We have seen that considering sub-strings produces a bias towards shorter strings, a bias that can be compensated by minimizing the residual information, that is, the information carried by the string that matching does not recover. On the other hand, we show that optimal detection in an infinite stream is undecidable, but becomes decidable with probability one under hypotheses often met in practical applications.
The regular expressions that we are presenting here are quite limited. In particular, they do not allow an efficient definition of counting (expressions like a^[n,m], which is matched if the string contains between n and m symbols a). In principle, regular expressions do allow counting, as the previous expression is equivalent to

a^n (ε + a + aa + · · · + a^{m−n})    (64)

but the implementation of such an expression is so inefficient as to make it impractical in all but the most trivial cases. One possibility is to introduce counting as part of a more general algebra (e.g., a query algebra) of which matching is part. In the example above, the query would be translated into a query with a* as a regular expression plus a condition on the result to ensure that the number of as is the desired one. This is not an optimal solution, and the efficient integration of better solutions in the framework presented here is still an open problem.
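The expansion (64) can be written down mechanically; the sketch below (our own illustration, using Python's re only to check the result) makes the inefficiency evident, since the pattern grows with m while the counted form needs only two integers:

```python
import re

def counting(n, m, sym='a'):
    """Expand sym[n,m] into the equivalent plain regular expression
    sym^n (ε + sym + ... + sym^(m-n)) of Eq. (64); the empty
    alternative in the union encodes ε."""
    tail = '|'.join(sym * i for i in range(m - n + 1))
    return sym * n + '(' + tail + ')'
```

For instance, counting(2, 5) produces aa(|a|aa|aaa), which matches between two and five as. Engines such as Python's re support the counted form a{2,5} natively, which is exactly the kind of efficient integration that remains open in our framework.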
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.