Encyclopedia of Algorithms

Living Edition
Editors: Ming-Yang Kao

Parameterized Pattern Matching

Living reference work entry
DOI: https://doi.org/10.1007/978-3-642-27848-8_282-2

Keywords

Pattern Match; Edit Distance; Parameterized Text; String Match; Suffix Tree

Years and Authors of Summarized Work

1996; Baker
1995; Kosaraju
1994; Amir, Farach, Muthukrishnan

Problem Definition

Parameterized strings, or p-strings, are strings that contain both ordinary symbols from an alphabet \(\Sigma \) and parameter symbols from an alphabet \(\Pi \). Two equal-length p-strings s and s′ are a parameterized match, or p-match, if one p-string can be transformed into the other by applying a one-to-one function that renames the parameter symbols. The following example shows a p-match with both ordinary and parameter symbols. The ordinary symbols are in lowercase and the parameter symbols are in uppercase:

$$\displaystyle\begin{array}{rcl} s& =& A\,\,b\,\,A\,\,b\,\,C\,\,A\,\,d\,\,b\,\,A\,\,C\,\,d\,\,d \\ s'& =& D\,\,b\,\,D\,\,b\,\,E\,\,D\,\,d\,\,b\,\,D\,\,E\,\,d\,\,d \\ \end{array}$$
In some of the problems to be considered, it will be sufficient to solve for p-strings in which all symbols are parameter symbols, as this is the more difficult part of the problem; in other words, the case in which \(\Sigma =\emptyset \). In this case, the definition can be reformulated so that s and s′ are a p-match if there exists a bijection \(\pi :\Pi _{s} \rightarrow \Pi _{s'}\), where \(\Pi _{s}\) and \(\Pi _{s'}\) denote the sets of symbols appearing in s and s′, such that π(s) = s′, where π(s) is the renaming of each character of s via π.
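
The definition can be checked directly by building the renaming function symbol by symbol. The following is a minimal sketch, not taken from the cited papers; the function name `is_p_match` and the convention of passing the parameter alphabet as a Python set `params` are assumptions made for illustration, and the two alphabets are assumed to be disjoint.

```python
def is_p_match(s, t, params):
    """Return True if the equal-length p-strings s and t p-match.

    `params` is the set of parameter symbols (the alphabet Pi). Every other
    symbol is treated as an ordinary symbol and must match exactly.
    Assumption: ordinary and parameter alphabets are disjoint.
    """
    if len(s) != len(t):
        return False
    forward, backward = {}, {}  # the renaming and its inverse
    for a, b in zip(s, t):
        if a in params and b in params:
            # setdefault stores the first pairing and returns it afterwards,
            # so a conflicting later pairing is detected in either direction.
            if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
                return False
        elif a != b:
            # Ordinary symbols must be identical, and a parameter symbol
            # never matches an ordinary one.
            return False
    return True
```

With the example strings above and the parameter set {A, C, D, E}, `is_p_match("AbAbCAdbACdd", "DbDbEDdbDEdd", set("ACDE"))` returns `True`, using the renaming A to D and C to E.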

The following problems will be considered. Parameterized matching: given a parameterized pattern p of length m and a parameterized text t of length n, find all locations i of t for which p p-matches \(t_{i}\ldots t_{i+m-1}\). The same problem is also considered in two dimensions. Approximate parameterized matching: find all substrings of a parameterized text t that are approximate parameterized matches of a parameterized pattern p (to be fully defined later).

Key Results

Baker [4] introduced parameterized matching in the framework of her seminal work on discovering duplicate code within large programs for the sake of code minimization. An example of two code fragments that p-match taken from the X Windows system can be found in [4].

Parameterized Suffix Trees

In [4] and in the follow-up journal versions [6, 7], a novel method was presented for parameterized matching by constructing parameterized suffix trees. The advantage of the parameterized suffix tree is that it supports indexing, i.e., one can preprocess a text and subsequently answer parameterized queries p in O(\(\vert p\vert\)) time. In order to achieve parameterized suffix trees, it is necessary to introduce the concept of a predecessor string. A predecessor string of a string s has at each location i the distance between i and the location containing the previous appearance of the symbol. The first appearance of each symbol is replaced with a 0. For example, the predecessor string of aabbaba is 0,1,0,1,3,2,2. A simple and well-known fact is that:

Observation 1 ([7]).

s and s′ p-match if and only if they have the same predecessor string.

Notice that this implies transitivity of parameterized matching, since if s and s′ p-match and s′ and s′′ p-match, then, by the observation, s and s′ have the same predecessor string and, likewise, s′ and s′′ have the same predecessor string. This implies that s and s′′ have the same predecessor string and hence, by the observation, p-match.
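
The predecessor string and the test suggested by Observation 1 can be sketched in a few lines. This is an illustrative sketch for the case in which all symbols are parameter symbols; the function names `predecessor_string` and `p_match_by_predecessor` are hypothetical, and a hash table is assumed for looking up the previous occurrence of a symbol.

```python
def predecessor_string(s):
    """Predecessor string of s: at each position, the distance to the
    previous occurrence of the same symbol, or 0 for a first occurrence."""
    last_seen = {}  # symbol -> index of its most recent occurrence
    codes = []
    for i, c in enumerate(s):
        codes.append(i - last_seen[c] if c in last_seen else 0)
        last_seen[c] = i
    return codes


def p_match_by_predecessor(s, t):
    """Observation 1: s and t p-match iff their predecessor strings agree."""
    return predecessor_string(s) == predecessor_string(t)
```

For the example in the text, `predecessor_string("aabbaba")` returns `[0, 1, 0, 1, 3, 2, 2]`.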

Moreover, one may also observe that if r is a prefix of s, then the predecessor string of r, by definition, is exactly the \(\vert r\vert\)-length prefix of the predecessor string of s. Hence, similar to regular pattern matching, a parameterized pattern p p-matches at location i of t if and only if the \(\vert p\vert\)-length predecessor string of p is equal to the \(\vert p\vert\)-length prefix of the predecessor string of the suffix \(t_{i}\ldots t_{n}\). Combining these observations, it is natural to do as follows: create a (parameterized suffix) tree with a leaf for each suffix where the path from the root to the leaf corresponding to a given suffix will have its predecessor string labeling the path. Branching in the parameterized suffix tree, as with suffix trees, occurs according to the labels of the predecessor strings. See [4, 6, 7] for an example.
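
The indexing idea can be illustrated without the tree itself. The sketch below is not the construction of [4, 6, 7]: as a simplifying assumption it sorts the predecessor strings of all suffixes (a quadratic-time stand-in for the parameterized suffix tree) and answers a query by binary search on the predecessor string of the pattern. All function names are hypothetical, and all symbols are assumed to be parameter symbols.

```python
from bisect import bisect_left


def predecessor_string(s):
    """Distance to the previous occurrence of each symbol, 0 if none."""
    last_seen, codes = {}, []
    for i, c in enumerate(s):
        codes.append(i - last_seen[c] if c in last_seen else 0)
        last_seen[c] = i
    return codes


def build_index(t):
    """Sort the predecessor strings of all suffixes of t.

    Each suffix is encoded on its own, so occurrences before the start of
    the suffix are ignored, as the definition requires.
    """
    pairs = sorted((predecessor_string(t[i:]), i) for i in range(len(t)))
    codes = [code for code, _ in pairs]
    starts = [start for _, start in pairs]
    return codes, starts


def query(index, p):
    """Return the sorted 0-based start positions at which p p-matches."""
    codes, starts = index
    q = predecessor_string(p)
    m = len(q)
    pos = bisect_left(codes, q)  # first suffix code that is >= q
    hits = []
    # Suffix codes having q as a prefix form a contiguous block from pos.
    while pos < len(codes) and codes[pos][:m] == q:
        hits.append(starts[pos])
        pos += 1
    return sorted(hits)
```

For instance, with `index = build_index("xyxzz")`, the call `query(index, "aba")` returns `[0]`, since only the window `xyx` has the predecessor string 0,0,2.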

Baker’s method essentially mimics the McCreight suffix tree construction [18]. However, while the suffix tree and the parameterized suffix tree are very similar, there is a slight hitch. A strong component of the suffix tree construction is the suffix link. This is used for the construction and, sometimes, for later pattern searches. The suffix link is based on the distinct right context property, which does not hold for the parameterized suffix tree. In fact, the node that is pointed to by the suffix link may not even exist. The main parts of [6, 7] are dedicated to circumventing this problem.

In [7] Baker added the notion of “bad” suffix links, which point to the vertex just above, i.e., closer to the root than the desired place, and of updating them with a lazy evaluation when they are used. The algorithm runs in time \(O(n\vert \Pi \vert \log \vert \Sigma \vert )\). In [6] (which is chronologically later than [7] despite being the first to appear) Baker changed the definition of “bad” suffix links to point to just below the desired place. This turns out to have nice properties, and one can use more sophisticated data structures to improve the construction time to \(O(n(\vert \Pi \vert + \log \vert \Sigma \vert ))\).

Kosaraju [16] made a careful analysis of Baker’s properties utilized in the algorithm of [6] which suffer from the \(\vert \Pi \vert\) factor. He pointed out two sources for this large factor. He handled these two issues by using a concatenable queue and maintaining it in a lazy manner. This is sufficient to reduce the \(\vert \Pi \vert\) factor to a \(\log \vert \Pi \vert\) factor, yielding an algorithm of time \(O(n(\log \vert \Pi \vert +\log \vert \Sigma \vert ))\).

If the alphabet or the parameter set is large, the construction time may be as high as \(O(n\log n)\). Cole and Hariharan [9] showed how to construct the parameterized suffix tree in randomized O(n) time for alphabets and parameters taken from a polynomially sized range, e.g., \([1,\ldots ,n^{c}]\). They did this by adding additional nodes to the tree in a back-propagation manner that is reminiscent of fractional cascading. They showed that this adds only O(n) nodes and allows the updating of the missing suffix links. However, this causes other problems; the details of how they are handled can be found in their paper.

More Methods for Parameterized Matching

Obviously the parameterized suffix tree efficiently solves the parameterized matching problem. Nevertheless, a couple of other results on parameterized matching are worth mentioning.

First, in [6] it was shown how to construct the parameterized suffix tree for the pattern and then to run the parameterized text through it, giving an algorithm with O(m) space instead of O(n).

Amir et al. [2] presented a simple method to solve the parameterized matching problem by mimicking the algorithm of Knuth, Morris, and Pratt. Their algorithm works in \(O(n\cdot \min (\log \vert \Pi \vert ,m))\) time, independent of the alphabet size \(\vert \Sigma \vert\). Moreover, they proved that the log factor cannot be avoided for large symbol sets.
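
A Knuth-Morris-Pratt-style matcher over predecessor strings can be sketched as follows. This is a sketch in the spirit of [2], not a transcription of it: it assumes that all symbols are parameter symbols, it uses a hash table to find previous occurrences (so the \(\log \vert \Pi \vert\) factor is not modeled), and the names `p_kmp_search` and `agrees` are hypothetical. The key point is that a predecessor value pointing to before the start of the current window must be treated as 0.

```python
def predecessor_string(s):
    """Distance to the previous occurrence of each symbol, 0 if none."""
    last_seen, codes = {}, []
    for i, c in enumerate(s):
        codes.append(i - last_seen[c] if c in last_seen else 0)
        last_seen[c] = i
    return codes


def p_kmp_search(p, t):
    """Return all 0-based positions at which p p-matches a window of t."""
    m, n = len(p), len(t)
    if m == 0:
        return list(range(n + 1))
    pp, tp = predecessor_string(p), predecessor_string(t)

    def agrees(codes, pos, k):
        # Can codes[pos] extend a p-match whose window began k symbols
        # earlier? A predecessor outside the window counts as 0.
        value = codes[pos] if codes[pos] <= k else 0
        return value == pp[k]

    # border[j]: length of the longest proper prefix of p that p-matches
    # a suffix of p[:j].
    border = [0] * (m + 1)
    border[0] = -1
    k = -1
    for j in range(m):
        while k >= 0 and not agrees(pp, j, k):
            k = border[k]
        k += 1
        border[j + 1] = k

    hits, k = [], 0
    for i in range(n):
        # Fall back after a full match or on a disagreement.
        while k >= 0 and (k == m or not agrees(tp, i, k)):
            k = border[k]
        k += 1
        if k == m:
            hits.append(i - m + 1)
    return hits
```

For example, `p_kmp_search("aba", "xyxzz")` returns `[0]`. Correctness of the fallback step relies on the transitivity of p-matching noted after Observation 1.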

In [5] parameterized matching was solved with a Boyer-Moore type algorithm. In [10] the problem was solved with a Shift-Or type algorithm. Both handle the average case efficiently. In [10] emphasis was also put on the case of multiple parameterized matching, which was previously solved in [14] with an Aho-Corasick automaton-style algorithm.

Two-Dimensional Parameterized Matching

Two-dimensional parameterized matching arises in applications of image searching; see [13] for more details. Two-dimensional parameterized matching is the natural extension of parameterized matching where one seeks p-matches of a two-dimensional parameterized pattern p within a two-dimensional parameterized text t. It must be pointed out that classical methods for two-dimensional pattern matching, such as the L-suffix tree method, fail for parameterized matching. This is because known methods tend to cut the text and pattern into pieces to avoid going out of the boundaries of the pattern. In exact matching this is fine, because each pattern piece can be individually compared (checked for equality) with a text piece. However, in parameterized matching, there is a strong dependency between the pieces.

In [1] an innovative solution for the problem was given, based on a collection of linearizations of the pattern and text with the property described next. Consider a linearization. Two elements with the same character, say “a,” in the pattern are defined to be neighbors if there is no other “a” between them in this linearization. Now take all the “a”s of the pattern and create a graph \(G_{a}\) with the “a”s as the nodes and an edge between two of them if they are neighbors in some linearization. We say that two “a”s are chained if there is a path from one to the other in \(G_{a}\). Applying one-dimensional parameterized matching on these linearizations ensures that any two elements that are chained will be evaluated to map to the same text value (the parameterized property). A collection of linearizations has the fully chained property if every two locations in p with the same character are chained. It was shown in [1] that one can obtain a collection of log m linearizations that is fully chained and that does not exceed pattern boundary limits. Each such linearization is solved with a convolution-based pattern-matching algorithm. This takes \(O(n^{2}\log m)\) time for each linearization, where the text size is \(n^{2}\). Hence, overall the time is \(O(n^{2}\log ^{2}m)\).

A different solution was proposed in [13], where it was shown that it is possible to solve the problem in \(O(n^{2} + m^{2.5}\mathrm{polylog}\ m)\) time, where the text size is \(O(n^{2})\) and the pattern size is \(O(m^{2})\). Clearly, this is more efficient for large texts.

Approximate Parameterized Matching

Our last topic relates to parameterized matching in the presence of errors. Errors occur in various applications and it is natural to consider parameterized matching with the Hamming distance metric or the edit distance metric.

In [8] the parameterized matching problem was considered in conjunction with the edit distance. Here the definition of edit distance was slightly modified so that the edit operations are defined to be insertion, deletion, and parameterized replacements, i.e., the replacement of a substring with a string that p-matches it. An algorithm for finding the “parameterized edit distance” of two strings was devised whose efficiency is close to the efficiency of the algorithms for computing the classical edit distance.

However, it turns out that the operation of parameterized replacement relaxes the problem to an easier problem. The reason that the problem becomes easier is that two substrings that participate in two parameterized replacements are independent of each other (in the parameterized sense).

A more rigid, but more realistic, definition for the Hamming distance variant was given in [3]. For a pair of equal-length strings s and s′ and a bijection π defined on the alphabet of s, the π-mismatch is the Hamming distance between the image of s under π and s′. The minimal π-mismatch over all bijections π is the approximate parameterized match. The problem considered in [3] is to find for each location i of a text t the approximate parameterized match of a pattern p with the substring beginning at location i. In [3] the problem was defined and linear-time algorithms were given for the case where the pattern is binary or the text is binary. However, this solution does not carry over to larger alphabets.
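
To make the definition concrete, the minimal π-mismatch of two short strings can be computed by trying every injective renaming. The sketch below is a brute-force illustration only, exponential in the alphabet size, and is not the algorithm of [3]; the function name `min_pi_mismatch` is hypothetical, and all symbols are assumed to be parameter symbols. It counts, for each candidate renaming, the positions that agree, and subtracts the best count from the string length.

```python
from collections import Counter
from itertools import permutations


def min_pi_mismatch(s, t):
    """Minimal Hamming distance between pi(s) and t over all injective
    renamings pi of the symbols of s (brute force, small alphabets only)."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # pair_count[(a, b)]: number of positions holding a in s and b in t.
    pair_count = Counter(zip(s, t))
    symbols_s = sorted(set(s))
    symbols_t = sorted(set(t))
    # Pad with placeholders so that every symbol of s can be sent somewhere;
    # a placeholder image agrees with no position of t.
    targets = symbols_t + [None] * max(0, len(symbols_s) - len(symbols_t))
    best_agreement = 0
    for image in permutations(targets, len(symbols_s)):
        agreement = sum(pair_count[(a, b)] for a, b in zip(symbols_s, image))
        best_agreement = max(best_agreement, agreement)
    return len(s) - best_agreement
```

For example, `min_pi_mismatch("aab", "xyy")` returns `1`: renaming a to x and b to y leaves a single mismatch at the second position, and no renaming achieves zero.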

Unfortunately, under this definition, the methods for classical string matching with errors for the Hamming distance, also known as pattern matching with mismatches, seem to fail. Following is an outline of a classical method [17] for pattern matching with mismatches that uses suffix trees.

The pattern is compared separately with each suffix of the text, beginning at locations \(1 \leq i \leq n - m + 1\). Using a suffix tree of the text and precomputed lowest common ancestor information (which can be computed once in linear time [11]), one can find the longest common prefix of the pattern and the corresponding suffix in constant time. There must be a mismatch immediately afterwards. The algorithm jumps over the mismatch and repeats the process, taking into consideration the offsets of the pattern and suffix.
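
The jump structure of this classical (non-parameterized) method can be sketched as follows. As a loudly stated simplification, the longest-common-prefix query is answered here by a naive character scan in the hypothetical helper `lcp_length`, standing in for the constant-time suffix tree and lowest common ancestor query; the sketch therefore shows the control flow only, not the stated running time. Positions are 0-based.

```python
def lcp_length(p, t, a, b):
    """Length of the longest common prefix of p[a:] and t[b:].

    Naive stand-in for the constant-time suffix tree + LCA query.
    """
    k = 0
    while a + k < len(p) and b + k < len(t) and p[a + k] == t[b + k]:
        k += 1
    return k


def match_with_mismatches(p, t, k):
    """Return all 0-based positions i where p matches t[i:i+len(p)] with
    at most k mismatches (ordinary Hamming distance, no renaming)."""
    m, n = len(p), len(t)
    hits = []
    for i in range(n - m + 1):
        j = 0        # number of pattern symbols handled so far
        errors = 0
        while True:
            j += lcp_length(p, t, j, i + j)  # extend over the matching run
            if j >= m:
                break                        # reached the end of the pattern
            errors += 1                      # a mismatch sits at offset j
            if errors > k:
                break
            j += 1                           # jump over the mismatch
        if errors <= k:
            hits.append(i)
    return hits
```

For example, `match_with_mismatches("abc", "abxabc", 1)` returns `[0, 3]`, while with `k = 0` only position 3 is reported.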

When attempting to apply this technique to a parameterized suffix tree, it fails. To illustrate this, consider the first matching substring (up until the first error) and the next matching substring (after the error). Both of these substrings p-match the substring of the text that they are aligned with. However, it is possible that combined they do not form a p-match. See the example below. In the example abab p-matches cdcd followed by a mismatch and subsequently followed by abaa p-matching efee. However, different π’s are required for the local p-matches. This example also emphasizes why the definition of [8] is a simplification. Specifically, each local p-matching substring is one replacement, i.e., abab with cdcd is one replacement and abaa with efee is one more replacement. However, the definition of [3] captures the globality of the parameterized matching, not allowing, in this case, abab to p-match to two different substrings.

$$\displaystyle\begin{array}{rcl} p& =& a\,\,b\,\,a\,\,b\,\,a\,\,a\,\,b\,\,a\,\,a\ldots \\ t& =& \ldots c\,\,d\,\,c\,\,d\,\,d\,\,e\,\,f\,\,e\,\,e\ldots \\ \end{array}$$
In [12] the problem of parameterized matching with k mismatches was considered. The parameterized matching problem with k mismatches seeks all locations i in text t where the minimal π-mismatch between p and \(t_{i}\ldots t_{i+m-1}\) is at most k. An \(O(nk^{1.5} + mk\log m)\) time algorithm was presented in [12]. At the base of the algorithm, i.e., for the case where \(\vert p\vert =\vert t\vert = m\), an \(O(m + k^{1.5})\) algorithm is used based on maximum matching algorithms. Then the algorithm uses a doubling scheme to handle the growing distance between potential parameterized matches (with at most k mismatches). Also shown in [12] is a strong relationship between maximum matching algorithms in sparse graphs and parameterized matching with k errors.

The rigid, but more realistic, definition for the Hamming distance version given in [3] can be naturally extended to the edit distance. It was later shown that this problem is NP-complete [15].

Applications

Parameterized matching has applications in code duplication detection in programming languages, in homework plagiarism detection, and in image processing, among others [1, 4].

Recommended Reading

  1. Amir A, Aumann Y, Cole R, Lewenstein M, Porat E (2003) Function matching: algorithms, applications and a lower bound. In: Proceedings of the 30th international colloquium on automata, languages and programming (ICALP), Eindhoven, pp 929–942
  2. Amir A, Farach M, Muthukrishnan S (1994) Alphabet dependence in parameterized matching. Inf Process Lett 49:111–115
  3. Apostolico A, Erdös P, Lewenstein M (2007) Parameterized matching with mismatches. J Discret Algorithms 5(1):135–140
  4. Baker BS (1993) A theory of parameterized pattern matching: algorithms and applications. In: Proceedings of the 25th annual ACM symposium on the theory of computation (STOC), San Diego, pp 71–80
  5. Baker BS (1995) Parameterized pattern matching by Boyer-Moore-type algorithms. In: Proceedings of the 6th annual ACM-SIAM symposium on discrete algorithms (SODA), San Francisco, pp 541–550
  6. Baker BS (1996) Parameterized pattern matching: algorithms and applications. J Comput Syst Sci 52(1):28–42
  7. Baker BS (1997) Parameterized duplication in strings: algorithms and an application to software maintenance. SIAM J Comput 26(5):1343–1362
  8. Baker BS (1999) Parameterized diff. In: Proceedings of the 10th annual ACM-SIAM symposium on discrete algorithms (SODA), Baltimore, pp 854–855
  9. Cole R, Hariharan R (2000) Faster suffix tree construction with missing suffix links. In: Proceedings of the 32nd ACM symposium on theory of computing (STOC), Portland, pp 407–415
  10. Fredriksson K, Mozgovoy M (2006) Efficient parameterized string matching. Inf Process Lett 100(3):91–96
  11. Harel D, Tarjan RE (1984) Fast algorithms for finding nearest common ancestors. SIAM J Comput 13:338–355
  12. Hazay C, Lewenstein M, Sokol D (2007) Approximate parameterized matching. ACM Trans Algorithms 3(3):29
  13. Hazay C, Lewenstein M, Tsur D (2005) Two dimensional parameterized matching. In: Proceedings of the 16th symposium on combinatorial pattern matching (CPM), Jeju Island, pp 266–279
  14. Idury RM, Schäffer AA (1996) Multiple matching of parameterized patterns. Theor Comput Sci 154(2):203–224
  15. Keller O, Kopelowitz T, Lewenstein M. Parameterized LCS and edit distance are NP-complete. Manuscript
  16. Kosaraju SR (1995) Faster algorithms for the construction of parameterized suffix trees. In: Proceedings of the 36th annual symposium on foundations of computer science (FOCS), Milwaukee, pp 631–637
  17. Landau GM, Vishkin U (1988) Fast string matching with k differences. J Comput Syst Sci 37(1):63–78
  18. McCreight EM (1976) A space-economical suffix tree construction algorithm. J ACM 23:262–272

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel