Abstract
In the longest common substring problem, we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one character. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature. In this paper, we first show a conditional lower bound based on the Strong Exponential Time Hypothesis (SETH), implying that there is little hope to improve existing solutions. We then introduce a new but closely related problem of the longest common substring with approximately k mismatches and use locality-sensitive hashing to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic-time 2-approximation algorithm for the longest common substring with k mismatches problem and show conditional hardness of improving its approximation ratio.
1 Introduction
Understanding how similar two strings are and what they share in common is a central task in stringology. The significance of this task is witnessed by the 50,000+ citations of the paper introducing BLAST [3], a heuristic algorithmic tool for comparing biological sequences. This task can be formalised in many different ways, from the longest common substring problem to the edit distance problem. The longest common substring problem can be solved in optimal linear time and space, while the best known algorithms for the edit distance problem require \(n^{2-o(1)}\) time, which makes the longest common substring problem an attractive choice for many practical applications. On the other hand, the longest common substring problem is not robust and its solution can vary greatly when the input strings are changed even by one character. To overcome this issue, a new problem called the longest common substring with k mismatches has recently been introduced. In this paper, we continue this line of research.
1.1 Related Work
Let us start with a precise statement of the longest common substring problem.
Problem 1.1
(LCS) Given two strings \(T_1, T_2\) of length n, find a maximum-length substring of \(T_1\) that occurs in \(T_2\).
The suffix tree of \(T_1\) and \(T_2\), a data structure containing all suffixes of \(T_1\) and \(T_2\), allows this problem to be solved in linear time and space [17, 21, 36], which is optimal as any algorithm needs \(\varOmega (n)\) time to read and \(\varOmega (n)\) space to store the strings. However, if we only account for “additional” space, that is, the space the algorithm uses apart from the space required to store the input, then the suffix-tree-based solution is not optimal and has been improved in a series of publications [5, 26, 32].
The major disadvantage of the longest common substring problem is that its solution is not robust. Consider, for example, two pairs of strings: \(a^{2m+1}, a^{2m} b\) and \(a^{m} b a^{m},a^{2m} b\). The longest common substring of the first pair of strings is almost twice as long as the longest common substring of the second pair of strings, although we changed only one character. This makes the longest common substring unsuitable to be used as a measure of similarity of two strings: Intuitively, changing one character must not change the measure of similarity much. To overcome this issue, it is natural to allow the substring to occur in \(T_1\) and \(T_2\) not exactly but with a small number of mismatches.
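The gap between the two pairs can be checked directly. The following Python sketch uses the textbook quadratic dynamic programme rather than the linear-time suffix tree solution; the function name `lcs_length` and the choice \(m=3\) are illustrative assumptions.

```python
def lcs_length(t1: str, t2: str) -> int:
    """Length of the longest common substring, by quadratic dynamic programming."""
    best = 0
    prev = [0] * (len(t2) + 1)
    for i in range(1, len(t1) + 1):
        cur = [0] * (len(t2) + 1)
        for j in range(1, len(t2) + 1):
            if t1[i - 1] == t2[j - 1]:
                # cur[j] = length of the longest common suffix of t1[:i] and t2[:j]
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


m = 3
# First pair: a^{2m+1} and a^{2m} b share a^{2m}.
first = lcs_length("a" * (2 * m + 1), "a" * (2 * m) + "b")
# Second pair: a^m b a^m and a^{2m} b share only a^m b.
second = lcs_length("a" * m + "b" + "a" * m, "a" * (2 * m) + "b")
print(first, second)
```

For general m the two values are 2m and m+1, which is the "almost twice as long" gap described above.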
Problem 1.2
(LCS with \({{k}}\) Mismatches) Given two strings \(T_1, T_2\) of length n and an integer k, find a maximum-length substring of \(T_1\) that occurs in \(T_2\) with at most k mismatches.
The problem can be solved in quadratic time and space by a dynamic-programming algorithm, but more efficient solutions have also been shown. The longest common substring with one mismatch problem was first considered in [6], where an \(O(n^2)\)-time and \(O(n)\)-space solution was given. This result was further improved by Flouri et al. [14], who showed an \(O(n \log n)\)-time and \(O(n)\)-space solution.
For a general value of k, the problem was first considered by Leimeister and Morgenstern [29], who suggested a greedy heuristic algorithm. Flouri et al. [14] showed that LCS with \({{k}}\) Mismatches admits a quadratic-time algorithm which takes constant (additional) space. Grabowski [16] presented two output-dependent algorithms with running times \(O(n ((k+1) (\ell _0+1))^k)\) and \(O(n^2 k/\ell _k)\), where \(\ell _0\) is the length of the longest common substring of \(T_1\) and \(T_2\) and \(\ell _k\) is the length of the longest common substring with k mismatches of \(T_1\) and \(T_2\). Thankachan et al. [35] gave an \(O(n)\)-space, \(O(n \log ^k n)\)-time solution for \(k=O(1)\). Very recently, Charalampopoulos et al. [10] extended the underlying techniques and developed an \(O(n)\)-time algorithm for the case of \(\ell _k = \varOmega (\log ^{2k+2} n)\). Finally, Abboud et al. [1] applied the polynomial method to develop a \(k^{1.5} n^2 / 2^{\varOmega (\sqrt{(\log n)/k})}\)-time randomised solution to the problem. In fact, their algorithm was developed for a more general problem of computing the longest common substring with k edits, but it can be adapted to LCS with \({{k}}\) Mismatches as well. The problem of computing the longest common substring with k edits was also considered in [34], where an \(O(n \log ^k n)\)-time solution was given for constant k.
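To make the quadratic-time, constant-additional-space bound concrete, here is a minimal Python sketch of one natural way to achieve it: slide a window along every diagonal (every fixed alignment of the two strings) and keep at most k mismatches inside the window. It is an illustration under that assumption, not necessarily the exact procedure of [14]; the name `lcs_k_mismatches` is hypothetical.

```python
def lcs_k_mismatches(t1: str, t2: str, k: int) -> int:
    """Length of the longest substring of t1 that occurs in t2 with at most k mismatches."""
    n1, n2 = len(t1), len(t2)
    best = 0
    # A diagonal fixes an alignment: t1[i] is compared with t2[i + shift].
    for shift in range(-(n1 - 1), n2):
        lo = max(0, -shift)
        hi = min(n1, n2 - shift)  # exclusive upper bound on i
        left = lo
        mismatches = 0
        for right in range(lo, hi):
            if t1[right] != t2[right + shift]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if t1[left] != t2[left + shift]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


print(lcs_k_mismatches("aaabaaa", "aaaaaab", 0))
print(lcs_k_mismatches("aaabaaa", "aaaaaab", 1))
```

Each of the fewer than 2n diagonals is scanned once with two pointers, so the sketch runs in quadratic time and uses only a constant number of counters besides the input.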
1.2 Our Contribution
Our contribution is as follows. In Sect. 2, we show that the existence of a strongly subquadratic-time algorithm for LCS with \({{k}}\) Mismatches on strings over a binary alphabet for \(k = \varOmega (\log n)\) refutes the Strong Exponential Time Hypothesis (SETH) of Impagliazzo, Paturi, and Zane [23, 24]; see also [11, Chapter 14]:
Hypothesis 1.3
(SETH) For every \(\delta > 0\), there exists an integer q such that SAT on q-CNF formulas with m clauses and n variables cannot be solved in \(m^{O(1)} 2^{(1-\delta ) n}\) time.
This conditional lower bound implies that there is little hope to improve existing solutions to LCS with \({{k}}\) Mismatches. We therefore introduce a new problem, inspired by the work of Andoni and Indyk [4].
Problem 1.4
(LCS with Approximately \({\textsf {k}}\) Mismatches) Two strings \(T_1, T_2\) of length n, an integer k, and a constant \(\varepsilon > 0\) are given. If \(\ell _k\) is the length of the longest common substring with k mismatches of \(T_1\) and \(T_2\), return a substring of \(T_1\) of length at least \(\ell _k\) that occurs in \(T_2\) with at most \((1+\varepsilon ) \cdot k\) mismatches.
Let \(d_H(S_1,S_2)\) denote the Hamming distance between equal-length strings \(S_1\) and \(S_2\), that is, the number of mismatches between them. Then we are to find the substrings \(S_1\) and \(S_2\) of \(T_1\) and \(T_2\), respectively, of length at least \(\ell _k\) such that \(d_H(S_1,S_2) \le (1+\varepsilon ) \cdot k\).
Although the problem statement is not standard, it makes perfect sense from the practical point of view. It is also more robust than the LCS with \({{k}}\) Mismatches problem, as for most applications it is not important whether a returned substring occurs in \(T_1\) and \(T_2\) with, for example, 10 or 12 mismatches. The result is also important from the theoretical point of view as it improves our understanding of the big picture of string comparison. In their work, Andoni and Indyk used the technique of locality-sensitive hashing to develop a space-efficient randomised index for a variant of the approximate pattern matching problem. We extend their work with new ideas in the construction and the analysis to develop a randomised subquadratic-time solution to Problem 1.4. This result is presented in Sect. 3.
In Sect. 4, we consider approximation algorithms for the length of the LCS with \({{k}}\) Mismatches. By applying previous techniques, we show a strongly subquadratic-time 2-approximation algorithm and show that no strongly subquadratic-time \((2-\varepsilon )\)-approximation algorithm exists for any \(\varepsilon >0\) unless SETH fails.
Finally, in Sect. 5 we show a strongly subcubic-time solution for LCS with \({{k}}\) Mismatches for all \({\textsf {k}}\) by reducing it (for arbitrary alphabet size) to Binary Jumbled Indexing. Namely, we show that LCS with \({{k}}\) Mismatches for all \(k = 1, \dots , n\) can be solved in \(O(n^{2.859})\) expected time or in \(O(n^{2.864})\) deterministic time, improving upon naive computation performed for every k separately.
2 LCS with \({{k}}\) Mismatches is SETH-Hard
Recall that the Hamming distance of two strings U and V of the same length, denoted as \(d_H(U,V)\), is simply the number of mismatches. Our proof is based on conditional hardness of the following problem.
Problem 2.1
(Orthogonal Vectors) Given a set A of N vectors from \(\{0,1\}^d\) each, does there exist a pair of vectors \(U,V \in A\) that is orthogonal, i.e., \(\sum _{h=1}^d U[h] V[h]=0\)?
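As a baseline for the hardness discussion, the following Python sketch solves Orthogonal Vectors by checking all pairs, which takes \(O(N^2 d)\) time. The function name `has_orthogonal_pair` and the tuple representation of vectors are illustrative assumptions.

```python
from itertools import combinations


def has_orthogonal_pair(vectors):
    """Return True if some pair of distinct vectors has inner product zero."""
    return any(
        all(a * b == 0 for a, b in zip(u, v))
        for u, v in combinations(vectors, 2)
    )


print(has_orthogonal_pair([(1, 0, 1), (0, 1, 0), (1, 1, 0)]))
print(has_orthogonal_pair([(1, 1, 0), (0, 1, 1), (1, 0, 1)]))
```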
Williams showed a conditional lower bound for an equivalent problem called cooperative subset queries [37, Section 5.1], which immediately implies the following fact:
Fact 2.2
Suppose there is \(\varepsilon >0\) such that for all constant c, Orthogonal Vectors on a set of N vectors of dimension \(d=c\log N\) can be solved in \(2^{o(d)} \cdot N^{2-\varepsilon }\) time. Then SETH is false.
We treat vectors from \(\{0,1\}^d\) as binary strings of length d. Let us introduce two morphisms, \(\mu \) and \(\tau \):
We will use the following two observations.
Observation 2.3
We have \(d_H(\mu (0),\tau (0))=d_H(\mu (0),\tau (1))=d_H(\mu (1),\tau (0))=1\), and \(d_H(\mu (1),\tau (1))=3\).
Observation 2.4
Let \(x,y,z \in \{0,1\}\). Then the string 1000 has exactly two occurrences in 1000xyz1000.
Let us also introduce a string gadget \(H=\gamma ^d\), where \(\gamma =100\,1000\). Note that \(\gamma \ne \mu (x)\) and \(\gamma \ne \tau (x)\) for \(x \in \{0,1\}\). Further, note that \(|H|=|\mu (U)|=|\tau (U)|=7d\) for any \(U \in A\).
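The two observations are easy to check mechanically. The Python sketch below does so for one hypothetical choice of 7-character images for \(\mu\) and \(\tau\); the concrete strings in `MU` and `TAU` are assumptions picked only so that the stated distances hold (each image is a 3-character prefix followed by 1000), and they should not be read as the definition used in the text.

```python
from itertools import product

# Hypothetical images: a 3-character prefix followed by the separator 1000.
MU = {"0": "0111000", "1": "0001000"}
TAU = {"0": "0011000", "1": "1111000"}
GAMMA = "1001000"  # gadget block, as given in the text


def hamming(a: str, b: str) -> int:
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def apply(morphism: dict, vector: str) -> str:
    """Apply a morphism character by character to a binary string."""
    return "".join(morphism[c] for c in vector)


def occurrences(pattern: str, text: str) -> list:
    return [i for i in range(len(text) - len(pattern) + 1) if text.startswith(pattern, i)]


# Observation 2.3: distances between images of single characters.
obs_2_3 = {(x, y): hamming(MU[x], TAU[y]) for x in "01" for y in "01"}

# Observation 2.4: 1000 occurs exactly twice in 1000xyz1000 for all x, y, z.
obs_2_4 = all(
    len(occurrences("1000", "1000" + "".join(xyz) + "1000")) == 2
    for xyz in product("01", repeat=3)
)

print(obs_2_3, obs_2_4)
```

With these images, \(d_H(\mu (U),\tau (V)) = d + 2\sum _h U[h]V[h]\), so the distance equals d exactly when U and V are orthogonal.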
Lemma 2.5
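Consider a set of vectors \(A=\{U_1,\dots ,U_N\}\) from \(\{0,1\}^d\), the strings
\[T_1 = H^q \mu (U_1) H^q \mu (U_2) H^q \cdots \mu (U_N) H^q, \qquad T_2 = H^q \tau (U_1) H^q \tau (U_2) H^q \cdots \tau (U_N) H^q\]
for some positive integer q, and \(k=d\). Then: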

(a) If the set A contains two orthogonal vectors, then the LCS with \({{k}}\) Mismatches problem for \(T_1\) and \(T_2\) has a solution of length at least \(\ell =(14q+7)d\).
(b) If the set A does not contain two orthogonal vectors, then all the solutions for the LCS with \({{k}}\) Mismatches problem for \(T_1\) and \(T_2\) have length smaller than \(\ell '=(7q+14)d\).
Proof
(a) Assume that \(U_i\) and \(U_j\) are a pair of orthogonal vectors. \(T_1\) contains a substring \(H^q \mu (U_i) H^q\) and \(T_2\) contains a substring \(H^q \tau (U_j) H^q\). Both substrings have length \(\ell \) and, by Observation 2.3, their Hamming distance is exactly \(k = d\).
(b) Assume to the contrary that there are indices a and b for which the substrings \(S_1=T_1[a,a+\ell '-1]\) and \(S_2=T_2[b,b+\ell '-1]\) have at most k mismatches. First, let us note that \(7 \mid a-b\). Indeed, otherwise \(S_1\) would contain at least \(\lfloor (\ell '-3)/7 \rfloor =(q+2)k-1\ge k+1\) substrings of the form 1000 which, by Observation 2.4, would not be aligned with substrings 1000 in \(S_2\). Hence, they would account for more than k mismatches between \(S_1\) and \(S_2\).
Let us call all the substrings of \(T_1\) and \(T_2\) that come from the 3-character prefixes of \(\mu (0)\), \(\mu (1)\), \(\tau (0)\), \(\tau (1)\), and \(\gamma \) the core substrings, with core substrings that come from \(\gamma \) being gadget core substrings. We have already established that the core substrings of \(S_1\) and \(S_2\) are aligned. Moreover, \(S_1\) and \(S_2\) contain at least \(\lfloor (\ell '-2)/7 \rfloor =(q+2)k-1\) core substrings each. Amongst every \((q+2)k-1\) consecutive core substrings in \(S_1\), some k consecutive must come from \(\mu (U_i)\) for some index i; a symmetric property holds for \(S_2\) and \(\tau (U_j)\). Moreover, as only the gadget core substrings in \(S_1\) and \(S_2\) can match exactly, at most k core substrings that are contained in \(S_1\) and \(S_2\) can be non-gadget. Hence, \(S_1\) and \(S_2\) contain exactly k non-gadget core substrings each. If they were not aligned, they would have produced more than k mismatches in total with the gadget core substrings.
Therefore, \(S_1\) and \(S_2\) must contain, as aligned substrings, \(\mu (U_i)[1,7d-4]\) and \(\tau (U_j)[1,7d-4]\) for some \(i,j \in \{1,\ldots ,N\}\), respectively. Hence, \(d_H(\mu (U_i), \tau (U_j)) \le k\). By Observation 2.3, we conclude that \(U_i\) and \(U_j\) are orthogonal. \(\square \)
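Part (a) of the lemma can be replayed on a toy instance. The sketch below assumes that \(T_1\) and \(T_2\) interleave the images of the vectors with copies of \(H^q\), starting and ending with \(H^q\), and it reuses hypothetical images `MU` and `TAU` that merely satisfy Observation 2.3; the helper names `build` and `witness` are also illustrative.

```python
# Hypothetical images satisfying Observation 2.3 (not taken from the text).
MU = {"0": "0111000", "1": "0001000"}
TAU = {"0": "0011000", "1": "1111000"}
GAMMA = "1001000"


def hamming(a: str, b: str) -> int:
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def build(vectors, morphism, q):
    """Assumed layout: H^q image(U_1) H^q image(U_2) H^q ... image(U_N) H^q."""
    d = len(vectors[0])
    gadget = (GAMMA * d) * q
    return gadget + "".join("".join(morphism[c] for c in u) + gadget for u in vectors)


def witness(vectors, i, j, q):
    """Length and Hamming distance of H^q mu(U_i) H^q versus H^q tau(U_j) H^q (0-based i, j)."""
    d = len(vectors[0])
    t1, t2 = build(vectors, MU, q), build(vectors, TAU, q)
    block = (q + 1) * 7 * d        # distance between consecutive copies of H^q
    length = (14 * q + 7) * d      # |H^q mu(U_i) H^q|
    s1 = t1[i * block: i * block + length]
    s2 = t2[j * block: j * block + length]
    return length, hamming(s1, s2)


vectors = ["101", "010", "110"]
print(witness(vectors, 0, 1, q=1))  # orthogonal pair: distance equals k = d
print(witness(vectors, 0, 2, q=1))  # non-orthogonal pair: distance exceeds k
```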
Theorem 2.6
Suppose there is \(\varepsilon > 0\) such that LCS with \({{k}}\) Mismatches can be solved in \(O(n^{2-\varepsilon })\) time on strings over binary alphabet for \(k = \varOmega (\log n)\). Then SETH is false.
Proof
The reduction of Lemma 2.5 with \(q=1\) constructs, for an instance of the Orthogonal Vectors problem with N vectors of dimension d, an equivalent instance of the LCS with \({{k}}\) Mismatches problem with strings of length \(n=7d(2N+1)\) and \(k=d\). Thus, assuming that LCS with \({{k}}\) Mismatches can be solved in \(O(n^{2-\varepsilon })\) time for \(k = \varOmega (\log n)\), the constructed instance can be solved in \(O(N^{2-\varepsilon }d^{O(1)})\) time if \(d = c \log N\). This, by Fact 2.2, contradicts SETH. \(\square \)
3 LCS with Approximately k Mismatches
In this section, we prove the following theorem.
Theorem 3.1
Let \(\varepsilon \in (0,2)\) and \(\delta \in (0,1)\) be arbitrary constants. The LCS with Approximately \({\textsf {k}}\) Mismatches problem can be solved in \(O(n^{1+1/(1+\varepsilon )} )\) space and \(O(n^{1+1/(1+\varepsilon )} \log ^2 n)\) time with error probability \(\delta \).
3.1 Overview of the Proof
The classic solution to the longest common substring problem is based on two observations. The first observation is that the longest common substring of \(T_1\) and \(T_2\) is in fact the longest common prefix of some suffix of \(T_1\) and some suffix of \(T_2\). The second observation is that the maximal length of the longest common prefix of a fixed suffix S of \(T_1\) and suffixes of \(T_2\) is reached by one of the two suffixes of \(T_2\) that are closest to S in the lexicographic order. This suggests the following algorithm: First, build a suffix tree of \(T_1\) and \(T_2\), which contains all suffixes of \(T_1\) and \(T_2\) ordered lexicographically. Second, compute the longest common prefix of each suffix of \(T_1\) and the two suffixes of \(T_2\) closest to S in the lexicographic order, one from the left and one from the right. The problem of computing the longest common prefix has been extensively studied in the literature and a number of very efficient deterministic and randomised solutions exist [7, 8, 12, 19, 22]; for example, one can use a Lowest Common Ancestor (LCA) data structure, which can be constructed in linear time and space and answers longest common prefix queries in \(O(1)\) time [12, 19].
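The two observations translate into a short program. The Python sketch below replaces the suffix tree by naive sorting of explicit suffixes and replaces LCA queries by character-by-character comparison, so it is far from linear time; it only illustrates that comparing each suffix of \(T_1\) with its nearest \(T_2\) neighbours in lexicographic order suffices. All names are illustrative.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcs_by_sorted_suffixes(t1: str, t2: str) -> int:
    # Tag every suffix with the string it comes from and sort lexicographically.
    suffixes = sorted(
        [(t1[i:], 1) for i in range(len(t1))] + [(t2[j:], 2) for j in range(len(t2))]
    )
    best = 0
    # One pass per direction: compare each T1 suffix with the closest T2 suffix seen so far.
    for order in (suffixes, suffixes[::-1]):
        nearest_t2 = None
        for suffix, source in order:
            if source == 2:
                nearest_t2 = suffix
            elif nearest_t2 is not None:
                best = max(best, lcp(suffix, nearest_t2))
    return best


print(lcs_by_sorted_suffixes("xabcy", "zabcw"))
```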
Our solution to the longest common substring with approximately k mismatches problem is somewhat similar. Instead of the lexicographic order, we will consider \(\varTheta (n^{1/(1+\varepsilon )})\) different orderings on the suffixes of \(T_1\) and \(T_2\). To define these orderings, we will use the locality-sensitive hashing technique, which was initially introduced for the needs of computational geometry [18] and later adapted for substrings with Hamming distance [4]. In more detail, we will choose \(\varTheta (n^{1/(1+\varepsilon )})\) hash functions, where each function can be considered as a projection of a string of length n onto a random subset of its positions. By choosing the size of the subset appropriately, we will be able to guarantee that the hash function is locality-sensitive: For any two strings at the Hamming distance at most k, the values of the hash functions on them will be equal with reasonably high probability, while the values of the hash functions on any pair of strings at the Hamming distance bigger than \((1+\varepsilon ) \cdot k\) will be equal with low probability. For each hash function, we will sort the suffixes of \(T_1\) and \(T_2\) by the lexicographic order on their hash values. As a corollary of the locality-sensitive property, if two suffixes of \(T_1\) and \(T_2\) have a long common prefix with at most k mismatches, they are likely to be close to each other in at least one of the orderings.
However, we will not be able to compute the longest common prefix with \((1+\varepsilon ) k\) mismatches for all candidate pairs of suffixes exactly (the best data structure, based on the kangaroo method [15, 28], has query time \(\varTheta ((1+\varepsilon )k)\) which is \(\varTheta (n)\) in the worst case). We will use this method for only one pair of suffixes chosen at random from a carefully preselected set of candidate pairs. For other candidate pairs, we will use \(\mathrm {LCP}_{\tilde{k}}\) queries. In an \(\mathrm {LCP}_{\tilde{k}}\) query, we are given two suffixes \(S_1, S_2\) of \(T_1\) and \(T_2\), respectively, and must output any integer \(\ell \) such that \(\mathrm {LCP}_{k}(S_1,S_2)\le \ell \le \mathrm {LCP}_{(1+\varepsilon )k}(S_1,S_2)\), where \(\mathrm {LCP}_{k}\) and \(\mathrm {LCP}_{(1+\varepsilon )k}\) denote the longest common prefix with at most k and at most \((1+\varepsilon )k\) mismatches, respectively. In Sect. 3.2, we show the following lemma based on the sketching techniques by Kushilevitz et al. [27]:
Lemma 3.2
For given k and \(\varepsilon \), after \(O(n \log ^3 n)\)-time and \(O(n \log ^2 n)\)-space preprocessing of strings \(T_1,T_2\), any \(\mathrm {LCP}_{\tilde{k}}\) query can be answered in \(O(\log ^2 n)\) time. With probability at least \(1-1/n^3\), the preprocessing produces a data structure that correctly answers all \(\mathrm {LCP}_{\tilde{k}}\) queries.
The key idea is to compute sketches for all power-of-two length substrings of \(T_1\) and \(T_2\). The sketches will have logarithmic length (so that we will be able to compare them very fast) and the Hamming distance between them will be roughly proportional to the Hamming distance between the original substrings. Once the sketches are computed, we use binary search to answer \(\mathrm {LCP}_{\tilde{k}}\) queries in polylogarithmic time.
3.2 Proof of Lemma 3.2
During the preprocessing stage, we compute sketches [27] of all substrings of the strings \(T_1\) and \(T_2\) of lengths \(\ell = 1, 2, 4, \ldots , 2^{\lfloor \log n \rfloor }\), which can be defined in the following way. Without loss of generality, assume that the alphabet is \(\varSigma = \{0,1,\ldots ,p-1\}\), where p is a prime number. For a fixed \(\ell \), choose \(\lambda = \lceil 3\ln n / \gamma ^2 \rceil \) vectors \(r_\ell ^i\) of length \(\ell \), where \(\gamma \) is a constant to be defined later, such that the values \(r_\ell ^i [j]\) across \(i = 1, 2, \dots , \lambda \) and \(j = 1, 2, \dots , \ell \) are independent and identically distributed so that for every \(a\in \varSigma \):
\[\mathrm {Pr}\big [r_\ell ^i[j] = a\big ] = \begin{cases} \frac{1}{2kp} & \text{if } a \ne 0,\\ 1 - \frac{p-1}{2kp} & \text{if } a = 0. \end{cases}\]
For a string X of length \(\ell \), we define the sketch \(\mathrm {sk}(X)\) to be a vector of length \(\lambda \), where \(\mathrm {sk}(X) [i] = r_\ell ^i \cdot X \pmod p\). For each \(i = 1, 2, \ldots , \lambda \), we compute the inner product of \(r_\ell ^i\) with all length-\(\ell \) substrings of \(T_1\) and \(T_2\) in \(O(n \log n)\) time by running the Fast Fourier Transform (FFT) algorithm in the field \(\mathbb {Z}_p\) [13]. As a result, we obtain the sketches of each length-\(\ell \) substring of \(T_1\) and \(T_2\). We repeat this step for all specified values of \(\ell \). One instance of the FFT algorithm takes \(O(n \log n)\) time, and we run an instance for each \(i = 1, 2, \ldots , \lambda \) and for each \(\ell = 1, 2, 4, \ldots , 2^{\lfloor \log n \rfloor }\), which takes \(O(n \log ^3 n)\) time in total. The sketches occupy \(O(n \log ^2 n)\) space. Each string S can be decomposed uniquely as \(X_1 X_2 \ldots X_g\), where \(g = O(\log n)\) and \(|X_1|> |X_2|> \ldots > |X_g|\) are powers of two; we define a sketch \(\mathrm {sk}(S) = \sum _q \mathrm {sk}(X_q) \pmod p\). Let \(\delta _1 = \frac{p-1}{p} (1 - (1-\frac{1}{2k})^{k})\) and \(\delta _2 = \frac{p-1}{p} (1 - (1-\frac{1}{2k})^{(1+\varepsilon ) \cdot k})\).
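A small Python sketch may help to see how a sketch of an arbitrary string is assembled from power-of-two blocks. It computes inner products naively instead of using FFT and handles a single string rather than all substrings of \(T_1\) and \(T_2\); the entries of the random vectors are drawn by sampling a position with probability \(1/(2k)\) and then choosing a uniform value of \(\mathbb {Z}_p\), following the interpretation used in the proof of Lemma 3.3. The names `sample_vectors`, `block_sketch`, `sketch` and the toy parameters are assumptions.

```python
import random


def sample_vectors(length, lam, k, p, rng):
    """lam random vectors of the given length over Z_p."""
    def entry():
        # Keep the position with probability 1/(2k); a kept position gets a uniform value.
        return rng.randrange(p) if rng.random() < 1 / (2 * k) else 0
    return [[entry() for _ in range(length)] for _ in range(lam)]


def block_sketch(x, vectors, p):
    """Sketch of a block x whose length matches the vectors: one inner product per vector."""
    return [sum(r[j] * x[j] for j in range(len(x))) % p for r in vectors]


def sketch(s, vectors_by_length, p):
    """Sum (mod p) of the sketches of the power-of-two blocks of s, largest block first."""
    lam = len(next(iter(vectors_by_length.values())))
    total = [0] * lam
    start = 0
    for bit in reversed(range(len(s).bit_length())):
        size = 1 << bit
        if len(s) & size:
            part = block_sketch(s[start:start + size], vectors_by_length[size], p)
            total = [(a + b) % p for a, b in zip(total, part)]
            start += size
    return total


rng = random.Random(0)
p, k, lam, n = 5, 2, 8, 13
vectors_by_length = {1 << b: sample_vectors(1 << b, lam, k, p, rng) for b in range(n.bit_length())}
s1 = [rng.randrange(p) for _ in range(n)]
print(sketch(s1, vectors_by_length, p))
```

Because every coordinate is an inner product modulo p, the sketch is linear: the sketches of two strings differ in a coordinate only if the corresponding random vectors hit at least one mismatch position.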
Lemma 3.3
(see [27]) Let \(S_1,S_2\) be strings of the same length. For each \(i=1,\ldots ,\lambda \):

if \(d_H(S_1, S_2) \le k\), then \(\mathrm {sk}(S_1)[i] \ne \mathrm {sk}(S_2)[i]\) with probability at most \(\delta _1\);

if \(d_H(S_1, S_2) \ge (1+\varepsilon ) \cdot k\), then \(\mathrm {sk}(S_1)[i] \ne \mathrm {sk}(S_2)[i]\) with probability at least \(\delta _2\).
Proof
We use a different interpretation of \(r_\ell ^i\) that defines the same distribution. We start with the zero vector and sample positions with probability \(\frac{1}{2k}\). For each sampled position j, we decide on the value \(r_{\ell }^i[j]\in \varSigma \) independently and uniformly at random. Let \(m = d_H(S_1, S_2)\) and \(a_1, \ldots , a_m\) be the positions of the mismatches between the two strings. If none of the positions \(a_1, \ldots , a_m\) are sampled, then \(\mathrm {sk}(S_1)[i] = \mathrm {sk}(S_2)[i]\). Otherwise, if \(a_{j_1},\ldots ,a_{j_g}\) are sampled, for each \(r_{\ell }^i[a_{j_1}], \ldots , r_{\ell }^i[a_{j_{g-1}}]\) exactly one of the p choices of \(r_{\ell }^i[a_{j_g}]\) results in \(\mathrm {sk}(S_1)[i] = \mathrm {sk}(S_2)[i]\) (because p is prime). Hence, the probability that \(\mathrm {sk}(S_1)[i] \ne \mathrm {sk}(S_2)[i]\) is equal to \(\frac{p-1}{p}(1-(1 - \frac{1}{2k})^{m})\), which is at most \(\delta _1\) if \(d_H(S_1, S_2) \le k\), and at least \(\delta _2\) if the Hamming distance is at least \((1+\varepsilon ) \cdot k\). \(\square \)
We set \(\varDelta = \frac{\delta _1+\delta _2}{2} \cdot \lambda \) and \(\gamma = \frac{\delta _2-\delta _1}{2}\). Observe that
\[\gamma = \frac{p-1}{2p}\Big (1-\frac{1}{2k}\Big )^{k}\Big (1-\Big (1-\frac{1}{2k}\Big )^{\varepsilon k}\Big ) \ge \frac{p-1}{4p}\big (1-e^{-\varepsilon /2}\big )\]
because \((1-\frac{1}{2k})^k\) is an increasing function of k bounded from above by \(e^{-1/2}\). Consequently, if \(\varepsilon \) is a constant, then \(\gamma \) is a constant as well.
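The closed form in the proof can be confirmed exactly on tiny instances. The following Python sketch enumerates all values of the random vector on the mismatch positions, weights them by the entry distribution implied by the sampling interpretation (a nonzero value has probability \(1/(2kp)\)), and compares the result with \(\frac{p-1}{p}(1-(1-\frac{1}{2k})^{m})\) using exact rational arithmetic. The function names and the test instances are illustrative.

```python
from fractions import Fraction
from itertools import product


def prob_sketch_differs(p, k, diffs):
    """Exact Pr[sum_j r[j] * diffs[j] != 0 (mod p)] over the mismatch positions.

    diffs[j] is the nonzero difference S1[a_j] - S2[a_j] modulo the prime p.
    """
    q_nonzero = Fraction(1, 2 * k * p)      # probability of each nonzero value
    q_zero = 1 - (p - 1) * q_nonzero        # probability of the value 0
    total = Fraction(0)
    for r in product(range(p), repeat=len(diffs)):
        weight = Fraction(1)
        for value in r:
            weight *= q_zero if value == 0 else q_nonzero
        if sum(v * d for v, d in zip(r, diffs)) % p != 0:
            total += weight
    return total


def closed_form(p, k, m):
    return Fraction(p - 1, p) * (1 - (1 - Fraction(1, 2 * k)) ** m)


print(prob_sketch_differs(3, 2, (1, 2, 1)), closed_form(3, 2, 3))
```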
Lemma 3.4
For all strings \(S_1\) and \(S_2\) of the same length, the following claims hold with probability at least \(1-n^{-6}\):

if \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2))>\varDelta \), then \(d_H(S_1, S_2)>k\);

if \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2)) \le \varDelta \), then \(d_H(S_1,S_2) < (1+\varepsilon ) \cdot k\).
Proof
Let \(\chi _i\) be an indicator random variable that is equal to one if and only if \(\mathrm {sk}(S_1) [i] \ne \mathrm {sk}(S_2) [i]\). The claim follows immediately from Lemma 3.3 and the following Chernoff–Hoeffding bounds [20, Theorem 1]. For \(\lambda \) independently and identically distributed binary variables \(\chi _1, \chi _2, \ldots , \chi _\lambda \), we have
\[\mathrm {Pr}\Big [\tfrac{1}{\lambda }\sum _{i=1}^{\lambda }\chi _i > \mu + \gamma \Big ] \le e^{-2\lambda \gamma ^2} \quad \text {and}\quad \mathrm {Pr}\Big [\tfrac{1}{\lambda }\sum _{i=1}^{\lambda }\chi _i \le \mu - \gamma \Big ] \le e^{-2\lambda \gamma ^2},\]
where \(\mu = \mathrm {Pr}[\chi _i = 1]\). Recall that \(\gamma = \frac{\delta _2 - \delta _1}{2}\) and \(\lambda = \lceil 3\ln n / \gamma ^2 \rceil \), so we obtain that the error probability is at most \(e^{-2 \lambda \gamma ^2} \le n^{-6}\).
If \(d_H(S_1,S_2) \le k\), Lemma 3.3 asserts that \(\mu \le \delta _1\). By the first of the above inequalities, we have that \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2)) \le \varDelta \) with probability at least \(1-n^{-6}\). Hence, if \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2)) > \varDelta \), then \(d_H(S_1,S_2) > k\) with the same probability.
If \(d_H(S_1,S_2) \ge (1+\varepsilon ) \cdot k\), Lemma 3.3 asserts that \(\mu \ge \delta _2\). By the second inequality, we have that \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2)) > \varDelta \) with probability at least \(1-n^{-6}\). Hence, if \(d_H(\mathrm {sk}(S_1),\mathrm {sk}(S_2)) \le \varDelta \), then \(d_H(S_1,S_2) < (1+\varepsilon )\cdot k\) with the same probability. \(\square \)
Suppose we wish to answer an \(\mathrm {LCP}_{\tilde{k}}\) query on two suffixes \(S_1, S_2\). It suffices to find the longest prefixes of \(S_1, S_2\) such that the Hamming distance between their sketches is at most \(\varDelta \). As mentioned above, these prefixes can be represented uniquely as a concatenation of strings of power-of-two lengths \(\ell _1> \ell _2> \ldots > \ell _g\). To compute \(\ell _1\), we initialise it with the biggest power of two not exceeding n and compute the Hamming distance between the sketches of the corresponding substrings. If it does not exceed \(\varDelta \), we have found \(\ell _1\); otherwise, we divide \(\ell _1\) by two and continue. Suppose that we already know \(\ell _1, \ell _2, \ldots , \ell _i\) and the sketches \(\mathrm {sk}(S_1[1,d_i])\) and \(\mathrm {sk}(S_2[1,d_i])\), where \(d_i = \ell _1+\cdots + \ell _i\). To determine \(\ell _{i+1}\), we initialise it with \(\frac{1}{2}\ell _i\) and then divide it by two until \(d_H(\mathrm {sk}(S_1[1,d_i+\ell _{i+1}]),\mathrm {sk}(S_2[1,d_i+\ell _{i+1}]))\le \varDelta \). These two sketches can be computed in \(O(\lambda )=O(\log n)\) time by combining \(\mathrm {sk}(S_1[1,d_i])\) and \(\mathrm {sk}(S_2[1,d_i])\) with the precomputed sketches \(\mathrm {sk}(S_1[d_i+1,d_i+\ell _{i+1}])\) and \(\mathrm {sk}(S_2[d_i+1,d_i+\ell _{i+1}])\), respectively. Consequently, the query procedure takes \(O(\log ^2 n)\) time. It errs on at least one query with probability at most \(n^{-3}\) (Lemma 3.4 is only applied for pairs of same-length substrings of \(T_1\) and \(T_2\), so we estimate error probability by the union bound). This completes the proof of Lemma 3.2.
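The search over power-of-two lengths can be sketched independently of the sketches themselves. In the Python code below, the acceptance test is a plain predicate on the prefix length; as a stand-in for the sketch comparison with threshold \(\varDelta \), the example uses the exact test "at most k mismatches", which is a swapped-in simplification. The names `greedy_prefix_length` and `accept` are assumptions.

```python
def greedy_prefix_length(max_len, accept):
    """Build the prefix length as a sum of decreasing powers of two.

    accept(length) says whether the prefixes of this length pass the test;
    in the text this is a comparison of two sketches against the threshold.
    """
    step = 1
    while step * 2 <= max_len:
        step *= 2                      # biggest power of two not exceeding max_len
    d = 0
    while step >= 1:
        if d + step <= max_len and accept(d + step):
            d += step                  # this power of two is the next block length
        step //= 2
    return d


def mismatches(a, b, length):
    return sum(x != y for x, y in zip(a[:length], b[:length]))


s1, s2, k = "abcdefgh", "abXdeYgh", 1
answer = greedy_prefix_length(min(len(s1), len(s2)), lambda L: mismatches(s1, s2, L) <= k)
print(answer)
```

Each power of two is tried at most once, so the procedure makes \(O(\log n)\) acceptance tests; with sketches of length \(\lambda = O(\log n)\) this matches the \(O(\log ^2 n)\) query time stated above.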
3.3 Proof of Theorem 3.1
We start by preprocessing \(T_1\) and \(T_2\) as described in Lemma 3.2. In the main phase of the algorithm, we construct a family \(\mathbf {H}\) of hash functions based on four parameters \(m,s,t,w\in \mathbb {Z}\) to be specified later.
Let \(\varPi \) be the set of all projections of strings of length n onto a single position, i.e. the value \(\pi _i(S)\) of the i-th projection on a string S is simply its i-th character S[i]. More generally, for a string S of length n and a function \(h=(\pi _{a_1},\ldots ,\pi _{a_q}) \in \varPi ^q\), we define h(S) as \(S[a_{p_1}] S[a_{p_2}] \cdots S[a_{p_q}]\), where p is a permutation such that \(a_{p_1}\le \cdots \le a_{p_q}\). If \(|S|<n\), we define \(h(S) := h(S \cdot \$^{n-|S|})\), where \( \$\notin \varSigma \) is a special gap-filling character.
Each hash function \(h\in \mathbf {H}\) is going to be a uniformly random element of \(\varPi ^{mt}\); however, the individual hash functions are not chosen independently in order to ensure faster running time for the algorithm. Nevertheless, \(\mathbf {H}\) will be composed of s independent subfamilies \(\mathbf {H}_i\), each of size \(\binom{w}{t}\). To construct \(\mathbf {H}_i\), we choose w functions \(u_{i,1},\ldots ,u_{i,w}\in \varPi ^m\) independently and uniformly at random. Each hash function \(h\in \mathbf {H}_i\) is defined as an unordered t-tuple of distinct functions \(u_{i,r}\). Formally,
\[\mathbf {H}_i = \{(u_{i,r_1},u_{i,r_2},\ldots ,u_{i,r_t}) : 1\le r_1< r_2< \cdots < r_t\le w\}.\]
Consider the set of all suffixes \(S_1, S_2, \ldots , S_{2n}\) of \(T_1\) and \(T_2\). For each \(h\in \mathbf {H}\), we define an ordering \(\prec _h\) of the suffixes \(S_1,\ldots ,S_{2n}\) according to the lexicographic order of the values \(h(S_j)\) of the hash function and, in case of ties, according to the lengths \(|S_j|\). To construct it, we build a compact trie on strings \(h(S_1), h(S_2), \ldots , h(S_{2n})\).
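The definition of h(S) is compact, so a direct transcription may be useful. The Python sketch below samples q positions with repetition, sorts them, pads short strings with the gap-filling character, and reads off the projected characters; positions are 1-based as in the text, and the helper names are assumptions.

```python
import random

PAD = "$"  # gap-filling character, assumed not to occur in the alphabet


def sample_projection(n, q, rng):
    """q positions drawn uniformly from 1..n with repetition (an element of Pi^q)."""
    return [rng.randrange(1, n + 1) for _ in range(q)]


def apply_projection(positions, s, n):
    """h(S): characters of S at the sorted positions, after padding S to length n."""
    padded = s + PAD * (n - len(s))
    return "".join(padded[a - 1] for a in sorted(positions))


print(apply_projection([5, 3, 1, 3], "abc", 5))

rng = random.Random(1)
h = sample_projection(10, 4, rng)
print(h, apply_projection(h, "abracadabr", 10))
```

Sorting the positions is what makes the hash values comparable with prefixes: a shorter string agrees with a longer one on every sampled position that falls inside their common prefix.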
Theorem 3.5
Functions \(u_{i,r}\) for \(i=1,\ldots ,s\) and \(r=1,\ldots ,w\) can be preprocessed in \(O(n^{4/3}\log ^{4/3} n)\) time and \(O(n)\) space each, i.e., in \(O(sw n^{4/3}\log ^{4/3} n)\) time and \(O(sw n)\) space in total, so that afterwards, for each \(h\in \mathbf {H}\), a trie on \(h(S_1), \ldots , h(S_{2n})\) can be constructed in \(O(tn\log n)\) time and \(O(n)\) space. The preprocessing errs with probability \(O(1/n)\) for each \(u_{i,r}\), i.e., \(O(sw/n)\) in total.
Let us defer the proof of the theorem until we complete the description of the algorithm and derive Theorem 3.1. We preprocess functions \(u_{i,r}\) and build a trie on \(h(S_1), \dots , h(S_{2n})\) for each \(h\in \mathbf {H}_i\). We then augment the trie with an LCA data structure, which can be done in linear time and space [12, 19]. The latter can be used to find in constant time the longest common prefix of any two strings \(h(S_j)\) and \(h(S_{j'})\).
Consider a function \(h\in \mathbf {H}\) and a positive integer \(\ell \le n\). We define \(h\big |_{[\ell ]}\) so that
\[h\big |_{[\ell ]}(S) = h(S[1,\min (\ell ,|S|)]).\]
In other words, if h is a projection onto positions from a multiset \(\mathsf {P}\), then \(h\big |_{[\ell ]}\) is a projection onto positions from the multiset \(\{p\in \mathsf {P}: p \le \ell \}\), extended with $’s to length mt. Consequently, \(h\big |_{[\ell ]}(S)=h\big |_{[\ell ]}(S')\) if and only if the longest common prefix of h(S) and \(h(S')\) is at least \(|\{p\in \mathsf {P}: p \le \ell \}|\) characters long.
We define the family of collisions \(\mathbf {C}^{\mathbf {H}}_\ell \) as a set of triples \((S,S',h)\) such that S and \(S'\) are suffixes of \(T_1\) and \(T_2\), respectively, both of length at least \(\ell \), and \(h\in \mathbf {H}\) is such that the suffixes collide on \(h\big |_{[\ell ]}\), that is, \(h\big |_{[\ell ]}(S)=h\big |_{[\ell ]}(S')\). Note that the families of collisions are nested: \(\mathbf {C}^{\mathbf {H}}_{0}\supseteq \cdots \supseteq \mathbf {C}^{\mathbf {H}}_{\ell }\supseteq \mathbf {C}^{\mathbf {H}}_{\ell +1}\supseteq \cdots \supseteq \mathbf {C}^{\mathbf {H}}_{n}\).
For a fixed function h, we define the \(\ell \)-neighbourhood of S as the set of suffixes \(S'\) of \(T_2\) such that \((S,S',h)\in \mathbf {C}^{\mathbf {H}}_\ell \). We observe that the \(\ell \)-neighbourhood of S forms a contiguous range in the sequence of suffixes of \(T_2\) ordered according to \(\prec _h\), and this range can be identified in \(O(\log n)\) time using binary search and LCA queries on the trie constructed for h. Consequently, an \(O(n|\mathbf {H}|)\)-space representation of \(\mathbf {C}^{\mathbf {H}}_\ell \), with one range for every \(\ell \)-neighbourhood of each suffix S, can be constructed in \(O(n|\mathbf {H}|\log n)\) time.
In the algorithm, we find the largest \(\ell \) such that \(|\mathbf {C}^{\mathbf {H}}_{\ell }|\ge 2n|\mathbf {H}|\); using a binary search, this takes \(O(n|\mathbf {H}|\log ^2 n)\) time. For each \((S,S',h)\in \mathbf {C}^{\mathbf {H}}_{\ell +1}\), we compute the longest common prefix with approximately k mismatches \(\mathrm {LCP}_{\tilde{k}}(S,S')\) (Lemma 3.2). Additionally, we pick a single element \((\bar{S},\bar{S}',\bar{h})\in \mathbf {C}^{\mathbf {H}}_{\ell }\) uniformly at random and compute the longest common prefix with at most \((1+\varepsilon )k\) mismatches \(\mathrm {LCP}_{(1+\varepsilon )k}(\bar{S},\bar{S}')\) naively in \(O(n)\) time. The longest of the retrieved prefixes is returned as an answer.
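For a single hash function, the number of collisions at level \(\ell \) can be counted by grouping suffixes on their restricted hash values. The Python sketch below does this with dictionaries instead of the trie and LCA queries used in the text, so it illustrates the definition and the nesting of the collision families rather than the stated running time. The helper names and the toy input are assumptions; positions are 1-based.

```python
from collections import Counter


def restricted_hash(suffix, positions, ell):
    """Characters of the suffix at the sampled positions that are at most ell.

    The padding with $ is omitted: for a fixed ell it has the same length for every suffix.
    """
    return tuple(suffix[p - 1] for p in positions if p <= ell)


def count_collisions(t1, t2, positions, ell):
    """Number of pairs (S, S') of suffixes of length >= ell that collide at level ell."""
    positions = sorted(positions)
    c1 = Counter(restricted_hash(t1[i:], positions, ell) for i in range(len(t1) - ell + 1))
    c2 = Counter(restricted_hash(t2[j:], positions, ell) for j in range(len(t2) - ell + 1))
    return sum(c1[key] * c2[key] for key in c1)


counts = [count_collisions("abab", "baba", [1, 3], ell) for ell in range(5)]
print(counts)
```

The counts are non-increasing in \(\ell \), which is what allows the algorithm to locate the threshold level by binary search.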
The algorithm is summarised in the pseudocode above. We will now proceed to the analysis of its complexity and correctness.
3.4 Complexity and Correctness
To ensure the complexity bounds and correctness of the algorithm, we must carefully choose the parameters s, t, w, and m. Let \(p_1 = 1 - k / n\), \(p_2 = 1 - (1+\varepsilon ) \cdot k / n\), and \(\rho =\log p_1 / \log p_2\). The intuition behind these values is that if S and \(S'\) are two strings of length n and \(d_H(S,S')\le k\), then \(p_1\) is a lower bound for the probability of \(S[i]=S'[i]\) for a uniformly random position i. On the other hand, \(p_2\) is an upper bound for the same probability if \(d_H(S,S')\ge (1+\varepsilon )\cdot k\). Based on these values, we define
We assume that \((1+\varepsilon )k<n\) in order to guarantee \(p_1>p_2>0\). Note that if \((1+\varepsilon )k \ge n\), the problem is trivial.
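The quantities \(p_1\), \(p_2\), and \(\rho \) are easy to evaluate numerically. The Python sketch below computes them and also the probability that two strings at a given Hamming distance agree on q independently sampled positions; the function names and the sample values of n, k, and \(\varepsilon \) are assumptions. By Bernoulli's inequality, \(\rho \le 1/(1+\varepsilon )\) whenever \((1+\varepsilon )k<n\), which the example checks for one instance.

```python
import math


def lsh_parameters(n, k, eps):
    """Return (p1, p2, rho) for strings of length n, assuming (1 + eps) * k < n."""
    if (1 + eps) * k >= n:
        raise ValueError("the problem is trivial when (1 + eps) * k >= n")
    p1 = 1 - k / n
    p2 = 1 - (1 + eps) * k / n
    rho = math.log(p1) / math.log(p2)
    return p1, p2, rho


def collision_probability(n, distance, q):
    """Probability that q uniformly random positions all avoid the mismatches."""
    return (1 - distance / n) ** q


n, k, eps = 1000, 10, 1.0
p1, p2, rho = lsh_parameters(n, k, eps)
print(p1, p2, rho)
print(collision_probability(n, k, 200), collision_probability(n, (1 + eps) * k, 200))
```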
3.4.1 Complexity
To show the complexity of the algorithm, we will start with a simple observation and a more involved fact.
Observation 3.6
We have \(s = n^{o(1)}\) and \(w=n^{o(1)}\).
Proof
First, observe
Similarly,
Moreover, \(p_1 > p_2\) yields \(\log p_1 > \log p_2\) and therefore \(\rho = \frac{\log p_1}{\log p_2} < 1\). Consequently, \(w = O(\log n) + 2^{O(\sqrt{\log n})} = n^{o(1)}\), which concludes the proof. \(\square \)
Fact 3.7
We have \(|\mathbf {H}| = O(n^{1/(1+\varepsilon )})\).
Proof
Observe that \(|\mathbf {H}| = s \binom{w}{t} = O(t! \binom{w}{t}) = O(w^t)\). To estimate the latter, we consider two cases. If \(w \le 3t^3\), then
Otherwise, \(p_1^{-m} \ge w-1-t^2 \ge 3t^3-t^2\ge t^3+t\). Consequently,
Thus, it suffices to prove that \(p_1^{-mt} = O(n^{1/(1+\varepsilon )})\). We have
Moreover, due to \((1+\varepsilon )k < n\) and \(\varepsilon = \varTheta (1)\), we have
On the other hand, taking the Taylor expansion of \(f(x)=\tfrac{\log (1-x)}{\log (1-(1+\varepsilon )x)}\), which is concave for \(0\le x < \frac{1}{1+\varepsilon }\), we obtain
Consequently,
Thus, \(p_1^{-mt} \le n^{1/(1+\varepsilon )}\) holds for sufficiently large n and therefore \(|\mathbf {H}| = O(w^t) = O(e \cdot p_1^{-mt}) = O(n^{1/(1+\varepsilon )}),\) which concludes the proof. \(\square \)
Lemma 3.8
The running time of the algorithm is \(O(n^{1+1/(1+\varepsilon )} \log ^2 n)\).
Proof
Preprocessing for \(\mathrm {LCP}_{\tilde{k}}\) queries takes \(O(n \log ^3 n)\) time (Lemma 3.2), whereas functions \(u_{i,r}\) are processed in \(O(ws\cdot n^{4/3} \log ^{4/3} n)\) overall time using Theorem 3.5. Afterwards, for each hash function \(h\in \mathbf {H}\) we can build a trie and an LCA data structure on strings \(h(S_1),\ldots ,h(S_{2n})\) in \(O(t n \log n)\) time, which is \(O(|\mathbf {H}|tn\log n)\) in total. Next, the value \(\ell \) and the family \(\mathbf {C}_{\ell +1}^{\mathbf {H}}\) are computed in \(O(|\mathbf {H}|n\log ^2 n)\) time. The time for the \(|\mathbf {C}_{\ell +1}^{\mathbf {H}}| < 2 n |\mathbf {H}|\) \(\mathrm {LCP}_{\tilde{k}}\) queries is bounded by the same function. Finally, we answer one \(\mathrm {LCP}_{(1+\varepsilon )k}\) query, which takes \(O(n)\) time. The overall running time is \(O(n^{4/3+o(1)} + n^{1+1/(1+\varepsilon )}\log ^2 n)\) due to Observation 3.6 and Fact 3.7. We can hide the first term because \(\varepsilon < 2\) implies \(1/(1+\varepsilon ) > 1/3\). \(\square \)
Lemma 3.9
The space complexity of the algorithm is \(O(n^{1+1/(1+\varepsilon )})\).
Proof
The data structure for \(\mathrm {LCP}_{\tilde{k}}\) queries requires \(O(n \log ^2 n)\) space. Preprocessing functions \(u_{i,r}\) requires \(O(sw n) = O(n^{1+o(1)})\) space and the tries occupy \(O(|\mathbf {H}| \cdot n) = O(n^{1+1/(1+\varepsilon )})\) space. \(\square \)
3.4.2 Correctness
First, let us focus on two suffixes which yield the longest common substring with exactly k mismatches.
Lemma 3.10
Let S and \(S'\) be suffixes of \(T_1\) and \(T_2\), respectively, that maximise \(\mathrm {LCP}_{k}(S,S')\), i.e., such that \(\mathrm {LCP}_{k}(S,S')=\ell _k\). For each \(i\in \{1,\ldots ,s\}\), with probability \(\varOmega (1/t!)\) there exists \(h\in \mathbf {H}_i\) such that \(h\big |_{[\ell _k]}(S)=h\big |_{[\ell _k]}(S')\).
Proof
By definition of \(\ell _k\), we have \(d_H(S[1,\ell _k],S'[1,\ell _k])\le k\). Moreover, for any hash function h we have that \(h\big |_{[\ell _k]} (S) = h\big |_{[\ell _k]}(S')\) if and only if \(h(S[1,\ell _k]) = h(S'[1,\ell _k])\). Let us recall that each hash function \(h\in \mathbf {H}_i\) is a t-tuple of functions \(u_{i,r}\in \varPi ^m\). Consequently, \(h(S[1,\ell _k]) = h(S'[1,\ell _k])\) for some \(h\in \mathbf {H}_i\) if and only if the strings \(S[1,\ell _k] \$^{n-\ell _k}\) and \(S'[1,\ell _k] \$^{n-\ell _k}\) collide on at least t out of w functions \(u_{i,r}\). We shall give a lower bound on the probability \(\mu \) of this event. Individual collisions are independent and each of them holds with the same probability \(q=p_1^m\). Moreover, \(\mu \) may only increase as we increase q, so we can replace q by a lower bound \(\frac{1}{w}\). (Note that \(w = t^2+ \lceil {p_1^{-m}}\rceil \ge \lceil {q^{-1}}\rceil \ge q^{-1}\).) We have
Hence,
where the latter is true because \(w\ge t^2\) and \(w\ge 2\). Consequently, \(\mu = \varOmega (1/t!)\). \(\square \)
As a corollary, we can choose a constant in the number of steps \(s=\varTheta (t!)\) so that \((S,S',h)\in \mathbf {C}^{\mathbf {H}}_{\ell _k}\) for some \(h\in \mathbf {H}\) holds with probability at least \(\frac{3}{4}\). If additionally \(\ell _k > \ell \), then \((S,S',h) \in \mathbf {C}^{\mathbf {H}}_{\ell +1}\), so \(\mathrm {LCP}_{\tilde{k}}(S,S')\) will be called and with high probability will return a substring of length \(\ge \ell _k\). Otherwise, \(|\mathbf {C}_{\ell _k}^{\mathbf {H}}|\ge 2n|\mathbf {H}|\) and we claim that a uniformly random \((\bar{S},\bar{S}',\bar{h})\in \mathbf {C}_{\ell }^{\mathbf {H}}\) satisfies \(\mathrm {LCP}_{(1+\varepsilon )k}(\bar{S},\bar{S}')\ge \ell \ge \ell _k\) with probability at least \(\frac{1}{2}\). To prove this, we first introduce a family \(\mathbf {B}^{\mathbf {H}}\) of bad collisions: triples \((S,S',h)\) which belong to \(\mathbf {C}_{\ell }^{\mathbf {H}}\) for some \(\ell >\mathrm {LCP}_{(1+\varepsilon )k}(S,S')\), and bound its expected size.
Lemma 3.11
The expected number of bad collisions satisfies \(\mathbb {E}[|\mathbf {B}^{\mathbf {H}}|] \le n|\mathbf {H}|\).
Proof
Let us bound the probability that \((S,S',h)\in \mathbf {B}^{\mathbf {H}}\) for fixed suffixes S and \(S'\) (of \(T_1\) and \(T_2\), respectively) and fixed \(h=(u_{i,r_1},\ldots ,u_{i,r_t})\). Equivalently, we shall bound \(\mathrm {Pr}[(S,S',h)\in \mathbf {C}^{\mathbf {H}}_{\ell }]\) for \(\ell =\mathrm {LCP}_{(1+\varepsilon )k}(S,S')+1\).
If \(|S|<\ell \) or \(|S'|<\ell \), the probability is 0 by the definition of \(\mathbf {C}^{\mathbf {H}}_{\ell }\). Otherwise, we observe that \(d_H(S[1,\ell ],S'[1,\ell ])>(1+\varepsilon )k\) and that h can be considered (due to its marginal distribution) as a projection onto mt uniformly random positions. Therefore,
where the last inequality follows from the definition of m, which yields \(mt \ge \log _{p_2} \frac{1}{n}\).
In total, we have \(n^2 |\mathbf {H}|\) possible triples \((S,S',h)\), so by linearity of expectation, we conclude that the expected number of bad collisions is at most \(\frac{1}{n}\cdot n^2 |\mathbf {H}|=n|\mathbf {H}|\). \(\square \)
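The probability bound used in the proof above can be spelled out as follows. This is a reconstruction from the surrounding definitions rather than a verbatim display: it assumes \(\ell =\mathrm {LCP}_{(1+\varepsilon )k}(S,S')+1\), treats h as a projection onto mt independent uniformly random positions of [1, n], and uses \(p_2 = 1-(1+\varepsilon )k/n < 1\) together with \(mt \ge \log _{p_2}\frac{1}{n}\).

```latex
\begin{align*}
\Pr\big[(S,S',h)\in \mathbf{C}^{\mathbf{H}}_{\ell}\big]
  &\le \Big(1-\frac{d_H(S[1,\ell],S'[1,\ell])}{n}\Big)^{mt} \\
  &\le \Big(1-\frac{(1+\varepsilon)k}{n}\Big)^{mt} = p_2^{mt} \\
  &\le p_2^{\log_{p_2}\frac{1}{n}} = \frac{1}{n}.
\end{align*}
```

The first step holds because a collision requires every sampled position to avoid the mismatching positions; the last step holds because raising \(p_2<1\) to a larger exponent can only decrease the value.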
Corollary 3.12
Let \((S,S',h)\) be a uniformly random element of \(\mathbf {C}_{\ell }^{\mathbf {H}}\), where \(\ell \) is a random variable which always satisfies \(|\mathbf {C}_\ell ^{\mathbf {H}}|\ge 2n|\mathbf {H}|\). We have \(\mathrm {Pr}[(S,S',h)\in \mathbf {B}^{\mathbf {H}}] \le \frac{1}{2}\).
Proof
More formally, we shall prove that \(\Pr [(S,S',h)\in \mathbf {B}^{\mathbf {H}} \mid (S,S',h)\in \mathbf {C}_{\ell }^{\mathbf {H}}]\le \frac{1}{2}\) holds for a uniformly random triple \((S,S',h)\). Indeed:
\(\square \)
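One way to carry out the computation behind Corollary 3.12 is sketched below. It is a reconstruction under the stated hypotheses, writing \(|\cdot |\) for set sizes: for a fixed choice of the hash functions, a uniformly random element of \(\mathbf {C}_{\ell }^{\mathbf {H}}\) is bad with probability equal to the fraction of bad triples in \(\mathbf {C}_{\ell }^{\mathbf {H}}\), and Lemma 3.11 bounds the expected number of bad triples.

```latex
\begin{align*}
\Pr\big[(S,S',h)\in \mathbf{B}^{\mathbf{H}}\big]
  &= \mathbb{E}\!\left[\frac{|\mathbf{B}^{\mathbf{H}}\cap \mathbf{C}^{\mathbf{H}}_{\ell}|}{|\mathbf{C}^{\mathbf{H}}_{\ell}|}\right]
  \le \mathbb{E}\!\left[\frac{|\mathbf{B}^{\mathbf{H}}|}{2n|\mathbf{H}|}\right] \\
  &= \frac{\mathbb{E}\big[|\mathbf{B}^{\mathbf{H}}|\big]}{2n|\mathbf{H}|}
  \le \frac{n|\mathbf{H}|}{2n|\mathbf{H}|} = \frac{1}{2}.
\end{align*}
```

The inequality in the first line uses only the assumption \(|\mathbf {C}_{\ell }^{\mathbf {H}}|\ge 2n|\mathbf {H}|\), which is why the corollary is stated for any random \(\ell \) with this property.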
Below, we combine the previous results to prove that with constant probability Algorithm 1 correctly solves the Approximate LCS with \({{k}}\) Mismatches problem. Note that we can reduce the error probability to an arbitrarily small constant \(\delta >0\): it suffices to repeat the algorithm a constant number of times and among the resulting pairs, choose the longest substrings successfully verified to be at Hamming distance at most \((1+\varepsilon )k\); verification can be implemented naively in \(O(n)\) time.
Corollary 3.13
With nonzero constant probability, Algorithm 1 succeeds: it reports a substring of \(T_1\) and a substring of \(T_2\) at Hamming distance at most \((1+\varepsilon )k\), both of length at least \(\ell _k\), where \(\ell _k\) is the length of the longest common substring with k mismatches.
Proof
We will prove that the algorithm succeeds conditioned on the following events:

the preprocessing of Lemma 3.2 succeeds,

the preprocessing of Theorem 3.5 succeeds for each function \(u_{i,r}\),

\(\mathbf {C}_{\ell _k}^{\mathbf {H}}\) contains \((S,S',h)\) such that \(\mathrm {LCP}_{k}(S,S')= \ell _k\) (see Lemma 3.10),

the randomly chosen \((\bar{S},\bar{S}',\bar{h})\in \mathbf {C}_{\ell }^{\mathbf {H}}\) does not belong to \(\mathbf {B}^{\mathbf {H}}\) (see Corollary 3.12).
This assumption holds with probability \(\varOmega (1)\), because the probability of the complementary event can be bounded as follows, using the union bound on top of Lemma 3.2, Theorem 3.5, Lemma 3.10, and Corollary 3.12:
Successful preprocessing of functions \(u_{i,r}\) guarantees that the value \(\ell \) and the families \(\mathbf {C}_{\ell }^{\mathbf {H}}\) and \(\mathbf {C}_{\ell +1}^{\mathbf {H}}\) have been computed correctly. If \(\ell _k > \ell \), then \(\mathbf {C}_{\ell +1}^{\mathbf {H}}\) contains \((S,S',h)\) such that \(\mathrm {LCP}_{k}(S,S')=\ell _k\). The correctness of \(\mathrm {LCP}_{\tilde{k}}\) queries asserts that \(\mathrm {LCP}_{\tilde{k}}(S,S')\ge \ell _k\), so the algorithm considers prefixes of S and \(S'\) of length at least \(\ell _k\) as candidates for the resulting substrings. If \(\ell _k \le \ell \), on the other hand, then the randomly chosen \((\bar{S},\bar{S}',\bar{h})\in \mathbf {C}_{\ell }^{\mathbf {H}}\) satisfies \(\mathrm {LCP}_{(1+\varepsilon )k}(\bar{S},\bar{S}')\ge \ell \ge \ell _k\), so the algorithm considers prefixes of \(\bar{S}\) and \(\bar{S}'\) of length at least \(\ell \ge \ell _k\). In either case, a pair of substrings of length at least \(\ell _k\) and at Hamming distance at most \((1+\varepsilon )k\) is among the considered candidates. The resulting substrings also satisfy these conditions, because we return the longest candidates and the correctness of \(\mathrm {LCP}_{\tilde{k}}\) queries asserts that no substrings at distance more than \((1+\varepsilon )k\) are considered. \(\square \)
3.5 Proof of Theorem 3.5
Recall that each \(h\in \mathbf {H}\) is a t-tuple of functions \(u_{i,r}\), i.e. \(h = (u_{i,r_1}, u_{i,r_2}, \ldots , u_{i,r_t})\), where \(1\le i \le s\) and \(1 \le r_1< r_2< \cdots < r_t \le w\). We will show a preprocessing of functions \(u_{i,r}\) after which we will be able to compute the longest common prefix of any two strings \(u_{i,r} (S_j), u_{i,r}(S_{j'})\) in \(O(1)\) time. As a result, we will be able to compute the longest common prefix of \(h(S_j), h(S_{j'})\) in \(O(t)\) time. It also follows that we will be able to compare any two strings \(h(S_j), h(S_{j'})\) in \(O(t)\) time as the order \(\prec _h\) is defined by the character following the longest common prefix (or by the lengths \(|S_j|\) and \(|S_{j'}|\) if \(h(S_j)=h(S_{j'})\)). Therefore, we can sort strings \(h(S_1), h(S_2), \ldots , h(S_{2n})\) in \(O(t n \log n)\) time and \(O(n)\) space and then compute the longest common prefix of each two adjacent strings in \(O(tn)\) time. The trie on \(h(S_1), h(S_2), \ldots , h(S_{2n})\) can then be built in \(O(n)\) time by imitating its depth-first traversal.
It remains to explain how we preprocess individual functions \(u_{i,r}\). For each function, it suffices to build a trie on strings \(u_{i,r}(S_1), u_{i,r}(S_2), \ldots , u_{i,r}(S_{2n})\) and to augment it with an LCA data structure [12, 19]. We will consider two different methods for constructing the trie with time dependent on m. No matter what the value of m is, one of these methods will have \(O(n^{4/3}\log ^{4/3} n)\) running time. Let \(u_{i,r}\) be a projection onto a multiset \(\mathsf {P}\) of positions \(1 \le a_1 \le a_2 \le \cdots \le a_m \le n\) and denote \(T = T_1 \$^{n} T_2 \$^{n}\).
Lemma 3.14
The trie on \(u_{i,r}(S_1), \dots , u_{i,r}(S_{2n})\) can be constructed in \(O(\sqrt{m} n \log n)\) time and \(O(n)\) space correctly with error probability at most 1/n.
Proof
Without loss of generality assume that \(\sqrt{m}\) is integer. Let us partition \(\mathsf {P}\) into subsets \(B_1, \dots , B_{\sqrt{m}}\), where
Now \(u_{i,r}\) can be represented as a \(\sqrt{m}\)-tuple of projections \(b_1, b_2, \ldots , b_{\sqrt{m}}\) onto the subsets \(B_1, B_2, \ldots , B_{\sqrt{m}}\), respectively. We will build the trie by layers to avoid space overhead. Suppose that we have built the trie for a function \((b_1, b_2, \ldots , b_{\ell -1})\) and we want to extend it to the trie for \((b_1, b_2, \ldots , b_{\ell -1}, b_\ell )\).
Let p be a prime of value \(\varOmega (n^5)\). With error probability inverse polynomial in n, we can find such p in \(O(\log ^{O(1)}n)\) time; see [2, 33]. We choose a uniformly random \(r \in \mathbb {F}_p\) and create a vector \(\chi \) of length n. We initialise \(\chi \) as a zero vector and for each position \(a_{\ell ,q}\in B_\ell \), we increase \(\chi [a_{\ell ,q}]\) by \(r^{q}\). We then run the FFT algorithm for \(\chi \) and T in the field \(\mathbb {Z}_p\) [13]. The output of the FFT algorithm contains the inner products of \(\chi \) and all suffixes \(S_1, S_2, \ldots , S_{2n}\). The inner product of \(\chi \) and a suffix \(S_j\) is the Karp–Rabin fingerprint [25] \(\varphi _{\ell ,j}\) of \(b_\ell (S_j)\), where
If the fingerprints of \(b_\ell (S_j)\) and \(b_\ell (S_{j'})\) are equal, then \(b_\ell (S_j)\) and \(b_\ell (S_{j'})\) are equal with probability at least \(1-1/n^4\), and otherwise they differ (for a proof, see e.g. [30]).
For a fixed leaf of the trie for \((b_1, b_2, \ldots , b_{\ell -1})\), we first sort all the suffixes that end in it by fingerprints \(\varphi _{\ell ,j}\). Second, we lexicographically sort the strings \(b_\ell (S_{j})\) with distinct fingerprints. For this, we need to be able to compare \(b_\ell (S_j)\) and \(b_\ell (S_{j'})\) and to find the first character where they differ. We compare \(b_\ell (S_j)\) and \(b_\ell (S_{j'})\) character-by-character in \(O(\sqrt{m})\) time. We then append the leaf of the trie for \((b_1, b_2, \ldots , b_{\ell -1})\) with a trie on strings \(b_\ell (S_j)\) that can be built by imitating its depth-first traversal.
By the union bound, the error probability is at most \(\frac{1}{n^4} \cdot n^2 \sqrt{m} \le \frac{1}{n}\). We now analyse the complexity of the algorithm. For each of the \(\sqrt{m}\) layers, the FFT algorithm takes \(O(n \log n)\) time. The sort by fingerprints takes \(O(n \log n)\) time per layer, or \(O(\sqrt{m} n \log n)\) time in total. We finally need to estimate the total number of character-by-character comparisons in all the layers. We claim that it can be upper bounded by \(O(n \log n)\). The reason for that is as follows: if we consider the resulting trie for \(u_{i,r}(S_1), \dots , u_{i,r}(S_{2n})\), it has size \(O(n)\). Imagine that the layers cut this trie into a number of smaller tries. The total size of these tries is still \(O(n)\), and we build each of these tries using character-by-character comparisons. For a trie of size x, we need \(O(x \log x)\) comparisons, which in total is \(O(n \log n)\). Therefore, the character-by-character comparisons take \(O(\sqrt{m} n \log n)\) time in total. \(\square \)
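The fingerprinting step in the proof of Lemma 3.14 can be illustrated by a small sketch. It is not the construction from the proof: the inner products of \(\chi \) with all windows are evaluated directly instead of with FFT, and the names `projection` and `fingerprints`, the sample strings, the block `positions`, and the prime \(2^{61}-1\) are all assumptions chosen for the example.

```python
import random

def projection(text, start, positions):
    """Projected string b(S): characters of text[start:] at the given 1-based offsets."""
    return "".join(text[start + a - 1] for a in positions)

def fingerprints(text, n, positions, p, r):
    """Inner product (mod p) of the vector chi with every length-n window of text."""
    chi = [0] * n
    for q, a in enumerate(positions, start=1):
        # Repeated positions of the multiset accumulate in the same entry of chi.
        chi[a - 1] = (chi[a - 1] + pow(r, q, p)) % p
    codes = [ord(c) for c in text]
    # Direct evaluation stands in for the FFT step of the proof.
    return [sum(chi[i] * codes[start + i] for i in range(n)) % p
            for start in range(len(text) - n + 1)]

n = 5
text = "abcab" + "$" * n + "bcabc" + "$" * n   # T = T_1 $^n T_2 $^n for sample strings
positions = [2, 4, 4]                          # hypothetical block of the multiset P
p = 2**61 - 1                                  # a Mersenne prime, chosen for illustration
r = random.Random(0).randrange(1, p)           # fixed seed only to make the run repeatable
phi = fingerprints(text, n, positions, p, r)
```

Windows whose projected strings coincide always receive the same fingerprint; the converse holds only with high probability over the choice of r, which is the source of the error term in the lemma.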
The second method builds the trie using the algorithm described in the first paragraph of this section: we only need to give a method for computing the longest common prefix of \(u_{i,r} (S_j)\) and \(u_{i,r}(S_{j'})\) (or, equivalently, the first position where \(u_{i,r} (S_j)\) and \(u_{i,r}(S_{j'})\) differ). The following lemma shows that this query can be answered in \(O(n \log n / m)\) time, which gives \(O(n^2 \log ^2 n / m)\) time complexity of the trie construction.
Lemma 3.15
(see [4]) After \(O(n)\)-time and space preprocessing, the first position where two strings \(u_{i,r}(S_j)\) and \(u_{i,r}(S_{j'})\) differ can be found in \(O(n \log n / m)\) time correctly with error probability at most \(1/n^3\).
Proof
For \(m=O(\log n)\) the conclusion is trivial. Assume otherwise. We start by building the suffix tree for the string T which takes \(O(n)\) time and space [17, 36]. Furthermore, we augment the suffix tree with an LCA data structure in \(O(n)\) time [12, 19].
Let \(\ell = \lceil 3 n \ln n / m \rceil \). We can find the first \(\ell \) positions \(q_1< q_2< \cdots < q_\ell \) where \(S_{j}\) and \(S_{j'}\) differ in \(O(\ell )=O(n \log n / m)\) time using the kangaroo method [15, 28]. We set \(q_r=\infty \) if a given position does not exist. The idea of the kangaroo method is as follows. We can find \(q_1\) by one query to the LCA data structure in \(O(1)\) time. After removing the first \(q_1\) positions of \(S_{j}\) and \(S_{j'}\), we obtain suffixes \(S_{j+q_1}, S_{j'+q_1}\) and find \(q_2\) by another query to the LCA data structure, and so on. If at least one of the positions \(q_1, q_2, \ldots , q_\ell \) belongs to \(\mathsf {P}\), then we return the first such position as an answer, and otherwise we say that \(u_{i,r} (S_j) = u_{i,r} (S_{j'})\). The multiset \(\mathsf {P}\) can be stored as an array of multiplicities so that testing if an element belongs to it can be done in constant time.
Let us show that if p is the first position where \(u_{i,r}(S_j)\) and \(u_{i,r}(S_{j'})\) differ, then p belongs to \(\{q_1, q_2, \ldots , q_\ell \}\) with high probability. Because \(q_1< q_2< \cdots < q_\ell \) are the first \(\ell \) positions where \(S_j\) and \(S_{j'}\) differ, it suffices to show that at least one of these positions belongs to \(\mathsf {P}\). We rely on the fact that positions of \(\mathsf {P}\) are independent and uniformly random elements of [1, n]. Consequently, we have \(\Pr [q_1, \ldots , q_\ell \notin \mathsf {P}] = (1 - \ell /n)^{m} \le (1 - 3 \ln n/m)^{m} \le \frac{1}{e^{3\ln n}} = 1/n^3\). \(\square \)
By Lemmas 3.14 and 3.15, the trie on strings \(u_{i,r} (S_1), \dots , u_{i,r} (S_{2n})\) can be built in \(O(\min \{\sqrt{m}, n \log n / m \} \cdot n \log n) = O(n^{4/3} \log ^{4/3} n)\) time and \(O(n)\) space correctly with high probability, which implies Theorem 3.5 as explained in the beginning of this section.
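The query of Lemma 3.15 can be sketched as follows. This is a simplified illustration, not the data structure from the proof: the constant-time LCA query on the suffix tree is replaced by a naive character scan (`lce`), so only the control flow of the kangaroo method is shown. All function names and the sample input are assumptions made for the example.

```python
def lce(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:] (naive scan)."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length

def multiplicities(positions, n):
    """Array of multiplicities of the multiset P, indexed by 1-based position."""
    mult = [0] * (n + 1)
    for a in positions:
        mult[a] += 1
    return mult

def first_projected_mismatch(text, i, j, mult, limit):
    """Visit at most `limit` mismatches between text[i:] and text[j:] and return the
    first 1-based mismatch position that belongs to P, or None if there is none."""
    offset = 0  # number of characters skipped so far
    for _ in range(limit):
        offset += lce(text, i + offset, j + offset)   # jump to the next mismatch
        if i + offset >= len(text) or j + offset >= len(text):
            return None                               # one of the suffixes ended
        position = offset + 1
        if position >= len(mult):
            return None                               # beyond the projected range [1, n]
        if mult[position] > 0:
            return position
        offset += 1                                   # skip the mismatch and continue
    return None

text = "abcde" + "abXdY"   # two sample strings, stored back to back
```

In the lemma, `limit` plays the role of \(\ell = \lceil 3 n \ln n / m \rceil \); answering None after exhausting the limit corresponds to declaring the two projected strings equal.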
4 Approximate LCS with \({{k}}\) Mismatches
In this section, we consider an approximate variant of the LCS with \({{k}}\) Mismatches problem, defined as follows.
Problem 4.1
(Approximate LCS with \({{k}}\) Mismatches) Two strings \(T_1, T_2\) of length n, an integer k, and a constant \(z > 1\) are given. If \(\ell _k\) is the length of the longest common substring with k mismatches of \(T_1\) and \(T_2\), return a substring of \(T_1\) of length at least \(\ell _k/z\) that occurs in \(T_2\) with at most k mismatches.
Theorem 4.2

(a) The Approximate LCS with \(k\) Mismatches problem for \(z=2\) can be solved in \(O(n^{1.5}\log ^2 n)\) time and \(O(n^{1.5})\) space.
(b) Suppose there exist \(0< \varepsilon < 1\) and \(\delta >0\) such that the Approximate LCS with \(k\) Mismatches problem for \(z=2-\varepsilon \) and a binary alphabet can be solved in \(O(n^{2-\delta })\) time. Then SETH is false.
Proof
(a) The algorithm of Theorem 3.1 for \(\varepsilon =1\) computes a pair of substrings of length at least \(\ell _k\) of \(T_1\) and \(T_2\) that have Hamming distance at most 2k. Either the first halves or the second halves of the strings have Hamming distance at most k.
(b) We use the gap that exists in Lemma 2.5 for \(q>1\). Assume that there is such an algorithm for some \(\varepsilon \) and \(\delta \). We will run it for strings \(T_1\) and \(T_2\) from that lemma. Let \(q = \lceil \frac{3}{\varepsilon }\rceil -2\); then \(\ell /\ell ' \ge 2-\varepsilon \). If the Orthogonal Vectors problem has a solution, by Lemma 2.5(a), the algorithm produces a common substring with k mismatches of length at least \(\ell /(2-\varepsilon ) \ge \ell '\). Otherwise, by Lemma 2.5(b), its result has length smaller than \(\ell '\). We conclude that the conjectured approximation algorithm can be used to solve the Orthogonal Vectors problem.
The lengths of the selected strings are \(n=N(7dq+7d)+7dq=O(Nd)\) for \(d=c \log N\). Hence, the running time is \(O(n^{2-\delta })=O(N^{2-\delta }d^{O(1)})\), which, by Fact 2.2, contradicts SETH. \(\square \)
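The halving step used in part (a) of the proof can be made concrete with a short sketch. The names `hamming` and `better_half` are assumptions for the example, and the treatment of odd lengths (two overlapping halves of length \(\lceil L/2\rceil \)) is one possible way to handle rounding, not something stated in the proof.

```python
def hamming(a, b):
    """Hamming distance of two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def better_half(u, v, k):
    """Given equal-length u, v with hamming(u, v) <= 2k, return aligned halves of
    length ceil(len(u) / 2) whose Hamming distance is at most k."""
    if len(u) != len(v) or hamming(u, v) > 2 * k:
        raise ValueError("need equal lengths and Hamming distance at most 2k")
    length = len(u)
    half = (length + 1) // 2
    first = (u[:half], v[:half])
    if hamming(*first) <= k:
        return first
    # Otherwise the first half has at least k + 1 mismatches, so the remaining
    # characters have at most k - 1; one extra overlapping character adds at most 1.
    return (u[length - half:], v[length - half:])

pair_a = better_half("abcdefg", "abXdeYZ", 2)
pair_b = better_half("aaaaaa", "bbbaaa", 2)
```

Applied to the pair of substrings returned for \(\varepsilon =1\), this yields substrings of length at least \(\ell _k/2\) with at most k mismatches, which is the guarantee required for \(z=2\).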
5 LCS with k Mismatches for all k
The following problem has received considerable attention in recent years; see [9] and the references therein.
Problem 5.1
(Binary Jumbled Indexing) Construct a data structure over a binary string S of length n that, given positive integers \(\ell \) and q, can compute if there is a substring of S of length \(\ell \) containing exactly q ones.
A simple combinatorial argument shows that it suffices to compute the minimal and maximal number of ones in a substring of S of length \(\ell \), as for every intermediate number of ones a substring of S of this length exists as well. As a result, the Binary Jumbled Indexing problem can be solved in linear space and with constant-time queries. It turns out that the index can also be constructed in strongly subquadratic time.
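The linear-space index described above can be sketched in a few lines. This version is built naively in quadratic time with prefix sums, so it illustrates only the representation and the constant-time query, not a subquadratic construction; the names `build_jumbled_index` and `has_substring` are assumptions for the example.

```python
def build_jumbled_index(bits):
    """For every length l in 1..n, store (min ones, max ones) over all windows of
    length l. Quadratic-time construction, for illustration only."""
    n = len(bits)
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)
    index = [None] * (n + 1)   # index[0] is unused
    for l in range(1, n + 1):
        counts = [prefix[i + l] - prefix[i] for i in range(n - l + 1)]
        index[l] = (min(counts), max(counts))
    return index

def has_substring(index, l, q):
    """Is there a substring of length l with exactly q ones? Constant-time lookup."""
    fewest, most = index[l]
    return fewest <= q <= most

index = build_jumbled_index([1, 0, 0, 1, 1])
```

The query is correct because sliding a window by one position changes its number of ones by at most one, so every value between the minimum and the maximum is attained.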
Lemma 5.2
(Chan and Lewenstein [9]) The index for Binary Jumbled Indexing of \(O(n)\) size and with \(O(1)\)-time queries can be constructed in \(O(n^{1.859})\) expected time or in \(O(n^{1.864})\) deterministic time.
We use this result to solve the LCS with \({{k}}\) Mismatches problem for all values of k simultaneously.
Theorem 5.3
LCS with \({{k}}\) Mismatches for all \({\textsf {k}}\) can be solved in \(O(n^{2.859})\) expected time or in \(O(n^{2.864})\) deterministic time.
Proof
Note that, equivalently, we can compute, for all \(\ell =1,\ldots ,n\), what is the minimal Hamming distance between substrings of length \(\ell \) in \(T_1\) and \(T_2\).
Let M be an \(n \times n\) Boolean matrix such that \(M[i,j]=0\) if and only if \(T_1[i]=T_2[j]\). We construct \(2n-1\) binary strings corresponding to the diagonals of M: the string number p, for \(p \in \{-(n-1),\ldots ,n-1\}\), corresponds to the diagonal \(M[i,j]\,:\,j-i=p\). For each of the strings, we construct the jumbled index using Lemma 5.2.
Each diagonal corresponds to one of the possible alignments of \(T_1\) and \(T_2\). In the jumbled index we compute, in particular, for each value of \(\ell \) what is the minimal number of 1s (which correspond to mismatches between the corresponding positions in \(T_1\) and \(T_2\)) in a string of length \(\ell \). To compute the global minimum for a given \(\ell \), we only need to take the minimum across all the jumbled indexes.
By Lemma 5.2, all the jumbled indexes can be constructed in \(O(n^{2.859})\) expected time or in \(O(n^{2.864})\) time deterministically. \(\square \)
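The reduction in the proof of Theorem 5.3 can be illustrated with a naive sketch. Each diagonal is turned into a binary mismatch string, and the minimal number of ones per window length is computed with prefix sums instead of the jumbled index of Lemma 5.2, so the running time here is cubic. The function names are assumptions for the example.

```python
def min_hamming_per_length(t1, t2):
    """best[l] = minimal Hamming distance between a length-l substring of t1 and a
    length-l substring of t2, for l = 1..n (best[0] is unused)."""
    n = len(t1)
    if len(t2) != n:
        raise ValueError("strings must have equal length")
    best = [None] + list(range(1, n + 1))        # the distance never exceeds l
    for p in range(-(n - 1), n):                 # diagonal j - i = p (0-based indices)
        # Binary string of the diagonal: 1 marks a mismatch t1[i] != t2[i + p].
        diag = [int(t1[i] != t2[i + p]) for i in range(max(0, -p), min(n, n - p))]
        prefix = [0]
        for b in diag:
            prefix.append(prefix[-1] + b)
        for l in range(1, len(diag) + 1):
            fewest = min(prefix[s + l] - prefix[s] for s in range(len(diag) - l + 1))
            best[l] = min(best[l], fewest)
    return best

def lcs_with_k(best, k):
    """Length of the longest common substring with at most k mismatches."""
    return max((l for l in range(1, len(best)) if best[l] <= k), default=0)

best = min_hamming_per_length("abcd", "abed")
```

Once `best` is known, the answer for every k is read off by a single scan, which is why solving the problem for all lengths is equivalent to solving it for all k.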
Change history
19 March 2019
A Correction to this paper has been published: https://doi.org/10.1007/s00453019005601
10 April 2023
A Correction to this paper has been published: https://doi.org/10.1007/s0045302301119x
Notes
Recall that a compact trie stores only explicit nodes, that is, the root, the leaves, and nodes with at least two children. Its size is linear in the number of strings that are stored. Henceforth we call a compact trie simply a trie.
References
Abboud, A., Williams, R.R., Yu, H.: More applications of the polynomial method to algorithm design. In: Indyk, P. (ed.) 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pp. 218–230. SIAM (2015). https://doi.org/10.1137/1.9781611973730.17
Agrawal, M., Kayal, N., Saxena, N.: PRIMES is in P. Ann. Math. 160(2), 781–793 (2004). https://doi.org/10.4007/annals.2004.160.781
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990). https://doi.org/10.1016/s00222836(05)803602
Andoni, A., Indyk, P.: Efficient algorithms for substring near neighbor problem. In: 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pp. 1203–1212. SIAM (2006). https://doi.org/10.1145/1109557.1109690
Babenko, M.A., Starikovskaya, T.: Computing longest common substrings via suffix arrays. In: Hirsch, E.A., Razborov, A.A., Semenov, A.L., Slissenko, A. (eds.) Computer Science Symposium in Russia, CSR 2008, LNCS, vol. 5010, pp. 64–75. Springer (2008). https://doi.org/10.1007/9783540797098_10
Babenko, M.A., Starikovskaya, T.: Computing the longest common substring with one mismatch. Probl. Inf. Transm. 47(1), 28–33 (2011). https://doi.org/10.1134/S0032946011010030
Bille, P., Gørtz, I.L., Kristensen, J.: Longest common extensions via fingerprinting. In: Dediu, A., Martín-Vide, C. (eds.) Language and Automata Theory and Applications, LATA 2012, LNCS, vol. 7183, pp. 119–130. Springer (2012). https://doi.org/10.1007/9783642283321_11
Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. J. Discrete Algorithms 25, 42–50 (2014). https://doi.org/10.1016/j.jda.2013.06.003
Chan, T.M., Lewenstein, M.: Clustered integer 3SUM via additive combinatorics. In: Servedio, R.A., Rubinfeld, R. (eds.) 47th Annual ACM Symposium on Theory of Computing, STOC 2015, pp. 31–40. ACM (2015). https://doi.org/10.1145/2746539.2746568
Charalampopoulos, P., Crochemore, M., Iliopoulos, C.S., Kociumaka, T., Pissis, S.P., Radoszewski, J., Rytter, W., Waleń, T.: Linear-time algorithm for long LCF with \(k\) mismatches. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Combinatorial Pattern Matching, CPM 2018, LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.23
Cygan, M., Fomin, F.V., Kowalik, Ł., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., Saurabh, S.: Parameterized Algorithms. Springer (2015). https://doi.org/10.1007/9783319212753
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011). https://doi.org/10.1137/090779759
Fischer, M.J., Paterson, M.S.: String matching and other products. In: Karp, R.M. (ed.) Complexity of Computation, SIAM-AMS Proceedings, vol. 7, pp. 113–125. AMS, Providence, RI (1974)
Flouri, T., Giaquinta, E., Kobert, K., Ukkonen, E.: Longest common substrings with \(k\) mismatches. Inf. Process. Lett. 115(6–8), 643–647 (2015). https://doi.org/10.1016/j.ipl.2015.03.006
Galil, Z., Giancarlo, R.: Parallel string matching with \(k\) mismatches. Theor. Comput. Sci. 51, 341–348 (1987). https://doi.org/10.1016/03043975(87)900429
Grabowski, S.: A note on the longest common substring with \(k\)-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015). https://doi.org/10.1016/j.ipl.2015.03.003
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997). https://doi.org/10.1017/cbo9780511574931
Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: towards removing the curse of dimensionality. Theory Comput. 8(1), 321–350 (2012). https://doi.org/10.4086/toc.2012.v008a014
Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984). https://doi.org/10.1137/0213024
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). https://doi.org/10.1080/01621459.1963.10500830
Hui, L.C.K.: Color set size problem with application to string matching. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching, CPM 1992, LNCS, vol. 644, pp. 230–243. Springer (1992). https://doi.org/10.1007/3540560246_19
Ilie, L., Navarro, G., Tinta, L.: The longest common extension problem revisited and applications to approximate string searching. J. Discrete Algorithms 8(4), 418–428 (2010). https://doi.org/10.1016/j.jda.2010.08.004
Impagliazzo, R., Paturi, R.: On the complexity of \(k\)-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001). https://doi.org/10.1006/jcss.2000.1727
Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001). https://doi.org/10.1006/jcss.2001.1774
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249
Kociumaka, T., Starikovskaya, T., Vildhøj, H.W.: Sublinear space algorithms for the longest common substring problem. In: Schulz, A.S., Wagner, D. (eds.) Algorithms, ESA 2014, LNCS, vol. 8737, pp. 605–617. Springer (2014). https://doi.org/10.1007/9783662447772_50
Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30(2), 457–474 (2000). https://doi.org/10.1137/S0097539798347177
Landau, G.M., Vishkin, U.: Efficient string matching with \(k\) mismatches. Theor. Comput. Sci. 43, 239–249 (1986). https://doi.org/10.1016/03043975(86)901787
Leimeister, C., Morgenstern, B.: kmacs: the \(k\)-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014). https://doi.org/10.1093/bioinformatics/btu331
Porat, B., Porat, E.: Exact and approximate pattern matching in the streaming model. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, pp. 315–323. IEEE Computer Society (2009). https://doi.org/10.1109/FOCS.2009.11
Starikovskaya, T.: Longest common substring with approximately \(k\) mismatches. In: Grossi, R., Lewenstein, M. (eds.) Combinatorial Pattern Matching, CPM 2016, LIPIcs, vol. 54, pp. 21:1–21:11. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.21
Starikovskaya, T., Vildhøj, H.W.: Time-space trade-offs for the longest common substring problem. In: Fischer, J., Sanders, P. (eds.) Combinatorial Pattern Matching, CPM 2013, LNCS, vol. 7922, pp. 223–234. Springer (2013). https://doi.org/10.1007/9783642389054_22
Tao, T., Croot III, E., Helfgott, H.: Deterministic methods to find primes. Math. Comput. 81(278), 1233–1246 (2012). https://doi.org/10.1090/S002557182011025421
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) Research in Computational Molecular Biology, RECOMB 2018, LNCS, vol. 10812, pp. 211–224. Springer (2018). https://doi.org/10.1007/9783319899299_14
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016). https://doi.org/10.1089/cmb.2015.0235
Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pp. 1–11. IEEE Computer Society, Washington, DC, USA (1973). https://doi.org/10.1109/SWAT.1973.13
Williams, R.: A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci. 348(2–3), 357–365 (2005). https://doi.org/10.1016/j.tcs.2005.09.023
Acknowledgements
Jakub Radoszewski was supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science cofinanced by the European Union under the European Regional Development Fund. Funding was provided by Foundation for Polish Science (Grant No. Homing/20162/16).
Funding
Open access funding approved by Polish ICM agreement.
Additional information
This is a full and extended version of the conference paper [31].
The original online version of this article was revised: The article is approved under and paid for open access by the Polish ICM Agreement in 2019, but was not published open access. The article is retroactively changed to open access.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kociumaka, T., Radoszewski, J. & Starikovskaya, T. Longest Common Substring with Approximately k Mismatches. Algorithmica 81, 2633–2652 (2019). https://doi.org/10.1007/s0045301900548x