1 Introduction

The k-mappability problem Analyzing data derived from massively parallel sequencing experiments often depends on the process of genome assembly via resequencing; namely, assembly with the help of a reference sequence. In this process, a large number of reads (or short sequences) derived from a DNA donor during these experiments must be mapped back to a reference sequence, comprising a few gigabases, to establish the section of the genome from which each read has been derived. An extensive number of short-read alignment techniques and tools have been introduced to address this challenge emphasizing on different aspects of the process [16].

In turn, the process of resequencing depends heavily on how mappable a genome is with respect to reads of some fixed length m. Thus, given a reference sequence, for every substring of length m in the sequence, we want to count how many additional times this substring appears in the sequence when allowing for a small number k of errors. This computational problem and a heuristic approach to approximate the solution were first proposed in [12] (see also [5]), where a great variance in genome mappability between species and gene classes was revealed.

More formally, for a string T, let \(T_i^m\) denote the length-m substring of T that starts at position i. In the (k, m)-mappability problem, for a given string T of length n, we are asked to compute a table \({\text {A}}^m_{\le k}\) whose ith entry \({\text {A}}^m_{\le k}[i]\) is the number of indices \(j \ne i\) such that the substrings \(T_i^m\) and \(T_j^m\) are at Hamming distance at most k. In the previous study [12], the assumed values of the parameters were \(k \le 4\) and \(m \le 100\), and the alphabet of T was \(\{\mathtt {A},\mathtt {C},\mathtt {G},\mathtt {T}\}\).

Example 1.1

Consider the string \(T=\texttt {aababba}\) and \(m=3\). The following table shows the (k, 3)-mappability counts for \(k=1\) and \(k=2\).

position i | 1 | 2 | 3 | 4 | 5
substring \(T_i^3\) | \(\texttt {aab}\) | \(\texttt {aba}\) | \(\texttt {bab}\) | \(\texttt {abb}\) | \(\texttt {bba}\)
(1, 3)-mappability \({\text {A}}_{\le 1}^3[i]\) | 2 | 2 | 1 | 2 | 1
(2, 3)-mappability \({\text {A}}_{\le 2}^3[i]\) | 3 | 3 | 3 | 4 | 3
difference \({\text {A}}_{=2}^3[i]\) | 1 | 1 | 2 | 2 | 2

For instance, consider position 1. The (1, 3)-mappability is 2 due to the occurrences of \(\texttt {bab}\) and \(\texttt {abb}\) at positions 3 and 4, respectively. The (2, 3)-mappability is 3 since, among the other substrings, only \(\texttt {bba}\), occurring at position 5, has three mismatches with \(\texttt {aab}\).

For convenience, our algorithms compute an array \({\text {A}}_{=k}^m\) whose ith entry \({\text {A}}_{=k}^m[i]\) is the number of positions \(j\ne i\) such that substrings \(T_i^m\) and \(T_j^m\) are at Hamming distance exactly k. Note that \({\text {A}}_{\le k}^m[i]=\sum _{\kappa =0}^{k} {\text {A}}_{=\kappa }^m[i]\); see the “difference” row in the example above. Henceforth, we call this problem the (k, m)-mappability problem.
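As a concrete baseline, both tables can be computed by brute force in \(O(n^2 m)\) time. The following sketch (with illustrative function names, using 0-based positions rather than the paper's 1-based indexing) reproduces the values from Example 1.1:

```python
def naive_mappability(T, m, k):
    """Brute-force A_{=k}^m: A[i] = #{j != i : d_H(T_i^m, T_j^m) = k}.
    Positions are 0-based here, while the paper indexes from 1."""
    subs = [T[i:i + m] for i in range(len(T) - m + 1)]
    def d_H(S, U):
        return sum(a != b for a, b in zip(S, U))
    return [sum(1 for j, U in enumerate(subs) if j != i and d_H(S, U) == k)
            for i, S in enumerate(subs)]

def naive_mappability_at_most(T, m, k):
    """A_{<=k}^m[i] = sum over kappa = 0..k of A_{=kappa}^m[i]."""
    tables = [naive_mappability(T, m, kappa) for kappa in range(k + 1)]
    return [sum(col) for col in zip(*tables)]
```

On \(T=\texttt {aababba}\) with \(m=3\), the exact-distance tables for \(k=1,2\) and the cumulative table match the rows of the table in Example 1.1.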

Table 1 Known algorithms for computing (1, m)-mappability for strings over constant-sized alphabets

Using the suffix array and the LCP table [24, 26, 30], the (0, m)-mappability problem can be solved in \(O(n)\) time and space. Known solutions for computing (1, m)-mappability are shown in Table 1; the \(O(nm)\)-time and the \(O(n)\)-average-time solutions of Alzamel et al. [3] work also on strings over integer alphabets \(\{1,\dots ,\sigma \}\) for \(\sigma = n^{O(1)}\). Moreover, the latter algorithm was shown to be generalizable to arbitrary k, requiring \(O(n)\) space and, on average, \(O(kn)\) time if \(m=\Omega (k\log _\sigma n)\). A practically fast algorithm for arbitrary k was presented in [32]. In [1], the authors introduced an efficient construction of a genome mappability array \(B_k\) in which \(B_k[\mu ]\) is the smallest length m such that at least \(\mu \) of the length-m substrings of T do not occur elsewhere in T with at most k mismatches. This construction was further improved in [6].
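For intuition, here is a sketch of the (0, m)-mappability computation; it replaces the suffix array and LCP table with a hash map over the length-m substrings, which yields the same counts (with linear time holding in expectation rather than in the worst case):

```python
from collections import Counter

def zero_mappability(T, m):
    """A_{=0}^m[i]: number of occurrences of T[i:i+m] elsewhere in T
    (0-based positions)."""
    freq = Counter(T[i:i + m] for i in range(len(T) - m + 1))
    return [freq[T[i:i + m]] - 1 for i in range(len(T) - m + 1)]
```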

The all-pairs Hamming distance problem The evolutionary relationships between different species or taxa are usually inferred through phylogenetic analysis techniques. Some of these techniques rely on the inference of phylogenetic trees, and their first step is to compute the distances between all pairs of sequences representing the set of species or taxa under study [35]. This step, however, often dominates the running time of these methods. Depending on the application, the underlying model of evolution, and the optimality criterion, it may not be strictly necessary to compute the complete distance matrix (see [11, 17], for instance). Thus, in this preprocessing step, we are only interested in pairs with distances not exceeding a given threshold.

The computational problem can be formally defined as follows. Given a set \(\mathbf {R}\) of r length-m strings and an integer \(k\in \{0,\ldots , m\}\), return all pairs \((X_1,X_2)\in \mathbf {R}\times \mathbf {R}\), with \(X_1\ne X_2\), such that \(X_1\) and \(X_2\) are at Hamming distance at most k. This problem has been studied in the average-case model and efficient linear-time algorithms are known under some constraints on the value of k and some assumptions on the elements of \(\mathbf {R}\) [11, 20, 29]. In particular, these algorithms work in O(rm) average-case time if \(k<\frac{(m-k-1)\log \sigma }{\log rm}\) and the elements of \(\mathbf {R}\) are over an integer alphabet \(\Sigma \) of size \(\sigma >1\) with the letters of the strings being independent and identically distributed random variables uniformly distributed over \(\Sigma \). The indexing variant of the all-pairs Hamming distance problem has further applications in bioinformatics for querying typing databases [8] and in information retrieval for searching similar documents in a collection [19].
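A naive solution, shown below as a hedged sketch, runs in \(O(r^2 m)\) worst-case time; the early exit once the mismatch budget k is exceeded is precisely what the average-case analyses exploit, since random pairs tend to diverge within the first few letters:

```python
def all_pairs_within_k(R, k):
    """All unordered pairs (a, b), a < b, with d_H(R[a], R[b]) <= k."""
    out = []
    for a in range(len(R)):
        for b in range(a + 1, len(R)):
            mismatches = 0
            for x, y in zip(R[a], R[b]):
                if x != y:
                    mismatches += 1
                    if mismatches > k:   # early exit: budget exceeded
                        break
            if mismatches <= k:
                out.append((a, b))
    return out
```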

Intuitively, there is a connection between the (k, m)-mappability problem and the all-pairs Hamming distance problem that allows us to transfer the technique underlying a solution to the former into a solution to the latter (it is not a formal reduction between the problems). The connection is as follows: by first concatenating the r elements of \(\mathbf {R}\) to construct a new string T of length \(n=rm\), solving the former while considering only the r substrings of T starting at positions i with \(i \bmod m=1\), and summing up the resulting values, we would obtain the total size of the output of the latter.
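This connection can be checked directly on small inputs. In the sketch below (0-based indexing, so the aligned substrings start at positions i with \(i \bmod m = 0\); ordered pairs are counted), the mappability-style counts over the aligned substrings of the concatenation are summed:

```python
def all_pairs_output_size(R, k):
    """Ordered count of pairs (i, j), i != j, with d_H(R[i], R[j]) <= k,
    computed through the concatenated string T as described above."""
    r, m = len(R), len(R[0])
    T = "".join(R)
    aligned = [T[i:i + m] for i in range(0, r * m, m)]  # i mod m == 0
    d_H = lambda S, U: sum(a != b for a, b in zip(S, U))
    return sum(1 for i in range(r) for j in range(r)
               if i != j and d_H(aligned[i], aligned[j]) <= k)
```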

Henceforth, we assume, as in the mappability problem, that we are to compute all pairs at Hamming distance exactly k. In the end, we run the algorithm for all values of k up to a given threshold of interest.

Our contributions. We present several algorithms for the general case of the (k, m)-mappability problem. More specifically, our contributions are as follows:

  1.

    In Sect. 3, we show a randomized Las-Vegas algorithm for the (k, m)-mappability problem that works in \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 4^kk)\) time with high probability and \(O(n2^kk)\) space for a string over any ordered alphabet. It requires a careful adaptation of the technique of recursive heavy-path decompositions in a tree [10].

  2.

    In Sect. 4, we show an algorithm to solve the all-pairs Hamming distance problem for strings over any ordered alphabet that works in \(O(rm+ r\left( {\begin{array}{c}\log r+k\\ k\end{array}}\right) 4^kk \log r + \mathsf {output}\cdot 2^kk \log r)\) time and \(O(rm+r2^kk \log r)\) space.

  3.

    In Sect. 5, we show an algorithm for the (k, m)-mappability problem that works in \(O(n k \cdot (m+1)^k)\) time and \(O(n)\) space for a string over an integer alphabet. Together with the first result, this yields an \(O(n \cdot \min \{m^k,\log ^k n\})\)-time and \(O(n)\)-space algorithm for \(k=O(1)\).

  4.

    In Sect. 6, we show \(O(n^2)\)-time algorithms for a string over any ordered alphabet to compute all (k, m)-mappability tables for a fixed m and all \(k\in \{0,\ldots ,m\}\), or for a fixed k and all \(m\in \{k,\ldots ,n\}\).

  5.

    Finally, in Sect. 7, we prove that the (k, m)-mappability problem for \(k,m = \Theta (\log n)\) cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis [22, 23] fails.

In contributions 1 and 5, we apply recent advances in the Longest Common Substring with k Mismatches problem that were presented in [9, 27], respectively (see also [34]). In particular, compared to [9], our contribution 1 requires a careful counting of substring pairs to avoid multiple counting and a thorough analysis of the space usage. Technically, this is the most involved contribution. Contributions 1, 2, and 4 apply to strings over an arbitrary ordered alphabet; the running times of the respective algorithms are \(\Omega (n \log n)\), which is sufficient to renumber the letters of the input text so that its alphabet becomes an integer alphabet.

This work is an extended version of [2]. In comparison to the conference version, in particular, we improve the complexity of the main algorithm by a \(\Theta (\log n)\)-factor, remove the dependency on the alphabet size in contribution 3, and apply our techniques to solve the all-pairs Hamming distance problem (contribution 2).

2 Preliminaries

Let \(T=T[1]T[2]\cdots T[n]\) be a string of length \(|T|=n\) over a finite ordered alphabet \(\Sigma \) of size \(|\Sigma |=\sigma \). The empty string is denoted by \(\varepsilon \). In some algorithms we assume that the string is over an integer alphabet, i.e., \(\Sigma =\{1,\dots ,n^{O(1)}\}\). For two positions i and j on T, the substring (sometimes called factor) of T that starts at position i and ends at position j is \(T[i]\cdots T[j]\) (it is of length 0 if \(j<i\)). A prefix of T is a substring that starts at position 1 and a suffix of T is a substring that ends at position n. We denote the suffix that starts at position i by \(T_i\) and its prefix of length m by \(T_i^m\).

The Hamming distance between two strings S and T of the same length \(|S| = |T|\) is defined as \(d_H(S, T) = |\{i\in \{1, 2,\ldots , |S|\} : S[i] \ne T[i]\}|\). If \(|S| \ne |T|\), we set \(d_H(S, T)=\infty \).
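A direct transcription of this definition, with the \(\infty \) convention for unequal lengths:

```python
import math

def d_H(S, T):
    """Hamming distance; infinity when the lengths differ."""
    if len(S) != len(T):
        return math.inf
    return sum(a != b for a, b in zip(S, T))
```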

By \(\textsf {lcp}(U,V)\) we denote the length of the longest common prefix of strings U and V. For a fixed string T, we also set \(\textsf {lcp}(r, s)=\textsf {lcp}(T_{r},T_{s})\).
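A naive implementation of \(\textsf {lcp}\) for reference; as discussed below, the actual algorithms answer \(\textsf {lcp}(r,s)\) queries on the suffixes of T in \(O(1)\) time after linear-time preprocessing:

```python
def lcp(U, V):
    """Length of the longest common prefix of U and V."""
    l = 0
    while l < min(len(U), len(V)) and U[l] == V[l]:
        l += 1
    return l

def lcp_suffixes(T, r, s):
    """lcp(r, s) = lcp(T_r, T_s) for suffixes of a fixed string T (0-based)."""
    return lcp(T[r:], T[s:])
```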

Compact trie. A trie of a collection of strings C is a labeled tree that contains a node for every distinct prefix of a string in C; the root node is \(\varepsilon \); the set of terminal nodes is C; and edges are of the form \(u{\mathop {\rightarrow }\limits ^{c}}uc\), where u and uc are nodes and \(c\in \Sigma \). A compact trie \(\mathbf {T}\) of a collection of strings C is obtained from the trie of C by dissolving all non-branching nodes, excluding the root and the terminals. The nodes of the trie which become nodes of \(\mathbf {T}\) are called explicit nodes, whereas the other nodes are called implicit. Each edge of \(\mathbf {T}\) can be viewed as an upward maximal path of implicit nodes starting with an explicit node. The string label of an edge is a substring of one of the strings in C; the label of an edge is the first letter of the edge’s string label. Each node of the trie can be represented in \(\mathbf {T}\) by the edge it belongs to and an index within the corresponding path. We let \(\mathbf {L}(v)\) denote the path-label of a node v, i.e., the concatenation of the string labels of the edges along the path from the root to v. Additionally, \(\mathbf {D}(v)= |\mathbf {L}(v)|\) is the string-depth of node v.

Suffix tree. The suffix tree of a string T is the compact trie representing all suffixes of T. The suffix tree of a string T of length n over an integer alphabet can be constructed in \(O(n)\) time [14] and, after an \(O(n)\)-time preprocessing [7], it can be used to answer \(\textsf {lcp}(r,s)\) queries in \(O(1)\) time.

Hashing. We use perfect hashing to implement dynamic dictionaries supporting insertions and deletions of entries (key-value pairs), as well as look-ups of entries with a given key. Technically, we maintain a single global dictionary (which may simulate multiple local dictionaries) implemented using the following result originating from the work of Dietzfelbinger and Meyer auf der Heide [13].

Theorem 2.1

(see [13, Theorem 5.5]) For any constant \(c>1\) and positive integer n, there is a data structure that maintains a dynamic dictionary \(\mathbf {D}\) of size \(|\mathbf {D}|\le n\) with the following guarantees:

  1.

    The data structure occupies \(O(n)\) space.

  2.

    Handling any \(m\le n^c\) operations (look-ups, insertions, and deletions) in an on-line fashion costs \(O(n+m)\) time in total with probability at least \(1-n^{-c}\).

The constants in the time and space bounds depend on c.

The original data structure of [13, Theorem 5.5] supports n operations in \(O(n)\) time with probability at least \(1-n^{-c}\). To allow \(m\le n^c\) operations, we use instances supporting 2n operations in \(O(n)\) time with probability at least \(1-n^{1-2c}\), and we build such an instance from scratch after completing every n operations (using \(|\mathbf {D}|\le n\) insertions out of the allowance of 2n operations). By the union bound, all \(m\le n^c\) operations are thus handled in \(O(n+m)\) total time with probability at least \(1-n^{-c}\).

When using strings as dictionary keys, we rely on Karp–Rabin fingerprints (polynomial hashing) [25] with collision probability bounded by \(n^{-C}\) for strings of length at most n (and a sufficiently large constant C). In order to obtain Las-Vegas algorithms, we provide mechanisms for detecting collisions and resort to naive polynomial-time solutions upon detecting any.
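A minimal sketch of such fingerprints (assuming a Mersenne-prime modulus and a random base; function names are illustrative): after \(O(n)\) preprocessing, the fingerprint of any substring is retrieved in \(O(1)\) time, equal substrings always receive equal fingerprints, and distinct equal-length substrings collide with probability at most about \(n/p\) over the choice of the base.

```python
import random

def make_fingerprint(T, p=(1 << 61) - 1):
    """Karp-Rabin fingerprints of T: returns fp(i, j), the fingerprint of
    T[i:j] (0-based, half-open), computed in O(1) after O(n) preprocessing."""
    n = len(T)
    base = random.randrange(2, p)
    pref = [0] * (n + 1)          # pref[i] = hash of the prefix T[:i]
    pw = [1] * (n + 1)            # pw[i] = base**i mod p
    for i, c in enumerate(T):
        pref[i + 1] = (pref[i] * base + ord(c)) % p
        pw[i + 1] = pw[i] * base % p
    def fp(i, j):
        return (pref[j] - pref[i] * pw[j - i]) % p
    return fp
```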

3 Computing Mappability in \(O(n \log ^k n)\) Time and \(O(n)\) Space

Our algorithm operates on so-called modified strings. A modified string \(\alpha \) is a pair \((U(\alpha ),M(\alpha ))\), where \(U(\alpha )\) is a string and \(M(\alpha )\) a set of modifications. Each element of the set \(M(\alpha )\) is a pair of the form (ic) which denotes a substitution “\(U(\alpha )[i]:=c\)”. We assume that no two pairs in \(M(\alpha )\) share the same index i. By \( val (\alpha )\), we denote the string \(U(\alpha )\) after all the substitutions. The sets \(M(\alpha )\) for modified strings are implemented as (functional) lists. Whenever a modified string \(\beta \) is obtained by introducing an extra modification to a modified string \(\alpha \), the head of \(M(\beta )\) represents the new modification whereas the tail points to \(M(\alpha )\). We always introduce modifications in the left-to-right order so that the lists \(M(\alpha )\) are sorted according to the decreasing order of indices i.

The algorithm processes modified substrings of T that are modified strings originating from the substrings \(T_i^m\). In this case, the strings \(U(\alpha )\) are not stored explicitly. Instead, for a modified substring \(\alpha \) originating from \(T_i^m\), an index \( idx (\alpha )=i\) is stored.
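The representation of modified substrings can be sketched as follows (0-based positions; the class and method names are illustrative). Each substitution creates a new list head in \(O(1)\) time and shares the tail with the originating modified string:

```python
class Mod:
    """A modified substring: index of origin plus a functional (shared-tail)
    list of modifications, newest first, positions strictly decreasing."""
    __slots__ = ("idx", "mods")   # mods: nested pairs ((pos, c), tail) or None

    def __init__(self, idx, mods=None):
        self.idx, self.mods = idx, mods

    def substitute(self, pos, c):
        """O(1): new head holding the new modification, tail shared."""
        return Mod(self.idx, ((pos, c), self.mods))

    def value(self, T, m):
        """val(alpha): T[idx:idx+m] with all substitutions applied."""
        s = list(T[self.idx:self.idx + m])
        node = self.mods
        while node is not None:
            (pos, c), node = node
            s[pos] = c
        return "".join(s)
```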

Overview of the algorithm Intuitively, the algorithm proceeds by efficiently simulating transformations of a compact trie of modified substrings, initially containing all substrings \(T_i^m\). The elementary transformations are guided by the smaller-to-larger principle, and each of them consists in copying one subtree onto its sibling, with an appropriate modification introduced to each copied substring in order to match the label of the edge leading to the sibling. This process effectively registers one mismatch for a large batch of substrings at once, and therefore lays the foundation for solving the main problem in the aforementioned time.

The trie is constructed top-down recursively, and the final set of modified substrings that are present in the trie is known only when all the leaves of the trie have been reached.

A node v of the trie stores a set of modified substrings \(\textit{MS}(v)\). Initially, the root r stores all substrings \(T_i^m\) in its set \(\textit{MS}(r)\). The path-label \(\mathbf {L}(v)\) is the longest common prefix of (the values of) all the modified substrings in \(\textit{MS}(v)\), and the string-depth \(\mathbf {D}(v)\) is the length of this prefix. None of the strings in \(\textit{MS}(v)\) contains a modification at a position greater than \(\mathbf {D}(v)\). The children of v are determined by the subsets of \(\textit{MS}(v)\) that correspond to different letters at position \(\mathbf {D}(v)+1\). Furthermore, additional modified substrings with modifications at position \(\mathbf {D}(v)+1\) are created and inserted into the children’s \(\textit{MS}\)-sets. This corresponds to the intuition of copying subtrees onto their siblings; see Fig. 1.

Fig. 1

To the left: a trie of all length-3 substrings of aabbab. To the right: the effect of copying the right subtree onto the left subtree, which corresponds to changing the first letter of all its substrings from b to a. In the original trie, there was exactly one pair of substrings from different subtrees of the root at Hamming distance 1; after the operation, there is a leaf containing modified substrings corresponding to these substrings. Such copy operations are performed in our algorithm top-down in the trie, making sure that each resulting modified substring has at most k modifications

The goal is to propagate the modified substrings to the leaves and, by processing each leaf independently, register exactly once every pair of substrings \((T_i^m, T_j^m)\) differing on exactly k positions.

Now, we will describe the recursive routine for visiting a node.

Processing an internal node Assume that our node v has children \(u_1,\ldots ,u_a\). First, we distinguish a child of v with maximum-size set \(\textit{MS}\), with ties being broken arbitrarily; let it be \(u_1\). We will refer to this child as heavy and to every other as light. We will recursively branch into each child to take care of all pairs of modified substrings contained in any single subtree.

For this, we create an extra child \(u_{a+1}\) so that \(\textit{MS}(u_{a+1})\) contains all modified substrings from \(\textit{MS}(u_2) \cup \dots \cup \textit{MS}(u_a)\) with the letters at position \(\mathbf {D}(v)+1\) replaced by a common wildcard character $. By processing the subtree of \(u_{a+1}\), we will consider pairs of modified substrings that originate from different light children.

Additionally, we insert all modified substrings from \(\textit{MS}(u_2) \cup \dots \cup \textit{MS}(u_a)\) into \(\textit{MS}(u_1)\), substituting the letter at position \(\mathbf {D}(v)+1\) with the common letter at this position of modified substrings in \(\textit{MS}(u_1)\). This transformation will take care of pairs between the heavy child and the light ones.

As modified substrings with more than k substitutions are irrelevant for our algorithm, we refrain from creating them in the interest of time and space complexity.

Finally, the algorithm branches into the subtrees of \(u_1,\ldots ,u_{a+1}\). Pseudocode of this process is presented as Algorithm 1.

Let us note that, in the special case of a binary alphabet, the child \(u_{a+1}\) need not be created. Indeed, in this case, each node has at most two children and hence at most one light child, whereas the sole purpose of the subtree of \(u_{a+1}\) is to handle pairs of modified substrings that originate from different light children.


Processing a leaf Each modified substring \(\alpha \) stores its index of origin \( idx (\alpha )\) and the set of modifications \(M(\alpha )\). As we have seen, the substitutions introduced in the recursion are of two types: of wildcard origin and of heavy origin. For a modified substring \(\alpha \), we introduce a partition \(M(\alpha )=W(\alpha )\cup H(\alpha )\) into modifications of these kinds. For every leaf v, the modified substrings \(\alpha \in \textit{MS}(v)\) share the same value \( val (\alpha )\), and hence \(W(\alpha )\) is also the same. Finally, by \(W^{-1}(\alpha )\) we denote the set \(\{(j,T_{ idx (\alpha )}^m[j]) : (j,\$)\in W(\alpha )\}\). We call modified substrings \(\alpha ,\beta \in \textit{MS}(v)\) compatible if they satisfy the following condition:

$$\begin{aligned} H(\alpha ) \cap H(\beta ) = \emptyset ,\quad W^{-1}(\alpha ) \cap W^{-1}(\beta ) = \emptyset , \quad |H(\alpha )|+|H(\beta )|+|W(\alpha )|=k. \end{aligned}$$
(3.1)

Lemma 3.3 below shows that if two modified substrings are compatible, then the original substrings are at Hamming distance exactly k. Intuitively, \(\alpha \) and \(\beta \) are compatible only if the positions of modifications in \(M(\alpha )\cup M(\beta )\) do not contain any position j such that \(T^m_{ idx (\alpha )}[j]=T^m_{ idx (\beta )}[j]\).
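To make the mechanics concrete, the following self-contained sketch simulates the whole recursion level by level on an uncompacted trie (0-based positions; names are illustrative) and, at each leaf, checks Eq. 3.1 naively instead of using the efficient inclusion–exclusion counting; it illustrates the scheme and is not the complexity-bounded implementation:

```python
from collections import defaultdict

def mappability_exact_k(T, m, k):
    """Simplified sketch of the trie-transformation algorithm (0-based):
    A[i] = #{j != i : d_H(T[i:i+m], T[j:j+m]) = k}."""
    n = len(T)
    A = [0] * (n - m + 1)
    # A modified substring is a triple (idx, H, W): heavy-origin and
    # wildcard-origin modifications, as frozensets of (pos, letter) pairs.
    initial = [(i, frozenset(), frozenset()) for i in range(n - m + 1)]

    def letter(idx, H, W, pos):
        for p, c in H | W:            # current letter after substitutions
            if p == pos:
                return c
        return T[idx + pos]

    def process(ms, depth):
        if depth == m:                # leaf: check compatibility (Eq. 3.1)
            for x, (ia, Ha, Wa) in enumerate(ms):
                Wa_inv = frozenset((p, T[ia + p]) for p, _ in Wa)
                for y, (ib, Hb, Wb) in enumerate(ms):
                    if x == y:
                        continue
                    Wb_inv = frozenset((p, T[ib + p]) for p, _ in Wb)
                    if (not (Ha & Hb) and not (Wa_inv & Wb_inv)
                            and len(Ha) + len(Hb) + len(Wa) == k):
                        A[ia] += 1
            return
        groups = defaultdict(list)    # children, keyed by letter at `depth`
        for idx, H, W in ms:
            groups[letter(idx, H, W, depth)].append((idx, H, W))
        children = sorted(groups.values(), key=len, reverse=True)
        heavy, light = children[0], children[1:]
        h_letter = letter(*heavy[0], depth)
        wildcard = []
        for child in light:           # copy light members onto siblings
            for idx, H, W in child:
                if len(H) + len(W) < k:   # never exceed k modifications
                    heavy.append((idx, H | {(depth, h_letter)}, W))
                    wildcard.append((idx, H, W | {(depth, '$')}))
        for child in [heavy] + light + ([wildcard] if wildcard else []):
            process(child, depth + 1)

    process(initial, 0)
    return A
```

On the strings of Examples 1.1 and the figures below, the sketch reproduces the expected tables.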

Example 3.1

Let us consider modified strings \(\alpha _1,\dots ,\alpha _6\) with the following original strings and sets of modifications such that \( val (\alpha _i)=\mathtt {aba\$c}\) for all \(i=1,\dots ,6\).

i | \(U(\alpha _i)\) | \(H(\alpha _i)\) | \(W(\alpha _i)\) | \(W^{-1}(\alpha _i)\) | \(d_H(U(\alpha _1),U(\alpha _i))\)
1 | \(\mathtt {aaabc}\) | \(\{(2,\mathtt {b})\}\) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {b})\}\) | 0
2 | \(\mathtt {bbacc}\) | \(\{(1,\mathtt {a})\}\) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {c})\}\) | 3
3 | \(\mathtt {abbcb}\) | \(\{(3,\mathtt {a}),(5,\mathtt {c})\}\) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {c})\}\) | 4
4 | \(\mathtt {abacc}\) | \(\emptyset \) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {c})\}\) | 2
5 | \(\mathtt {acacc}\) | \(\{(2,\mathtt {b})\}\) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {c})\}\) | 2
6 | \(\mathtt {ababa}\) | \(\{(5,\mathtt {c})\}\) | \(\{(4,\$)\}\) | \(\{(4,\mathtt {b})\}\) | 2

Let us notice that \(W(\alpha _i)\) is the same for all i. Let \(k=3\). The only modified string that is compatible with \(\alpha _1\) is \(\alpha _2\). Each of the remaining modified strings violates exactly one of the conditions from Eq. 3.1: \(|H(\alpha _1)|+|H(\alpha _3)|+|W(\alpha _1)|=4\), \(|H(\alpha _1)|+|H(\alpha _4)|+|W(\alpha _1)|=2\), \(H(\alpha _1) \cap H(\alpha _5) = \{(2,\mathtt {b})\}\), \(W^{-1}(\alpha _1) \cap W^{-1}(\alpha _6) = \{(4,\mathtt {b})\}\). Indeed, we have \(d_H(U(\alpha _1),U(\alpha _2))=3\) and \(d_H(U(\alpha _1),U(\alpha _i)) \ne 3\) for \(i\in \{3,\dots ,6\}\).

As proved in Lemma 3.8 below, for every \(\alpha \in \textit{MS}(v)\), we should increment \(A_{=k}^m[ idx (\alpha )]\) for each compatible \(\beta \in \textit{MS}(v)\). We next show how to efficiently count these modified substrings using the inclusion–exclusion principle and several precomputed values, as we cannot afford to count them naively.

For convenience, let \(R(\alpha )\) denote the union of disjoint sets \(H(\alpha )\) and \(W^{-1}(\alpha )\). For a leaf v, let \( Count (s,B)\) denote the number of modified substrings \(\beta \in \textit{MS}(v)\) such that \(|H(\beta )|=s\) and \(B \subseteq R(\beta )\). All the non-zero values \( Count (\cdot ,\cdot )\) are stored in a hash table. They can be generated by iterating through all the subsets of \(R(\beta )\) for all modified substrings \(\beta \in \textit{MS}(v)\); this costs \(O(2^kk|\textit{MS}(v)|)\) time and space. Finally, the result for a modified substring \(\alpha \) can be computed using the following direct consequence of the inclusion–exclusion principle.

Lemma 3.2

The number of modified substrings \(\beta \in \textit{MS}(v)\) that are compatible with a modified substring \(\alpha \in \textit{MS}(v)\) is \(\sum _{B \subseteq R(\alpha )} (-1)^{|B|} Count (k-|M(\alpha )|,B)\).

Proof

First, let \(h=k-|M(\alpha )|\). We want to count the modified substrings \(\beta \in \textit{MS}(v)\) that satisfy \(|H(\beta )|=h\) and \(R(\alpha )\cap R(\beta )=\emptyset \). For \((i,x) \in R(\alpha )\), let \(A_{(i,x)}=\{\beta \in \textit{MS}(v): |H(\beta )|=h \text { and } (i,x) \in R(\beta )\}\). Then, we want to compute \( Count (h,\emptyset )-|\bigcup _{(i,x) \in R(\alpha )} A_{(i,x)}|\). By the inclusion–exclusion principle, we have

$$\begin{aligned} \left| \bigcup _{(i,x) \in R(\alpha )} A_{(i,x)}\right|&= \sum _{B \ne \emptyset ,\, B \subseteq R(\alpha )} (-1)^{|B|+1} \left| \bigcap _{(i,x) \in B} A_{(i,x)}\right| \\&= \sum _{B \ne \emptyset ,\, B \subseteq R(\alpha )} (-1)^{|B|+1} Count (h,B), \end{aligned}$$

which concludes the proof. \(\square \)
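As a sanity check, Lemma 3.2 can be evaluated directly on the data of Example 3.1 with \(k=3\). The sketch below (illustrative names; leaf members given as pairs \((H(\beta ), W^{-1}(\beta ))\)) builds the Count table over all subsets of \(R(\beta )\) and then applies the inclusion–exclusion formula:

```python
from collections import Counter
from itertools import combinations

def compatible_counts(leaf, k):
    """leaf: list of (H, Winv) pairs, each a frozenset of (pos, letter) pairs;
    all members of a leaf share the same number of wildcard modifications.
    Returns, for each alpha, its number of compatible betas (Lemma 3.2)."""
    count = Counter()
    for H, Winv in leaf:              # build Count(s, B) for all B subset of R(beta)
        R = sorted(H | Winv)
        for r in range(len(R) + 1):
            for B in combinations(R, r):
                count[(len(H), frozenset(B))] += 1
    result = []
    for H, Winv in leaf:              # inclusion-exclusion for each alpha
        R = sorted(H | Winv)
        target = k - len(H) - len(Winv)   # k - |M(alpha)|
        result.append(sum((-1) ** r * count[(target, frozenset(B))]
                          for r in range(len(R) + 1)
                          for B in combinations(R, r)))
    return result
```

On the six modified strings of Example 3.1 this returns \([1, 2, 0, 0, 1, 2]\); in particular, \(\alpha _2\) is the only modified string compatible with \(\alpha _1\), matching the example.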

Examples Examples of the execution of the algorithm for a binary and a ternary string can be found in Figs. 2 and 3, respectively.

Fig. 2

Computation of (2, 3)-mappability for the string \(T=\texttt {aababba}\) from Example 1.1. Edges leading to heavy children are drawn in bold. Note that the alphabet is binary in this case, so wildcard subtrees do not need to be introduced; the only substitutions are from the (at most one) light child to the heavy child. The letters shown above are the original letters before the substitutions. The pairs of compatible modified substrings are indicated with arrows; in the binary case, Eq. 3.1 implies that these are substrings with modifications at different positions and exactly \(k=2\) modifications in total. In the end, \(A^3_{=2}[1]=A^3_{=2}[2]=1\) and \(A^3_{=2}[3]=A^3_{=2}[4]=A^3_{=2}[5]=2\), as expected

Fig. 3

Computation of (1, 2)-mappability for the string \(T=\texttt {aabaca}\). This example illustrates the need for wildcard symbols in the case of a non-binary alphabet, as otherwise pairs from different light children of the same node would not be registered. In this case \(k=1\), so modified substrings are compatible if and only if exactly one of them has a modification of heavy origin or both have a modification of wildcard origin that originates from different letters. We have \(A^2_{=1}[1]=4\) and \(A^2_{=1}[2]=A^2_{=1}[3]=A^2_{=1}[4]=A^2_{=1}[5]=2\)

Correctness We will show that it is enough to count pairs of modified substrings obtained in the leaves. First, we show that a pair of compatible modified substrings implies a pair of length-m substrings at Hamming distance exactly k.

Lemma 3.3

If \(\alpha ,\beta \in \textit{MS}(v)\) are compatible, \(i= idx (\alpha )\), and \(j= idx (\beta )\), then \(d_H(T_i^m,T_j^m) = k\).

Proof

By Eq. 3.1 we have \(W^{-1}(\alpha ) \cap W^{-1}(\beta ) = \emptyset \), so \(T_i^m\) and \(T_j^m\) differ at the positions of the modifications in \(W(\alpha ) = W(\beta )\). They also differ at the positions of the modifications in \(H(\beta )\) since, at the nodes corresponding to these positions, an ancestor of \(\alpha \) (that is, the modified substring from which \(\alpha \) originates) was in the heavy child and an ancestor of \(\beta \) originated from a light child (recall that Eq. 3.1 includes \(H(\alpha )\cap H(\beta )=\emptyset \)). Symmetrically, \(T_i^m\) and \(T_j^m\) differ at the positions of the modifications in \(H(\alpha )\). In conclusion, they differ at the positions of the modifications in \(H(\alpha ) \cup H(\beta ) \cup W(\alpha )\). The three sets are disjoint, so \(|H(\alpha ) \cup H(\beta ) \cup W(\alpha )|=|H(\alpha )| + |H(\beta )| + |W(\alpha )|=k\) by Eq. 3.1. This shows that \(d_H(T_i^m,T_j^m) \ge k\). Conversely, since \( val (\alpha )= val (\beta )\), every position where \(T_i^m\) and \(T_j^m\) differ carries a modification in \(\alpha \) or \(\beta \), so \(d_H(T_i^m,T_j^m) \le k\). We conclude that \(d_H(T_i^m,T_j^m) = k\). \(\square \)

We proceed with a proof that if two length-m substrings are at distance at most k, then some leaf contains a pair of corresponding modified substrings that are compatible. Let us start with an observation that lists some basic properties of our algorithm. Both parts can be shown by straightforward induction.

Observation 3.4

  1. (a)

    If a node v stores modified substrings \(\alpha ,\beta \in \textit{MS}(v)\), then it has a descendant \(v'\) with \(\mathbf {D}(v')=\textsf {lcp}( val (\alpha ), val (\beta ))\) and \(\alpha ,\beta \in \textit{MS}(v')\).

  2. (b)

    Every node stores at most one modified substring originating from the same substring \(T_{\ell }^m\).

We use the following auxiliary lemma.

Lemma 3.5

Assume that \(d_H(T_i^m,T_j^m)=k\) and let \(1 \le x_1< x_2< \cdots < x_k \le m\) be the indices where the two substrings differ. Further let \(x_{k+1}=m+1\). For every \(p\in \{1,\ldots ,k+1\}\), there exist a node \(v_p\) and modified substrings \(\alpha _p,\beta _p \in \textit{MS}(v_p)\) such that:

  • \( idx (\alpha _p)=i\) and \( idx (\beta _p)=j\);

  • \(\textsf {lcp}( val (\alpha _p), val (\beta _p))=x_p-1=\mathbf {D}(v_p)\);

  • for each of the positions \(x_1,\ldots ,x_{p-1}\), either both \(M(\alpha _p)\) and \(M(\beta _p)\) contain a modification of wildcard origin at that position, or exactly one of the two sets contains a modification of heavy origin there;

  • there are no other modifications in \(M(\alpha _p)\) or \(M(\beta _p)\).

Proof

The proof goes by induction on p. As \(\alpha _1\) and \(\beta _1\), we take (un)modified substrings such that \( idx (\alpha _1)=i\), \( idx (\beta _1)=j\), and \(M(\alpha _1)=M(\beta _1)=\emptyset \). They are stored in the set \(\textit{MS}(r)\) for the root r, so Observation 3.4(a) guarantees the existence of a node \(v_1\) with \(\mathbf {D}(v_1)=\textsf {lcp}( val (\alpha _1), val (\beta _1))=x_1-1\) and \(\alpha _1,\beta _1\in \textit{MS}(v_1)\).

Let \(p>1\). By the inductive hypothesis, the set \(\textit{MS}(v_{p-1})\) contains modified substrings \(\alpha _{p-1}\) and \(\beta _{p-1}\). The node \(v_{p-1}\) has children \(w_1\), \(w_2\) corresponding to the letters \(T_i^m[x_{p-1}]\) and \(T_j^m[x_{p-1}]\), respectively. If \(w_1\) is the heavy child, then \(w_2\) is a light child and a modified substring \(\beta '\) such that \( idx (\beta ')=j\) and \(M(\beta ')=M(\beta _{p-1}) \cup \{(x_{p-1},T_i^m[x_{p-1}])\}\) is inserted into \(\textit{MS}(w_1)\). Then, we take \(\alpha '=\alpha _{p-1}\). The case that \(w_2\) is the heavy child is symmetric. Finally, if both \(w_1\) and \(w_2\) are light children, a child u of \(v_{p-1}\) is created for the wildcard symbol $. There exist modified substrings \(\alpha ',\beta ' \in \textit{MS}(u)\) such that: \( idx (\alpha ')=i\), \( idx (\beta ')=j\), \(M(\alpha ')=M(\alpha _{p-1})\cup \{(x_{p-1},\$)\}\), and \(M(\beta ')=M(\beta _{p-1})\cup \{(x_{p-1},\$)\}\).

In either case, we have \(\textsf {lcp}( val (\alpha '), val (\beta '))=x_p-1\). The set \((M(\alpha ')\cup M(\beta ')) \setminus (M(\alpha _{p-1}) \cup M(\beta _{p-1}))\) contains either a modification of heavy origin in one of the modified substrings or modifications of wildcard origin in both. Hence, by the inductive hypothesis, we can set \(\alpha _p=\alpha '\) and \(\beta _p=\beta '\). The node \(v_p\) with \(\mathbf {D}(v_p)=\textsf {lcp}( val (\alpha _p), val (\beta _p))\) and \(\alpha _p,\beta _p \in \textit{MS}(v_p)\) must exist due to Observation 3.4(a). \(\square \)

Example 3.6

Let us consider strings \(\alpha =\mathtt {aab}\) and \(\beta =\mathtt {aba}\) from Fig. 2 that differ at positions \(x_1=2\) and \(x_2=3\). The trie in the figure has a path that contains nodes storing the modified substrings from the following table.

p | \(\alpha \) | \(M(\alpha _p)\) | \( val (\alpha _p)\) | \(\beta \) | \(M(\beta _p)\) | \( val (\beta _p)\) | \(\mathbf {D}(v_p)\)
1 | aab | \(\emptyset \) | aab | aba | \(\emptyset \) | aba | 1
2 | aab | \(\{(2,\mathtt {b})\}\) | abb | aba | \(\emptyset \) | aba | 2
3 | aab | \(\{(2,\mathtt {b})\}\) | abb | aba | \(\{(3,\mathtt {b})\}\) | abb | 3

The following corollary is a direct consequence of Lemma 3.5.

Corollary 3.7

If \(d_H(T_i^m,T_j^m)=k\), then there is a leaf v and a pair of compatible modified substrings \(\alpha ,\beta \in \textit{MS}(v)\) with \(i= idx (\alpha )\) and \(j= idx (\beta )\).

Proof

Lemma 3.5, applied for \(p=k+1\), yields a leaf \(v_{k+1}\) that contains compatible modified substrings \(\alpha =\alpha _{k+1}\) and \(\beta =\beta _{k+1}\) with \( idx (\alpha )=i\) and \( idx (\beta )=j\). \(\square \)

The following lemma, a stronger version of the corollary, together with Lemma 3.3 shows that Algorithm 1 correctly computes the mappability table \({\text {A}}_{=k}^m\).

Lemma 3.8

If \(d_H(T_i^m,T_j^m)=k\), then there is exactly one leaf v and exactly one pair of compatible modified substrings \(\alpha ,\beta \in \textit{MS}(v)\) with \(i= idx (\alpha )\) and \(j= idx (\beta )\).

Proof

Corollary 3.7 implies that there is at least one leaf that contains compatible modified substrings \(\alpha \) and \(\beta \) with \( idx (\alpha )=i\) and \( idx (\beta )=j\). Now, it suffices to check that there is no other pair of compatible modified substrings \((\alpha ',\beta ') \ne (\alpha ,\beta )\) that would be present in some leaf u and satisfy \( idx (\alpha ')=i\) and \( idx (\beta ')=j\).

We apply Lemma 3.5. Let us first note that \(M(\alpha ') \cup M(\beta ')\) must contain modifications at positions \(x_1,\ldots ,x_k\) (since \( val (\alpha ')= val (\beta ')\)) and no modifications at other positions (otherwise, \(|H(\alpha ')|+|H(\beta ')|+|W(\alpha ')|\) would exceed k). Let p be the greatest index in \(\{1,\ldots ,k+1\}\) such that \(x_p-1 \le \textsf {lcp}( val (\alpha ), val (\alpha '))\). By Observation 3.4(b), \(u \ne v_{k+1}\), so \(p\le k\).

Thus, the node \(v_p\) is an ancestor of the leaf u, but the node \(v_{p+1}\) is not. Let us consider the children \(w_1\), \(w_2\) of \(v_p\) obtained by following edges with labels \(T_i^m[x_{p}]\) and \(T_j^m[x_{p}]\), respectively. If \(w_1\) is the heavy child, \(\beta '\) must contain a modification of heavy origin at position \(x_{p}\), so \(v_{p+1}\) is an ancestor of u; a contradiction. The same contradiction is obtained in the symmetric case that \(w_2\) is the heavy child. Finally, if both \(w_1\) and \(w_2\) are light, then either both \(\alpha '\) and \(\beta '\) contain a modification of wildcard origin at position \(x_{p}\), which again gives a contradiction, or they both contain a modification of heavy origin, which contradicts the first part of Eq. 3.1. \(\square \)

Remark 3.9

The recursive approach presented above is somewhat similar to the scheme used by Thankachan et al. [34] for computing the longest common substring with up to k mismatches of two strings. We attempted to adapt the approach of [34] to computing k-mappability, but did not succeed. A further advantage of our approach is that it yields a time complexity better by a factor of k! for super-constant k.

Implementation and complexity Our Algorithm 1, excluding the counting phase in the leaves, has exactly the same structure as Algorithm 1 in [9]. This is verified in detail in “Appendix A”. Proposition 13 from [9] provides a bound on the total number of the generated modified strings and an efficient implementation based on finger-search trees. We apply that proposition for a family \(\mathbf {F}\) composed of substrings \(T_i^m\) to obtain the following bounds.

Fact 3.10

(see [9, Proposition 13]) Algorithm 1 applied up to the leaves takes \(O(n\left( {\begin{array}{c}\log n+k+1\\ k+1\end{array}}\right) 2^k)\) time and generates \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 2^k)\) modified substrings.

Let us further analyze the space complexity of the algorithm.

Observation 3.11

If a node v is a child of w, then every element of \(\textit{MS}(v)\) is either an element of \(\textit{MS}(w)\) or a modified substring originating from an element of \(\textit{MS}(w)\).

Lemma 3.12

Algorithm 1 applied up to the leaves uses \(O(nk)\) working space.

Proof

We assume that, upon termination, the procedure processNode discards the set \(\textit{MS}(v)\) and all the modified strings created during its execution. This way, the whole memory allocated within a given call to processNode is freed. Since processNode returns no output and its only side effects are updates of the array \({\text {A}}_{=k}^m\), no information is lost through such garbage collection.

A call to \(processNode(v)\) for node v partitions the list \(\textit{MS}(v)\) into sublists corresponding to \(u_1,\dots ,u_a\), creates \(2(|\textit{MS}(u_2)|+\cdots +|\textit{MS}(u_a)|)\) new modified substrings (each requiring constant space to be stored), appends them to sublists corresponding to \(u_1\) and \(u_{a+1}\), and then recurses on the sublists. In particular, the elements of the original list \(\textit{MS}(v)\) are not copied but reused in the recursive call.

Let us consider a root-to-leaf path \(\rho \) in the recursion. Each recursive call uses \(O(1)\) local variables, which take \(O(n)\) space overall. We also need to bound the total number of modified substrings created by calls to processNode for nodes on the path \(\rho \).

By Observations 3.11 and 3.4(b), \(|\textit{MS}(v)|\) is non-increasing on \(\rho \). Moreover, if v is a light child of its parent w, then \(|\textit{MS}(v)|\le |\textit{MS}(w)|/2\). Let us consider all nodes w on \(\rho \) such that the unique child of w that is on \(\rho \) is a light child. The total number of modified strings created by the calls to \(processNode(w)\) for all such nodes w is \(O(n)\) since we can bound it from above by a geometric series that sums to \(O(n)\).

As for the calls to \(processNode(w)\) for the remaining nodes on \(\rho \), for every two modified strings they create, they put one of them in the child of w that also belongs to \(\rho \). Hence, it suffices to bound the total number of modified substrings originating from \(T_i^m\) for each position i that are in \(\textit{MS}(v)\) for some node v on \(\rho \). For a given position i, let \(\alpha _1,\dots ,\alpha _b\) be all such modified substrings originating from \(T_i^m\). By Observation 3.11, we have \(M(\alpha _1) \subsetneq M(\alpha _2) \subsetneq \dots \subsetneq M(\alpha _b)\) and thus \(b \le k\). In total, we create \(O(nk)\) modified substrings in calls to \(processNode\) on nodes of \(\rho \). \(\square \)

Next, we show how to improve the time complexity of Algorithm 1 by a relatively small change in its execution. Intuitively, we will take advantage of the fact that the modified substrings in a leaf of the recursion do not need to be sorted lexicographically.

Namely, whenever a modified substring \(\beta \) with exactly k modifications is created at a node v (i.e., \(|M(\alpha )|=k-1\) in the if-statement), we do not include \(\beta \) in the subsequent recursive calls. Instead, an entry \(( val (\beta ),\beta )\) is inserted into a global hash table. When processing a leaf v containing modified substrings with a common value \( val (\alpha )\), we move all modified substrings with value \( val (\alpha )\) from the global hash table to the set \(\textit{MS}(v)\). Finally, if any modified string \(\beta \) created while processing a given node v remains in the hash table upon completion of \(processNode(v)\), then \(\beta \) is removed from the hash table together with all other modified substrings with the value \( val (\beta )\). At this moment, an artificial leaf of the recursion containing all these modified substrings is created, and the standard routine is applied to process this leaf.

Recall that the hash table uses Karp–Rabin fingerprints to index strings, so collisions could cause incorrect results in the algorithm. To tackle this issue, whenever a modified substring \(\beta \) is inserted into the hash table and there is another modified substring with the same hash in the table, we pick any such modified substring \(\alpha \) and check if \( val (\alpha )= val (\beta )\) in \(O(k)\) time using \(\textsf {lcp}\) queries on T with a method that resembles kangaroo jumping [18, 28] (it requires \(O(n)\)-time preprocessing). By Lemma 3.12, the hash table contains up to \(O(nk)\) entries at any given time, so the collision probability is \(O(nk \cdot n^{-C})=O(n^{-C+2})\). Setting \(C>c+2\), we can make sure that this is dominated by the probability that the hash table fails to process the underlying insertion in \(O(1)\) amortized time.
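The verification step described above can be sketched with a dictionary keyed by fingerprints. In this illustrative sketch (the names `fingerprint` and `insert` are ours, not the paper's), a character-by-character comparison stands in for the \(O(k)\)-time \(\textsf {lcp}\)-based verification on T.

```python
import random

B = random.randrange(256, 1 << 30)   # random base for the rolling hash
P = (1 << 61) - 1                    # large Mersenne prime modulus

def fingerprint(s):
    """Karp-Rabin fingerprint of string s."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h

table = {}  # fingerprint -> list of strings sharing that fingerprint

def insert(s):
    """Insert s; return False if an equal-valued string is already stored.
    On a hash match, the candidate is verified explicitly (the paper
    uses O(k)-time lcp queries for this check)."""
    h = fingerprint(s)
    bucket = table.setdefault(h, [])
    for t in bucket:
        if t == s:
            return False
    bucket.append(s)
    return True

assert insert("abb") is True
assert insert("abb") is False   # duplicate value caught by verification
assert insert("aba") is True
```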

Let us call the resulting algorithm Algorithm 1’.

Lemma 3.13

The outputs of Algorithms 1 and 1’ are the same. Moreover, Algorithm 1’ works in \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 2^kk)\) time with high probability (up to the leaves) and uses the same amount of space as Algorithm 1.

Proof

Let v be a leaf in the recursion of Algorithm 1. If \(\textit{MS}(v)\) contains at least one modified substring with up to \(k-1\) modifications, v will be identified by the recursive procedure of Algorithm 1’. Then, all modified substrings with exactly k modifications that belong to v are retrieved from the global hash table. If \(\textit{MS}(v)\) does not contain any modified substring with fewer than k modifications, v will be identified upon a deletion from the global hash table at the lowest internal node u of the recursion in which a modified substring belonging to \(\textit{MS}(v)\) was created. Here, we use the fact that the path-labels \(\mathbf {L}(u)\) of all nodes u of the recursion are different. This shows that indeed the leaves of the recursion of Algorithms 1 and 1’ are the same.

As for the time complexity, the total number of modified substrings created by Algorithm 1’ is the same as in Algorithm 1, i.e., \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 2^k)\) by Fact 3.10. However, the time necessary to conduct the whole recursive procedure corresponds to the time complexity of Algorithm 1 if it had been executed with \(k-1\) instead of k, i.e., also \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 2^k)\) by Fact 3.10. After \(O(n)\)-time preprocessing, for each modified substring, we can compute its Karp–Rabin fingerprint and check collisions in \(O(k)\) time; this accounts for the additional factor k in the time complexity.

Finally, the space complexity stays the same because modified substrings with exactly k modifications are removed from the hash table at the latest when the recursion rolls back. \(\square \)

Lemmas 3.12 and 3.13 yield the complexity of Algorithm 1’. Note that, due to the application of the inclusion-exclusion principle in the leaves, we need to multiply the time complexity of the algorithm by \(2^k\) and increase the space complexity by \(O(n2^kk)\).

Theorem 3.14

There exists a Las-Vegas randomized algorithm that computes the (km)-mappability of a given length-n string in \(O(n2^kk)\) space and, with high probability, in \(O(n\left( {\begin{array}{c}\log n+k\\ k\end{array}}\right) 4^kk)\) time. For \(k = O(1)\), the space is \(O(n)\) and the time becomes \(O(n \log ^{k} n)\).

4 All-Pairs Hamming Distance Problem

Let us recall that in the all-pairs Hamming distance problem, given a set \(\mathbf {R}\) of r length-m strings and an integer \(k\in \{0,\ldots , m\}\), we are to return all pairs \((X_1,X_2)\in \mathbf {R}\times \mathbf {R}\), with \(X_1\ne X_2\), such that \(X_1\) and \(X_2\) are at Hamming distance at most k. We will show how the algorithm from the previous section can be modified to solve this problem at the cost of an additional \(\log r\)-factor in the complexity.

We run the algorithm from the previous section for T being a concatenation of all the strings in \(\mathbf {R}\) and only with substrings \(\{T_i^m: i \text { mod } m=1\}\) in the root. The algorithm needs to be updated only at the leaves of the compact trie. Henceforth, let us consider a trie leaf v with a set \(\textit{MS}(v)=\{\beta _1,\ldots ,\beta _p\}\) of modified substrings. We will further denote this set as \(\textit{MS}\) (\(|\textit{MS}|=p\)). Our goal is to list, for every \(\beta \in \textit{MS}\), all \(\beta ' \in \textit{MS}\) that are compatible with \(\beta \).

Let us construct a static balanced binary search tree (BST) in which the leaves correspond to the modified substrings \(\beta _i\). This way, each node of the BST corresponds to the set of consecutive candidates stored in the leaves of its subtree. If \(\beta _i,\ldots ,\beta _j\) are the modified substrings in the leaves of the subtree of a BST node u, then we denote \( set (u) = \{\beta _i,\ldots ,\beta _j\}\). A leaf is responsible for storing information only for itself, and an internal node stores the merged information of its children.

Our goal is to store information in each node u of the BST in such a way that, for any modified substring \(\alpha \in \textit{MS}\), we will be able to decide whether there is any other candidate in \( set (u)\) that is compatible with \(\alpha \). Therefore, in each node u, we will compute all the required machinery for using the inclusion-exclusion principle on the modified substrings in \( set (u)\), that is, a dictionary that stores all non-zero values of \( Count (s,B)\) for modified substrings \(\beta \in set (u)\). Since every \(\beta \in \textit{MS}\) is present in \(O(\log p)\) sets \( set (u)\), precomputing all mentioned information can be done in \(O(2^kk p \log p)\) time and space.

Our query algorithm for a given modified substring \(\beta \) is a recursive procedure starting at the root of the BST. Assume that the algorithm is at some BST node u. We use Lemma 3.2 and the dictionary for \( set (u)\) to count the elements \(\beta ' \in set (u)\) that are compatible with \(\beta \). If this number is positive, the algorithm recursively descends to the children of node u. In the end, modified substrings \(\beta '\) that are compatible with \(\beta \) will be listed at the leaves of the BST. The correctness of this algorithm follows from Lemma 3.8.
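The output-sensitive descent can be illustrated on a toy instance. In the sketch below, a counting callback stands in for the \( Count (s,B)\) dictionaries of Lemma 3.2; the point is that a subtree is entered only if it is guaranteed to contain a compatible candidate, so each reported element costs \(O(\log p)\) node visits.

```python
def build(items):
    """Static balanced BST as nested tuples: (items_in_subtree, left, right)."""
    if len(items) == 1:
        return (items, None, None)
    mid = len(items) // 2
    return (items, build(items[:mid]), build(items[mid:]))

def report_compatible(node, count_compatible):
    """List elements x with count_compatible([x]) > 0, visiting only
    subtrees whose aggregate compatible count is positive."""
    items, left, right = node
    if count_compatible(items) == 0:   # a dictionary lookup in the real algorithm
        return []
    if left is None:                   # leaf: a single candidate
        return items
    return report_compatible(left, count_compatible) + \
           report_compatible(right, count_compatible)

tree = build([3, 8, 1, 9, 4])
# Query: elements "compatible" with being even (toy stand-in for
# Hamming-compatibility of modified substrings).
evens = report_compatible(tree, lambda xs: sum(1 for x in xs if x % 2 == 0))
assert evens == [8, 4]
```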

Every application of Lemma 3.2 takes \(O(2^kk)\) time. For each modified substring \(\beta '\) that is compatible with a modified substring \(\beta \), the algorithm will visit \(O(\log p)\) BST nodes, which gives \(O(2^kk \log p)\) time for finding each compatible modified substring \(\beta ' \in \textit{MS}\). Note that \(p \le r\) (see Observation 3.4(b)). Summing up over all trie nodes v and applying Lemmas 3.13 and 3.12, we obtain the following result. (Observe that [9, Proposition 13] is applied for a family \(\mathbf {F}\) of size r rather than n.)

Theorem 4.1

There exists a Las-Vegas randomized algorithm that, given a set of r length-m strings and an integer k, solves the all-pairs Hamming distance problem in \(O(rm+2^kk r\log r)\) space and, with high probability, in \(O(rm + r\left( {\begin{array}{c}\log r+k\\ k\end{array}}\right) 4^kk\log r+ \mathsf {output}\,\cdot 2^kk \log r)\) time. For \(k = O(1)\), the space is \(O(rm+r\log r)\) and the time becomes \(O(rm + r \log ^{k+1} r+ \mathsf {output}\cdot \log r)\).

Notably, the algorithm underlying Theorem 4.1 works in O(rm) time (with high probability) if \(k=O(1)\), \(m=\Omega (\log ^{k+1} r)\), and \(\mathsf {output}=O(rm/\log r)\).

5 Computing Mappability in \(O(nm^k)\) Time and \(O(n)\) Space

In this section, we generalize the \(O(nm)\)-time algorithm for \(k=1\) and integer alphabets from [3]. To this end, we make use of an approach from [6]. The high-level idea from [6] is to define a lexicographic order on the suffixes of T that ignores the same k fixed positions of every suffix. (In fact, the algorithm does the same for many such combinations of k positions.) The algorithm then uses the suffix tree of T to sort the modified suffixes according to this new lexicographic order. The focus of the original algorithm is not on counting substrings that are at Hamming distance at most k, and so we adapt it with some extra care to avoid multiple counting.

We first generate all \(\left( {\begin{array}{c}m\\ \le k\end{array}}\right) \) subsets of \(\{1, \ldots , m \}\) of size at most k. For each such subset F, we consider the length-m substrings of T with their f-th letter substituted with \(\$ \not \in \Sigma \) for all \(f \in F\). We sort all these sets of strings in \(O(nk\left( {\begin{array}{c}m\\ \le k\end{array}}\right) )\) total time using the approach of [6], also obtaining the maximal blocks of equal strings in the sorted lists.

We now briefly describe the algorithm for sorting one such set of strings in time \(O(nk)\) for the sake of completeness. Let us assume for simplicity that \(F=\{f\}\) as the algorithm can be generalized trivially for larger sets. We first retrieve the sorted list of \(T_i^{f-1}\) for all i from the suffix tree. We then give ranks to these strings after we check equality of adjacent strings in the sorted list using \(\textsf {lcp}\) queries. We similarly rank strings \(T_{j}^{m-f}\) for all j. Finally, we sort the ranks of the pairs \((T_i^{f-1}, T_{i+f+1}^{m-f})\) using bucket sort.

Prior to running the above algorithm, we initialize arrays \(D_K\) for \(K \in \{1, \ldots , k\}\). For each maximal block, of size b, of equal strings obtained for some set F, we increment the b relevant entries of \(D_{|F|}\) by \(b-1\).

Note that if \(d_H(T_i^m,T_j^m)=\kappa \), then this will contribute \(\left( {\begin{array}{c}m-\kappa \\ K-\kappa \end{array}}\right) \) to each of \(D_{K}[i]\) and \(D_{K}[j]\) for \(K\ge \kappa \), since there are this many size-K supersets of the set of mismatching positions in the power set of \(\{1, \ldots , m \}\). We thus compute \({\text {A}}_{={K}}^m[i]=D_K[i]-\sum _{\kappa =0}^{K-1} \left( {\begin{array}{c}m-\kappa \\ K-\kappa \end{array}}\right) {\text {A}}_{=\kappa }^m[i]\) in increasing order with respect to K, and we are done. (We precompute all relevant binomial coefficients in \(O(k^2)\) time.)
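A simplified, self-contained rendering of this scheme is given below; dictionary grouping stands in for the \(O(nk)\)-time suffix-tree sorting, and the function name `mappability` is illustrative.

```python
from itertools import combinations
from math import comb
from collections import Counter

def mappability(T, m, k):
    """Compute A[K][i] = A_{=K}^m[i] for K = 0..k by wildcarding every
    subset F of at most k positions and applying inclusion-exclusion."""
    n = len(T) - m + 1                       # number of length-m substrings
    D = [[0] * n for _ in range(k + 1)]      # D[K][i], accumulated over |F| = K
    for K in range(k + 1):
        for F in combinations(range(m), K):
            groups = Counter()
            masked = []
            for i in range(n):
                s = list(T[i:i + m])
                for f in F:
                    s[f] = "$"               # wildcard the positions in F
                masked.append("".join(s))
                groups[masked[i]] += 1
            for i in range(n):               # a block of b equal strings adds b-1
                D[K][i] += groups[masked[i]] - 1
    A = [[0] * n for _ in range(k + 1)]
    for K in range(k + 1):                   # inclusion-exclusion, increasing K
        for i in range(n):
            A[K][i] = D[K][i] - sum(comb(m - kap, K - kap) * A[kap][i]
                                    for kap in range(K))
    return A

A = mappability("aababba", 3, 1)
assert A[1] == [2, 2, 1, 2, 1]   # can be checked by hand on the 5 substrings
```

For T = aababba with m = 3 and k = 1 (the string of Example 1.1), the five length-3 substrings are aab, aba, bab, abb, bba, and the counts above follow by direct comparison.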

Theorem 5.1

Given a string of length n, the (km)-mappability problem can be solved in \(O(nk\left( {\begin{array}{c}m\\ \le k\end{array}}\right) )\) time and \(O(n)\) space. For \(k= O(1)\), the time becomes \(O(n m^k)\).

Combining Theorems 3.14 and 5.1 gives the following result.

Corollary 5.2

For every \(k=O(1)\), there exists a randomized algorithm that computes the (km)-mappability of a given length-n string in \(O(n)\) space and in \(O(n \cdot \min \{m^k,\log ^k n\})\) time with high probability.

6 Computing (km)-Mappability for All k or for All m

Theorem 6.1

The (km)-mappability for a given m and all \(k\in \{0, \ldots , m\}\) can be computed in \(O(n^2)\) time using \(O(n)\) space.

Proof

We first present an algorithm which solves the problem in \(O(n^2)\) time using \(O(n^2)\) space and then show how to reduce the space usage to \(O(n)\).

We initialize an \(n \times n\) matrix M in which \(M[i, j]\) will store the Hamming distance between substrings \(T_i^m\) and \(T_j^m\). Let us consider two letters \(T[i]\ne T[j]\) of the input string, where \(i < j\). Such a pair contributes to a mismatch between the following pairs of strings:

$$\begin{aligned}(T_{i - m + 1}^m, T_{j - m + 1}^m), (T_{i - m + 2}^m, T_{j - m + 2}^m), \ldots , (T_i^m, T_j^m).\end{aligned}$$

This list of strings is represented by a diagonal interval in M, the entries of which we need to increment by 1. We process all \(O(n^2)\) pairs of letters and update the information on the respective intervals. Then \({\text {A}}_{= k}^m[i] = |\{j \ne i\,:\,M[i,j]= k\}|\).

To achieve \(O(1)\) time for each single addition on a diagonal interval, we use a well-known trick from an analogous problem in one dimension. Suppose that we would like to add 1 on the diagonal interval from \(M[x_1, y_1]\) to \(M[x_2, y_2]\). Instead, we can simply add 1 to \(M[x_1, y_1]\) and \(-1\) to \(M[x_2 + 1, y_2 + 1]\). Every cell will then represent the difference of its actual value to the actual value of its predecessor on the diagonal. After all such operations are performed, we can retrieve the actual values by computing prefix sums on each diagonal in a top-down manner.

To reduce the space usage to \(O(n)\), it suffices to observe that the value of \(M[i, j]\) depends only on the value of \(M[i-1,j-1]\) and at most two letter comparisons which can add \(+1\) and/or \(-1\) to the cell. Recall that \(M[i, j]=d_H(T_i^m,T_j^m)\). We need to subtract 1 from the previous result if the first characters of the previous substrings were different and add 1 if the last characters of the new substrings are different. Therefore, we can process the matrix row by row, from top to bottom, and compute the values \({\text {A}}_{=0}^m[i],\ldots ,{\text {A}}_{={m}}^m[i]\) while processing the ith row. \(\square \)
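The \(O(n)\)-space computation can be sketched as follows. This sketch derives each row of M from the previous one (recomputing column 0 directly for simplicity, at the cost of an extra \(O(nm)\) term) rather than via the diagonal prefix-sum trick, but it realizes the same row-by-row idea; the function name is illustrative.

```python
def mappability_all_k(T, m):
    """Return A with A[k][i] = A_{=k}^m[i] for all k in {0, ..., m},
    keeping only one row of the distance matrix at a time (0-indexed:
    row[j] = d_H(T[i:i+m], T[j:j+m]) for the current i)."""
    n = len(T) - m + 1
    A = [[0] * n for _ in range(m + 1)]
    # Row 0: distances from the first substring, computed directly.
    row = [sum(T[x] != T[j + x] for x in range(m)) for j in range(n)]
    for i in range(n):
        if i > 0:
            prev = row
            row = [0] * n
            row[0] = sum(T[i + x] != T[x] for x in range(m))
            for j in range(1, n):
                # Drop the leading mismatch (if any), add the trailing one.
                row[j] = (prev[j - 1]
                          - (T[i - 1] != T[j - 1])
                          + (T[i + m - 1] != T[j + m - 1]))
        for j in range(n):
            if j != i:
                A[row[j]][i] += 1
    return A

A = mappability_all_k("aababba", 3)
assert A[1] == [2, 2, 1, 2, 1]
```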

By \(\textsf {lcp}_k(i,j)\) we denote the length of the longest common prefix of \(T_i\) and \(T_j\) when up to k mismatches are allowed, that is, the maximum \(\ell \) such that \(d_H(T_i^\ell ,T_j^\ell ) \le k\). Flouri et al. [15] proposed an \(O(n^2)\)-time algorithm to compute the longest common substring of two strings S, T with at most k mismatches. Their algorithm actually computes the lengths of the longest common prefixes with at most k mismatches of every suffix of S and T and returns the maximum among them. Applied for \(S=T\), it gives the following result.

Lemma 6.2

[15] For a string T of length n, the values \(\textsf {lcp}_k(i,j)\) for all \(i,j \in \{1,\dots ,n\}\) can be computed in \(O(n^2)\) time.

Theorem 6.3

The (km)-mappability for a given k and all \(m\in \{k,\ldots , n\}\) can be computed in \(O(n^2)\) time and space.

Proof

First we compute all the values \(\textsf {lcp}_k(i,j)\) using Lemma 6.2. We initialize an \(n \times n\) matrix Q setting all entries to 0. Then, for a pair (ij) such that \(\textsf {lcp}_k(i,j)=\ell \), we increment the entries \(Q[\ell ,i]\) and \(Q[\ell ,j]\). Note that if \(\textsf {lcp}_k(i,j)=\ell \), then i (resp. j) will contribute 1 to the (km)-mappability values \({\text {A}}^m_{\le k}[j]\) (resp. \({\text {A}}^m_{\le k}[i]\)) for all \(m \in \{k, \ldots , \ell \}\). Thus, starting from the last row of Q, we iteratively add row \(\ell \) to row \(\ell -1\). By the above observation, row m ends up storing the (km)-mappability array \({\text {A}}^m_{\le k}\). \(\square \)
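A direct rendering of this proof could look as follows; a quadratic-time brute-force computation of \(\textsf {lcp}_k\) stands in for the algorithm of Flouri et al. [15], and the function name is illustrative.

```python
def mappability_all_m(T, k):
    """Return Q with Q[m][i] = A_{<=k}^m[i] for m in {k, ..., n} (0-indexed i).
    Entries Q[l][i] are bumped at l = lcp_k(i, j); suffix sums over rows
    then turn row m into the array A_{<=k}^m."""
    n = len(T)
    Q = [[0] * n for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n):
            mism, l = 0, 0          # extend while at most k mismatches
            while i + l < n and j + l < n:
                if T[i + l] != T[j + l]:
                    if mism == k:
                        break
                    mism += 1
                l += 1
            Q[l][i] += 1            # lcp_k(i, j) = l
            Q[l][j] += 1
    for l in range(n - 1, k - 1, -1):   # add row l+1 to row l
        for i in range(n):
            Q[l][i] += Q[l + 1][i]
    return Q

Q = mappability_all_m("aababba", 1)
assert [Q[3][i] for i in range(5)] == [2, 2, 1, 2, 1]
```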

7 Conditional Hardness for \(k,m = \Theta (\log n)\)

We will show that (km)-mappability cannot be computed in strongly subquadratic time when the parameters are \(\Theta (\log n)\), unless the Strong Exponential Time Hypothesis (SETH) of Impagliazzo, Paturi, and Zane [22, 23] fails. Our proof is based on the conditional hardness of the following decision version of the Longest Common Substring with k Mismatches problem.

[Problem definition of Common Substring of Length d with k Mismatches; figure omitted.]

Lemma 7.1

[27] Suppose there is \(\epsilon > 0\) such that Common Substring of Length d with k Mismatches can be solved in \(O(n^{2-\epsilon })\) time on strings over binary alphabet for \(k = \Theta (\log n)\) and \(d=21k\). Then SETH is false.

Theorem 7.2

If the (km)-mappability can be computed in \(O(n^{2-\epsilon })\) time for binary strings, \(k,m=\Theta (\log n)\), and some \(\epsilon >0\), then SETH is false.

Proof

We make a Turing reduction from Common Substring of Length d with k Mismatches. Let S and T be the input to the problem. We compute the (kd)-mappabilities of strings \(S\cdot T\) and \(S\cdot T_1^{d-1}\) and store them in arrays A and B, respectively. Henceforth, we consider only indices \(i\in \{1,\ldots ,n-d+1\}\) in the arrays. For each such index, A[i] holds the number of length-d factors of S, \(X:=S_{n-d+2}^{d-1} T_1^{d-1}\), and T that are at Hamming distance at most k from \(S_i^d\), and B[i] holds the number of length-d factors of S and X that are at Hamming distance at most k from \(S_i^d\). For each i, we subtract B[i] from A[i]. Then, A[i] holds the number of length-d factors of T that are at Hamming distance at most k from \(S_i^d\). Hence, Common Substring of Length d with k Mismatches has a positive answer if and only if \(A[i]>0\) for some \(i\in \{1,\ldots ,n-d+1\}\).

By Lemma 7.1, an \(O(n^{2-\epsilon })\)-time algorithm for Common Substring of Length d with k Mismatches with \(k=\Theta (\log n)\) and \(d=21k\) would refute SETH. By the shown reduction, an \(O(n^{2-\epsilon })\)-time algorithm for (km)-mappability with \(k,m = \Theta (\log n)\) would also refute SETH. \(\square \)
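The reduction can be illustrated end to end with a brute-force mappability oracle in place of the fast algorithm; here the oracle counts factors at Hamming distance at most k, and all function names are illustrative.

```python
def mappability_le(T, d, kk):
    """Brute-force A_{<=kk}^d for string T (stand-in for the fast oracle)."""
    n = len(T) - d + 1
    subs = [T[i:i + d] for i in range(n)]
    def dist(u, v):
        return sum(a != b for a, b in zip(u, v))
    return [sum(1 for j in range(n) if j != i and dist(subs[i], subs[j]) <= kk)
            for i in range(n)]

def common_substring_with_mismatches(S, T, d, kk):
    """Decide whether S and T share length-d substrings within kk mismatches,
    by subtracting the mappabilities of S.T and S.T[:d-1]."""
    A = mappability_le(S + T, d, kk)
    B = mappability_le(S + T[:d - 1], d, kk)
    m = len(S) - d + 1            # indices of length-d factors of S
    return any(A[i] - B[i] > 0 for i in range(m))

assert common_substring_with_mismatches("aabba", "babab", 3, 1) is True
```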

8 Final Remarks

Our main contribution is an \(O(n \cdot \min \{m^k,\log ^k n\})\)-time \(O(n)\)-space algorithm for solving the (km)-mappability problem for a length-n string over an integer alphabet. Let us recall that genome mappability, as introduced in [12], counts the number of substrings that are at Hamming distance at most k from every length-m substring of the text. One may also be interested in considering mappability under the edit distance model. This question also relates to recent contributions on computing approximate longest common prefixes and substrings under edit distance [6, 33]. In the case of the edit distance, in particular, a decision needs to be made whether sufficiently similar substrings only of length exactly m or of all lengths between \(m-k\) and \(m+k\) should be counted. We leave the mappability problem under edit distance for future investigation.