In this section, we describe how to turn the combinatorial properties of Section 3 into a space-efficient algorithm for computing the MAWs of a collection of words. Note that we do not analyze the time complexity of this algorithm.
In what follows, we say that word aub, where a and b are letters, is the overlap of the words au and ub.
Suppose we have only two words y1 and y2. We apply Lemma 2 to construct the reduced sets \(\mathrm {R}^{\ell }_{y_{1}}\) from \(\mathrm {M}^{\ell }_{y_{1}}\) and \(\mathrm {R}^{\ell }_{y_{2}}\) from \(\mathrm {M}^{\ell }_{y_{2}}\). Recall that words in \(\mathrm {R}^{\ell }_{y_{1}}\) are factors of y2 and words in \(\mathrm {R}^{\ell }_{y_{2}}\) are factors of y1.
Definition 1 (ℓ − 1 extensions)
Let {i,j} = {1, 2}. For every word in \(\mathrm {R}^{\ell }_{y_{i}}\), we consider all its occurrences in yj and take all the factors obtained by extending these occurrences to the left up to length ℓ − 1. We then obtain the set \(\overleftarrow {\mathrm {R}^{\ell }_{y_{i}}}\) of (ℓ − 1)-left extensions of words of \(\mathrm {R}^{\ell }_{y_{i}}\) in yj. Similarly, we define the set \(\overrightarrow {\mathrm {R}^{\ell }_{y_{i}}}\) of (ℓ − 1)-right extensions of words of \(\mathrm {R}^{\ell }_{y_{i}}\) in yj.
Formally, \(\overleftarrow {\mathrm {R}^{\ell }_{y_{i}}}=\{wv \in \text {Fact}(y_{j}) \colon v\in \mathrm {R}^{\ell }_{y_{i}} \text { and } |wv|\leq \ell -1 \}\) and \(\overrightarrow {\mathrm {R}^{\ell }_{y_{i}}}=\{vw \in \text {Fact}(y_{j}) \colon v\in \mathrm {R}^{\ell }_{y_{i}} \text { and } |vw|\leq \ell -1\}\).
Example 3 (Two sequences)
Let y1 = abaab, y2 = bbaaab and ℓ = 5. We have \(\mathrm {R}^{\ell }_{y_{1}}=\{\texttt {aaa},\texttt {bb}\}\) and \(\mathrm {R}^{\ell }_{y_{2}}=\{\texttt {aba},\texttt {baab}\}\).
We take aaa from \(\mathrm {R}^{\ell }_{y_{1}}\), search for it in y2, and extend it to the left to get \(\texttt {\underline {b}aaa}\) (we underline the letters added in the extension). The word bb cannot be extended further to the left as it appears only at the beginning of y2. The set of left extensions of words in \(\mathrm {R}^{\ell }_{y_{1}}\) is therefore \(\overleftarrow {\mathrm {R}^{\ell }_{y_{1}}}=\{\texttt {\underline {b}aaa},\texttt {aaa},\texttt {bb}\}\). The words of \(\mathrm {R}^{\ell }_{y_{2}}\) cannot be (ℓ − 1)-left-extended further in y1, hence \(\overleftarrow {\mathrm {R}^{\ell }_{y_{2}}}=\mathrm {R}^{\ell }_{y_{2}}=\{\texttt {aba},\texttt {baab}\}\). Similarly, the (ℓ − 1)-right extensions of words of \(\mathrm {R}^{\ell }_{y_{2}}\) in y1 are \(\overrightarrow {\mathrm {R}^{\ell }_{y_{2}}}=\{\texttt {aba},\texttt {aba\underline {a}},\texttt {baab}\}\), and the (ℓ − 1)-right extensions of words of \(\mathrm {R}^{\ell }_{y_{1}}\) in y2 are \(\overrightarrow {\mathrm {R}^{\ell }_{y_{1}}}=\{\texttt {aaa},\texttt {aaa\underline {b}},\texttt {bb},\texttt {bb\underline {a}},\texttt {bb\underline {aa}}\}\).
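The sets of Example 3 can be reproduced by brute force. The following is only a minimal sketch of Definition 1, not the procedure used later in this section: it scans the word directly instead of using a suffix tree, and the function names `occurrences`, `left_extensions` and `right_extensions` are our own, hypothetical choices.

```python
def occurrences(y, v):
    """Start positions (0-based) of all occurrences of v in y, found by scanning."""
    return [i for i in range(len(y) - len(v) + 1) if y.startswith(v, i)]


def left_extensions(reduced, y, ell):
    """All factors wv of y with v in `reduced` and |wv| <= ell - 1."""
    out = set()
    for v in reduced:
        for i in occurrences(y, v):
            end = i + len(v)
            # Prepend 0, 1, 2, ... letters while the length bound holds.
            for start in range(i, -1, -1):
                if end - start > ell - 1:
                    break
                out.add(y[start:end])
    return out


def right_extensions(reduced, y, ell):
    """All factors vw of y with v in `reduced` and |vw| <= ell - 1."""
    out = set()
    for v in reduced:
        for i in occurrences(y, v):
            # Append 0, 1, 2, ... letters while the length bound holds.
            for end in range(i + len(v), len(y) + 1):
                if end - i > ell - 1:
                    break
                out.add(y[i:end])
    return out


y1, y2, ell = "abaab", "bbaaab", 5
print(sorted(left_extensions({"aaa", "bb"}, y2, ell)))    # ['aaa', 'baaa', 'bb']
print(sorted(right_extensions({"aba", "baab"}, y1, ell)))  # ['aba', 'abaa', 'baab']
```

A reduced word that is already longer than ℓ − 1 contributes nothing, which matches the length condition in the formal definition.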
Lemma 4
Let {i,j} = {1,2}. A word aub, where a and b are letters, belongs to \(\mathrm {M}^{\ell }_{{\{y_{1},y_{2}\}}} \setminus (\mathrm {M}^{\ell }_{y_{i}} \cup \mathrm {M}^{\ell }_{y_{j}})\) if and only if there exists a pair \((x_{i},x_{j}) \in \mathrm {R}^{\ell }_{y_{i}} \times \mathrm {R}^{\ell }_{y_{j}}\) such that:
1. au is an (ℓ − 1)-right extension of xi;
2. ub is an (ℓ − 1)-left extension of xj.
Proof
It follows directly from Theorem 1. □
As a consequence, in order to find the words in \(\mathrm {M}^{\ell }_{{\{y_{i},y_{j}\}}} \setminus (\mathrm {M}^{\ell }_{y_{i}} \cup \mathrm {M}^{\ell }_{y_{j}})\), we take all the possible words that are obtained as the overlap of an (ℓ − 1)-right extension of a word in \(\mathrm {R}^{\ell }_{y_{i}}\) with an (ℓ − 1)-left extension of a word in \(\mathrm {R}^{\ell }_{y_{j}}\).
Example 4 (Two sequences, continued)
Let y1 = abaab, y2 = bbaaab and ℓ = 5. We start from \(\overleftarrow {\mathrm {R}^{\ell }_{y_{1}}}=\{\texttt {\underline {b}aaa},\texttt {aaa},\texttt {bb}\}\) and \(\overrightarrow {\mathrm {R}^{\ell }_{y_{2}}}=\{\texttt {aba},\texttt {aba\underline {a}},\texttt {baab}\}\). We take \(\texttt {aba\underline {a}}\) from \(\overrightarrow {\mathrm {R}^{\ell }_{y_{2}}}\) and \(\texttt {\underline {b}aaa}\) from \(\overleftarrow {\mathrm {R}^{\ell }_{y_{1}}}\). They overlap forming the word abaaa, which belongs to \(\mathrm {M}^{\ell }_{{\{y_{1},y_{2}\}}}\). Next, we consider \(\overrightarrow {\mathrm {R}^{\ell }_{y_{1}}}=\{\texttt {aaa},\texttt {aaa\underline {b}},\texttt {bb},\texttt {bb\underline {a}},\texttt {bb\underline {aa}}\}\) and \(\overleftarrow {\mathrm {R}^{\ell }_{y_{2}}}=\{\texttt {aba},\texttt {baab}\}\). We take \(\texttt {bb\underline {aa}}\) from \(\overrightarrow {\mathrm {R}^{\ell }_{y_{1}}}\) and baab from \(\overleftarrow {\mathrm {R}^{\ell }_{y_{2}}}\). They overlap forming the word bbaab, which belongs to \(\mathrm {M}^{\ell }_{{\{y_{1},y_{2}\}}}\). This completes \(\mathrm {M}^{\ell }_{{\{y_{1},y_{2}\}}}\), as there are no other overlaps.
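The overlap step of Example 4 can be checked mechanically. The sketch below takes the extension sets as given in the example and enumerates all overlaps in the sense defined at the start of this section; the helper names `overlap` and `all_overlaps` are hypothetical and introduced only for illustration.

```python
def overlap(au, ub):
    """Return aub if au = a.u and ub = u.b share the same word u, else None."""
    if len(au) != len(ub) or not au:
        return None
    if au[1:] != ub[:-1]:  # u read from au must equal u read from ub
        return None
    return au + ub[-1]


def all_overlaps(right_exts, left_exts):
    """Overlaps of every right extension with every left extension."""
    found = set()
    for au in right_exts:
        for ub in left_exts:
            w = overlap(au, ub)
            if w is not None:
                found.add(w)
    return found


# Extension sets copied from Example 4.
print(all_overlaps({"aba", "abaa", "baab"}, {"baaa", "aaa", "bb"}))
print(all_overlaps({"aaa", "aaab", "bb", "bba", "bbaa"}, {"aba", "baab"}))
```

On these inputs the first call yields only abaaa and the second only bbaab, in agreement with the example.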
It should now be clear how this approach can be generalized to any number of words. We show this in the next example.
Example 5 (Three sequences)
Let y1 = abaab, y2 = bbaaab, y3 = babababaa and ℓ = 5. We have that \(\mathrm {M}^{\ell }_{\{y_{1},y_{2}\}}=\{\texttt {aaaa,bab,aaba,abaaa,} \texttt {bbaab,abb,bbb}\}\) and \(\mathrm {M}^{\ell }_{y_{3}}=\{\texttt {aaa,bb,aab}\}\).
We want to compute \(\mathrm {M}^{\ell }_{\{y_{1},y_{2},y_{3}\}}=\{\texttt {aaaa,aaba,abaaa,bbaab,bbab,abb,} \texttt {bbb}\}\). We have \(\mathrm {M}^{\ell }_{\{y_{1},y_{2}\}}\cap \mathrm {M}^{\ell }_{y_{3}}=\emptyset \). Next, we examine \((\mathrm {M}^{\ell }_{\{y_{1},y_{2}\}}\cup \mathrm {M}^{\ell }_{y_{3}}) \setminus (\mathrm {M}^{\ell }_{\{y_{1},y_{2}\}}\cap \) \(\mathrm {M}^{\ell }_{y_{3}})\). By applying Lemma 2 we can infer that \(\texttt {abaaa,bbaab,bbb,aaaa,} \texttt {aaba,abb} \in \mathrm {M}^{\ell }_{\{y_{1},y_{2},y_{3}\}}\). Thus, we have \(\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}=\{\texttt {bab}\}\) and \(\mathrm {R}^{\ell }_{y_{3}}= \{\texttt {bb}, \texttt {aaa},\texttt {aab}\}\). Finally, to obtain bbab we apply Lemma 4 as follows. We build the sets of (ℓ − 1)-extensions \(\overleftarrow {\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}}=\{\texttt {bab},\texttt {\underline {a}bab}\}\), \(\overleftarrow {\mathrm {R}^{\ell }_{y_{3}}}=\{\texttt {bb},\) \(\texttt {aaa},\texttt {\underline {b}aaa},\texttt {aab},\texttt {\underline {a}aab},\texttt {\underline {b}aab}\}\), \(\overrightarrow {\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}}=\{\texttt {bab},\texttt {bab\underline {a}}\}\) and \(\overrightarrow {\mathrm {R}^{\ell }_{y_{3}}}=\{\texttt {bb},\texttt {bb\underline {a}},\texttt {bb\underline {aa}},\texttt {aaa},\texttt {aaa\underline {b}},\texttt {aab}\}\). Computing all the possible overlaps of a word in \(\overrightarrow {\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}}\) with a word in \(\overleftarrow {\mathrm {R}^{\ell }_{y_{3}}}\) and all possible overlaps of a word in \(\overrightarrow {\mathrm {R}^{\ell }_{y_{3}}}\) with a word in \(\overleftarrow {\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}}\) we get the word bbab, which is the overlap of bba \(\in \overrightarrow {\mathrm {R}^{\ell }_{y_{3}}}\) with bab \(\in \overleftarrow {\mathrm {R}^{\ell }_{\{y_{1},y_{2}\}}}\). This completes \(\mathrm {M}^{\ell }_{\{y_{1},y_{2},y_{3}\}}\), as there are no other overlaps.
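For small inputs such as those of Example 5, the MAW sets can be cross-checked by exhaustive enumeration. This is a sketch for verification only, with exponential cost in ℓ, and it rests on an assumption that is not restated in this section: a word is taken to be a MAW of a collection when it occurs in no word of the collection while its longest proper prefix and its longest proper suffix each occur in some word. The names `factors` and `maws` and the default binary alphabet are ours.

```python
from itertools import product


def factors(words, max_len):
    """All non-empty factors of length <= max_len of any word in `words`."""
    return {y[i:j]
            for y in words
            for i in range(len(y))
            for j in range(i + 1, min(len(y), i + max_len) + 1)}


def maws(words, ell, alphabet="ab"):
    """MAWs of length <= ell of the collection, under the assumed definition."""
    fact = factors(words, ell)
    fact.add("")  # the empty word is a factor of every word
    out = set()
    for n in range(1, ell + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            # Absent, but both maximal proper factors are present.
            if w not in fact and w[1:] in fact and w[:-1] in fact:
                out.add(w)
    return out


y1, y2, y3 = "abaab", "bbaaab", "babababaa"
print(sorted(maws([y1, y2], 5)))
print(sorted(maws([y3], 5)))
print(sorted(maws([y1, y2, y3], 5)))
```

Under the stated assumption, the three calls return the three sets given in Example 5, including the word bbab that is obtained only through Lemma 4.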
For the space-efficient algorithm, we will need to consider (ℓ − 1)-extensions of words in the reduced sets one at a time. For this, we define the set of (ℓ − 1)-extensions of a single reduced word. Formally, for a word x in \(\mathrm {R}^{\ell }_{y_{i}}\), we define \(\overleftarrow {X}=\{wx \in \text {Fact}(y_{j}) \colon |wx|\leq \ell -1 \}\) and \(\overrightarrow {X}=\{xw \in \text {Fact}(y_{j}) \colon |xw|\leq \ell -1\}\). Note that \(x\in \overleftarrow {X}\) and \(x\in \overrightarrow {X}\) and that \(\overleftarrow {\mathrm {R}^{\ell }_{y_{i}}}=\bigcup _{x\in \mathrm {R}^{\ell }_{y_{i}}}\overleftarrow {X}\) and \(\overrightarrow {\mathrm {R}^{\ell }_{y_{i}}}=\bigcup _{x\in \mathrm {R}^{\ell }_{y_{i}}}\overrightarrow {X}\). For example, for the word \(x=\texttt {aab}\in \mathrm {R}^{\ell }_{y_{3}}\) in the previous example, we have \(\overleftarrow {X}=\{\texttt {aab},\texttt {\underline {a}aab},\texttt {\underline {b}aab}\}\).
We now describe our algorithm. Let \(\mathcal {T}(x)\) be the suffix tree of word x and \(\mathcal {T}(X)\) be the generalized suffix tree of the words in set X. Let us further define the following operation over \(\mathcal {T}(x)\): occ(x,u) returns the starting positions of all occurrences of word u in word x. Let \(y_{\max \limits _{N}}\) denote the longest word in the collection {y1, … , yN}.
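To fix ideas about the interface of occ(x,u), here is a stand-in that returns the same positions without building \(\mathcal {T}(x)\). It is an assumption-laden sketch: it relies on Python's built-in `str.find` rather than a suffix tree, and it reports 0-based positions.

```python
def occ(x, u):
    """Starting positions of all occurrences of u in x (0-based)."""
    positions = []
    i = x.find(u)
    while i != -1:
        positions.append(i)
        i = x.find(u, i + 1)  # allow overlapping occurrences
    return positions


print(occ("babababaa", "bab"))  # [0, 2, 4]
```

A suffix tree answers the same query by locating the node spelling u and listing the leaves below it; the sketch only mirrors the output.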
Let N = 1. We read y1 from memory, construct \(\mathcal {T}(y_{1})\) [21], compute \(\mathrm {M}^{\ell }_{y_{1}}\) [7], and construct \(\mathcal {T}(\mathrm {M}^{\ell }_{y_{1}})\). We report \(\mathrm {M}^{\ell }_{y_{1}}\) as our first output. The space used thus far is bounded by \(\mathcal {O}(|y_{1}|+\| \mathrm {M}^{\ell }_{y_{1}}\| )=\mathcal {O}(|y_{\max \limits _{1}}|+\| \mathrm {M}^{\ell }_{y_{1}}\| )\).
At the Nth step, we already have \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\) in memory from the (N − 1)th step. We read yN from memory and construct \(\mathcal {T}(y_{N})\). The incremental step, for all N ∈ [2,k], works as follows.
Case 1: We want to check all pairs \((x_{1},x_{2}) \in \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}} \times \mathrm {M}^{\ell }_{y_{N}}\), applying Lemma 2, to construct the set
$$M=\{w \in \mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}} \colon w\in \mathrm{M}^{\ell}_{\{y_1,\ldots,y_{N-1}\}} \cup \mathrm{M}^{\ell}_{y_{N}}\}$$
and the sets \(\mathrm {R}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}, \mathrm {R}^{\ell }_{y_{N}}\). We proceed as follows. We first compute \(\mathrm {M}^{\ell }_{y_{N}}\) using \(\mathcal {T}(y_{N})\). Note that we cannot store \(\mathrm {M}^{\ell }_{y_{N}}\) explicitly, since it could be the case that \(\| \mathrm {M}^{\ell }_{y_{N}}\| =\omega (\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )\). Instead, we output each word in constant space as a triple <i1, i2, α> [7] such that \(y_{N}[i_{1}. . i_{2}]\cdot \alpha \in \mathrm {M}^{\ell }_{y_{N}}\), where α ∈ Σ. With this representation, the space used is bounded by \(\mathcal {O}(|y_{\max \limits _{N}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )\). We perform the following:
1. We first want to check if the elements of \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) are superwords of any element of \(\mathrm {M}^{\ell }_{y_{N}}\). We search for \(x_{2} \in \mathrm {M}^{\ell }_{y_{N}}\) in \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\), one element from \(\mathrm {M}^{\ell }_{y_{N}}\) at a time. By going through all the occurrences of x2 using \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\), we find all elements in \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) that are superwords of some x2.
2. We next want to check if \(x_{2} \in \mathrm {M}^{\ell }_{y_{N}}\) is a superword of any element of \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\). We use the classic matching statistics algorithm (see Section 7.8 of [23]), on \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\). This algorithm finds the longest prefix of x2[i..|x2|− 1] that matches any factor of the elements in \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\), for all i ∈ [0,|x2|− 1], using \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\). By definition, no element in \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) is a factor of another element in the same set. Thus, if a longest match corresponds to an element in \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\), this can be found using \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\).
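The outcome of the two checks above can be illustrated on the data of Example 5. The sketch below rests on our reading of Lemma 2, which is stated in Section 3 and not reproduced here: a word of one set is assumed to remain a MAW of the larger collection exactly when it is a superword of some word of the other set, and it goes to the reduced set otherwise. Plain substring tests replace the generalized suffix tree and matching statistics, and the names `is_superword` and `case1_split` are hypothetical.

```python
def is_superword(w, others):
    """True if w contains some word of `others` as a factor."""
    return any(x in w for x in others)


def case1_split(m_prev, m_new):
    """Split both MAW sets into kept words and the two reduced sets.

    Assumption (our reading of Lemma 2): a word is kept when it is a
    superword of a word in the other set, and reduced otherwise.
    """
    kept, r_prev, r_new = set(), set(), set()
    for x1 in m_prev:
        (kept if is_superword(x1, m_new) else r_prev).add(x1)
    for x2 in m_new:
        (kept if is_superword(x2, m_prev) else r_new).add(x2)
    return kept, r_prev, r_new


# Sets copied from Example 5.
m12 = {"aaaa", "bab", "aaba", "abaaa", "bbaab", "abb", "bbb"}
m3 = {"aaa", "bb", "aab"}
kept, r12, r3 = case1_split(m12, m3)
print(sorted(kept), sorted(r12), sorted(r3))
```

On this input the kept words are the six words inferred via Lemma 2 in Example 5, and the reduced sets are {bab} and {bb, aaa, aab}.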
We create set \(\mathrm {R}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) explicitly since it is a subset of \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\). Set \(\mathrm {R}^{\ell }_{y_{N}}\) is created implicitly: every element \(x_{2} \in \mathrm {R}^{\ell }_{y_{N}}\) is stored as a tuple <i1, i2, α> such that x2 = yN[i1..i2] ⋅ α. (Recall that yN is currently stored in memory.) We store every element of \(\{x_{2}: x_{2} \in M \cap \mathrm {M}^{\ell }_{y_{N}}\}\) with the same representation. All other elements of M can be stored explicitly as they are elements of \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\). The space used thus far is therefore bounded by \(\mathcal {O}(|y_{\max \limits _{N}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\| )\).
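The implicit representation can be made concrete as follows. This is a sketch under assumptions of our own: positions are 0-based and the range [i1..i2] is inclusive, and the names `ImplicitWord`, `encode` and `decode` do not come from [7].

```python
from typing import NamedTuple


class ImplicitWord(NamedTuple):
    """Constant-space handle for the word y[i1..i2] . alpha."""
    i1: int
    i2: int
    alpha: str


def encode(y, w):
    """Represent w, whose longest proper prefix occurs in y, as a triple."""
    i = y.find(w[:-1])
    if i == -1:
        raise ValueError("the longest proper prefix of w does not occur in y")
    return ImplicitWord(i, i + len(w) - 2, w[-1])


def decode(y, t):
    """Recover the explicit word; this requires y to be in memory."""
    return y[t.i1:t.i2 + 1] + t.alpha


y3 = "babababaa"
t = encode(y3, "aab")
print(t, decode(y3, t))
```

The triple occupies constant space whatever the length of the word, but it can be expanded only while the word it points into is available, which is why the text recalls that yN is currently stored in memory.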
Case 2: We want to compute \(\{w \colon w \in \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}, w \notin \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}} \cup \mathrm {M}^{\ell }_{y_{N}}\}\). To this end, we consider all pairs \((x_{1},x_{2}) \in \mathrm {R}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}} \times \mathrm {R}^{\ell }_{y_{N}}\). (We symmetrically consider all pairs \((x_{1},x_{2}) \in \mathrm {R}^{\ell }_{y_{N}} \times \mathrm {R}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\).)
We read yN to construct \(\mathcal {T}(y_{N})\). We then consider the pairs (x1,x2) of \(\mathrm {R}^{\ell }_{\{y_{1},\ldots ,{y_{N-1}}\}} \times \mathrm {R}^{\ell }_{y_{N}}\) one at a time. Using \(\mathcal {T}(y_{N})\) we compute PN = occ(yN,x1) and \(\overleftarrow {X_{1}} = \{wx_{1} \in \text {Fact}(y_{N}) \colon |wx_{1}|\leq \ell -1 \}\).
For all i ∈ [1,N − 1] we perform the following:
1. We read yi from memory and construct \(\mathcal {T}(y_{i})\).
2. We compute Pi = occ(yi,x2) and the set \(\overrightarrow {X_{2}} \cap \text {Fact}(y_{i}) = \{x_{2} w \in \text {Fact}(y_{i}) \colon |x_{2} w|\leq \ell -1 \}\).
3. For each \((\overleftarrow {x_{1}},\overrightarrow {x_{2}})\in \overleftarrow {X_{1}} \times (\overrightarrow {X_{2}}\cap \text {Fact}(y_{i}))\) with \(|\overleftarrow {x_{1}}|=|\overrightarrow {x_{2}}|\):
(a) We check whether the longest proper suffix of \(\overrightarrow {x_{2}}\) equals the longest proper prefix of \(\overleftarrow {x_{1}}\). If so, we set \(u=\overrightarrow {x_{2}}[1{\ldots } |\overrightarrow {x_{2}}|-1] = \overleftarrow {x_{1}}[0{\ldots } |\overleftarrow {x_{1}}|-2]\).
(b) In that case, by Lemma 4, x2[0] ⋅ u ⋅ x1[|x1|− 1] is a MAW of the set of words {y1, … , yN}, and so we add element x2[0] ⋅ u ⋅ x1[|x1|− 1], expressed implicitly, to set M.
Note that \(|P_{N}|=\mathcal {O}(|y_{N}|)\) for all x1 and \(|P_{i}|=\mathcal {O}(|y_{i}|)\) for all x2. If one pair of extensions is handled at a time, the space used is bounded by \(\mathcal {O}(|y_{\max \limits _{N}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\| )\). We repeat the same process with word yi, for all i ∈ [2,N − 1].
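The pair-by-pair processing of Case 2 can be sketched as follows. This is an illustration under simplifying assumptions and not the procedure analyzed above: occurrences are found by scanning rather than with suffix trees, the extension sets of the current pair are materialized as Python sets, results are stored explicitly, and no attempt is made to meet the stated space bound. All function names are hypothetical.

```python
def occ(y, u):
    """Starting positions (0-based) of u in y, found by scanning."""
    return [i for i in range(len(y) - len(u) + 1) if y.startswith(u, i)]


def left_ext(x, y, ell):
    """(ell-1)-left extensions of the single word x inside y."""
    return {y[s:i + len(x)]
            for i in occ(y, x)
            for s in range(max(0, i + len(x) - (ell - 1)), i + 1)}


def right_ext(x, y, ell):
    """(ell-1)-right extensions of the single word x inside y."""
    return {y[i:e]
            for i in occ(y, x)
            for e in range(i + len(x), min(len(y), i + ell - 1) + 1)}


def one_direction(r_left, left_words, r_right, right_words, ell):
    """Handle all pairs (x1, x2) of r_left x r_right, one pair at a time."""
    found = set()
    for x1 in r_left:
        for x2 in r_right:
            # Left extensions of x1, taken in the words where x1 occurs.
            left_x1 = set()
            for y in left_words:
                left_x1 |= left_ext(x1, y, ell)
            for y in right_words:  # one word y_i is examined at a time
                for rx2 in right_ext(x2, y, ell):
                    for lx1 in left_x1:
                        # rx2 = x2[0].u and lx1 = u.x1[-1] for a common u
                        if len(lx1) == len(rx2) and rx2[1:] == lx1[:-1]:
                            found.add(rx2 + lx1[-1])
    return found


def case2(r_prev, prev_words, r_new, y_new, ell):
    """Both directions: pairs in R_prev x R_new and, symmetrically, R_new x R_prev."""
    return (one_direction(r_prev, [y_new], r_new, prev_words, ell)
            | one_direction(r_new, prev_words, r_prev, [y_new], ell))


# Example 5: y3 is added to {y1, y2}.
print(case2({"bab"}, ["abaab", "bbaaab"], {"bb", "aaa", "aab"}, "babababaa", 5))
```

On the data of Example 5 the call returns only bbab, and on the data of Example 4 (previous word y1, new word y2) it returns abaaa and bbaab.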
Finally, we delete the set \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) and \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}})\), we set \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}=M\), where every element is now safely expressed explicitly, and construct \(\mathcal {T}(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}})\). We are now ready for step N + 1.
We arrive at the following result.
Theorem 2
If \(\mathcal {O}(|y_{\max \limits _{N-1}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\| )\) space is made available at step N − 1, we can compute \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\) from \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_{N-1}\}}\) using \(\mathcal {O}(|y_{\max \limits _{N}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )\) space.
Proof
The space complexity follows from the above discussion. The correctness follows from Lemmas 2 and 4. □
We obtain immediately the following corollary.
Corollary 1
If \(\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =\mathcal {O}(\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\| )\), for all 1 ≤ N < k, we can compute \(\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\) using \(\mathcal {O}(|y_{\max \limits _{k}}|+\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}\| )\) space.