Constructing Antidictionaries of Long Texts in Output-Sensitive Space

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, …, yk over an alphabet Σ, we are asked to compute the set M^ℓ_{y1,…,yk} of minimal absent words of length at most ℓ of the collection {y1, …, yk}. The set M^ℓ_{y1,…,yk} contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes.
Indeed, the set M^ℓ_y of minimal absent words of a word y is equal to M^ℓ_{y1,…,yk} for any decomposition of y into a collection of words y1, …, yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms, because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ‖M^ℓ_{y1,…,yN}‖ = o(n) for all N ∈ [1, k], where ‖S‖ denotes the sum of the lengths of words in set S.
For instance, in the human genome, n ≈ 3 × 10^9 but ‖M^12_{y1,…,yk}‖ ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all of M^ℓ_{y1}, …, M^ℓ_{y1,…,yk} can be computed in O(kn + Σ_{N=1}^{k} ‖M^ℓ_{y1,…,yN}‖) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y1, …, yk} and MaxOut = max{‖M^ℓ_{y1,…,yN}‖ : N ∈ [1, k]}.
Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.


Introduction
The word x is an absent word of the word y if it does not occur in y. The absent word x of y is called minimal if and only if all its proper factors occur in y. The set of all minimal absent words of a word y is denoted by M_y. The set of all minimal absent words of length at most ℓ of a word y is denoted by M^ℓ_y. For example, if y = abaab, then M_y = {aaa, aaba, bab, bb} and M^3_y = {aaa, bab, bb}. The upper bound on the number of minimal absent words is O(σn) [2], where σ is the size of the alphabet and n is the length of y, and this bound is tight for integer alphabets [3]; in fact, for large alphabets, such as when σ ≥ √n, this bound is tight even for minimal absent words having the same length [4,5].
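The definitions above can be checked mechanically. Below is a brute-force sketch (not one of the efficient algorithms cited in this paper; the function names are ours) that enumerates the minimal absent words of length at most ℓ and reproduces the example y = abaab:

```python
from itertools import product

def factors(w):
    """All non-empty factors (substrings) of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def maws(y, sigma, ell):
    """Minimal absent words of y of length at most ell, by brute force:
    x = aub is a MAW if au and ub occur in y but aub does not;
    letters absent from y are MAWs as well."""
    F = factors(y)
    out = {c for c in sigma if c not in F}
    for n in range(2, ell + 1):
        for t in product(sigma, repeat=n):
            x = "".join(t)
            if x not in F and x[:-1] in F and x[1:] in F:
                out.add(x)
    return out

print(sorted(maws("abaab", "ab", 5)))  # ['aaa', 'aaba', 'bab', 'bb']
print(sorted(maws("abaab", "ab", 3)))  # ['aaa', 'bab', 'bb']
```

This exponential enumeration only illustrates the definition; the paper's point is precisely to avoid such global computations.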
State-of-the-art algorithms compute all minimal absent words of y in O(σn) time [2,6,7] or in O(n + |M_y|) time [8,9] for integer alphabets. There also exist space-efficient data structures based on the Burrows-Wheeler transform of y that can be applied for this computation [10,11]. In many real-world applications of minimal absent words, such as in data compression [12][13][14][15], in sequence comparison [3,9], in on-line pattern matching [16], or in identifying pathogen-specific signatures [17], only a subset of minimal absent words may be considered, and, in particular, the minimal absent words of length (at most) ℓ. Since, in the worst case, the number of minimal absent words of y is Θ(σn), Ω(σn) space is required to represent them explicitly. In [9], the authors presented an O(n)-sized data structure for outputting minimal absent words of a specific length in optimal time for integer alphabets.
The problem with existing algorithms for computing minimal absent words is that they make use of Ω(n) space, and the same amount is required even if one is merely interested in the minimal absent words of length at most ℓ. This is because all of these algorithms must construct global data structures, such as the suffix array [6,7], over the whole input. In theory, this problem can be addressed by using the external memory algorithm for computing minimal absent words presented in [18]. The I/O-optimal version of this algorithm, however, requires a large amount of external memory to build the global data structures for the input [19]. One could also use the algorithm of [20] that computes M_y in O(n + |M_y|) time using O(min{n, z}) space, where z is the size of the LZ77 factorization of y. This algorithm also requires constructing the truncated DAWG, a type of global data structure which could take space Ω(n). Thus, in this paper, we investigate whether M^ℓ_y can be computed efficiently in output-sensitive space.
Our approach consists in decomposing y into a collection of k words, with a suitable overlap of length ℓ − 1 between any two consecutive words in the collection.
In fact, the definition of minimal absent word was originally given for languages (sets of words) closed under taking factors (called factorial languages). A word x = aub, with a, b ∈ Σ, is a minimal absent word of a given factorial language L over the alphabet Σ if x does not occur in any of the words of L but there exist y i , y j ∈ L such that au is a factor of y i and ub is a factor of y j . The set of minimal absent words of a word y is precisely the set of minimal absent words of the language Fact(y) of factors of y. That is, M y = M Fact(y) .
More generally, if {y 1 , . . . , y k } is a collection of k words, M {y 1 ,...,y k } is defined as the set of minimal absent words of Fact({y 1 , . . . , y k }) = ∪ k i=1 Fact(y i ). So, a word x = aub, with a, b ∈ Σ, is a minimal absent word of {y 1 , . . . , y k } if and only if x is not a factor of any of the words in the collection but there exist i, j such that au is a factor of y i and ub is a factor of y j .
We have the following lemma.

Lemma 1 Let y be a word, and let y1, …, yk be a decomposition of y into a collection of words such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. Then M^ℓ_y = M^ℓ_{y1,…,yk}.

Proof If x ∈ M^ℓ_y, then x is absent from every yi, since each yi is a factor of y; moreover, since any two consecutive words overlap in at least ℓ − 1 letters, every factor of y of length at most ℓ − 1 occurs in some word of the collection, so the maximal proper prefix and suffix of x do, and x ∈ M^ℓ_{y1,…,yk}. Conversely, let x = aub, with a, b ∈ Σ, be a minimal absent word of length at most ℓ of {y1, …, yk}. If x were a factor of y, then, since x is not a factor of any of the words in the collection, an occurrence of x would have to cross the boundary between two consecutive words yi and yi+1, with au ending within yi. But because yi and yi+1 have an overlap of length at least ℓ − 1 ≥ |au|, that occurrence of au followed by b would lie entirely within yi+1, so aub would belong to the set of factors of {y1, …, yk}. Contradiction.
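The decomposition idea can be sanity-checked by brute force. In the sketch below, the word, the split into two blocks overlapping in ℓ − 1 letters, and the helper names are our own illustrative choices:

```python
from itertools import product

def factors(w):
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def maws_of_collection(words, sigma, ell):
    """Minimal absent words of length <= ell of a collection of words:
    aub with au and ub factors of some words, aub a factor of none,
    plus the letters occurring in no word."""
    F = set().union(*(factors(w) for w in words))
    out = {c for c in sigma if c not in F}
    for n in range(2, ell + 1):
        for t in product(sigma, repeat=n):
            x = "".join(t)
            if x not in F and x[:-1] in F and x[1:] in F:
                out.add(x)
    return out

y = "abaabbaaab"
ell = 4
# Two blocks of y overlapping in ell - 1 = 3 letters.
blocks = [y[:6], y[3:]]            # "abaabb" and "abbaaab" share "abb"
assert blocks[0][-3:] == blocks[1][:3]
# The MAWs of length <= ell of y and of the collection coincide.
assert maws_of_collection([y], "ab", ell) == maws_of_collection(blocks, "ab", ell)
```

The equality holds because, with overlap ℓ − 1, every factor of y of length at most ℓ lies entirely within one of the blocks.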
By Lemma 1, we can state our problem as follows: Problem 1 Given a collection of k words {y1, …, yk} over an alphabet Σ and an integer ℓ > 1, compute the set M^ℓ_{y1,…,yk} of all the words x of length at most ℓ, such that x is absent from all the words of the collection while there exist 1 ≤ i, j ≤ k, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj.
In data compression, this scenario corresponds to computing the antidictionary of k documents [12,13]. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. As discussed above, this computation generally requires Ω(n) space for n = Σ_{N=1}^{k} |yN|. We do the same computation incrementally using output-sensitive space. This goal is reasonable when ‖M^ℓ_{y1,…,yN}‖ = o(n) for all N ∈ [1, k], where ‖S‖ denotes the sum of the lengths of words in set S. In the human genome, n ≈ 3 × 10^9 but ‖M^12_{y1,…,yk}‖ ≈ 10^6, where k is the total number of chromosomes.
Our Results Antidictionary-based compressors work on Σ = {0, 1} and in bioinformatics we have Σ = {A, C, G, T}; we thus consider a constant-sized alphabet for stating our results. We consider the word RAM model with w-bit machine words, where w = Ω(log n). We analyze algorithms in the worst case and measure space in terms of machine words. We show that all of M^ℓ_{y1}, …, M^ℓ_{y1,…,yk} can be computed in O(kn + Σ_{N=1}^{k} ‖M^ℓ_{y1,…,yN}‖) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y1, …, yk} and MaxOut = max{‖M^ℓ_{y1,…,yN}‖ : N ∈ [1, k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
Paper Organization Section 2 provides the necessary definitions and notation used throughout the paper. In Section 3, we prove several combinatorial properties, which form the basis of our new technique. In Sections 4 and 5, we present our main results. Our experimental results are presented in Section 6. Section 7 concludes the paper with some remarks for further investigation.
A preliminary version of this paper appeared as [1]. Compared to the preliminary version, we have extended the work by adding a simplified space-efficient version of the algorithm (see Section 4). We have also added additional experimental results using real-world datasets, which further substantiate our contribution.

Preliminaries
We generally follow [21]. An alphabet Σ is a finite ordered non-empty set of elements called letters. A word is a sequence of elements of Σ. The set of all words over Σ of length at most ℓ is denoted by Σ^{≤ℓ}. We fix a constant-sized alphabet Σ, i.e., |Σ| = O(1). Given a word y = uxv over Σ, we say that u is a prefix of y, x is a factor (or subword) of y, and v is a suffix of y. We also say that y is a superword of x. A factor x of y is called proper if x ≠ y. We let Fact(y) denote the set of factors of word y.
Given a word y over Σ, the set M_y of minimal absent words (MAWs) of y is defined as {aub : a, b ∈ Σ, au and ub are factors of y but aub is not} ∪ {c ∈ Σ : c does not occur in y}.
Given a collection of k words y1, …, yk over Σ, the set M_{y1,…,yk} of minimal absent words of the collection {y1, …, yk} is defined as {aub : a, b ∈ Σ, au and ub are factors of some words of the collection but aub ∉ ∪_{N=1}^{k} Fact(yN)} ∪ {c ∈ Σ : c does not occur in any of the words yi}.
MAWs of length 1 of y can be easily found with a linear-time constant-space scan, hence in what follows we will only focus on the computation of MAWs of length at least 2.
The suffix tree T(y) of a non-empty word y of length n is the compact trie representing all suffixes of y [21]. The branching nodes of the trie, as well as the terminal nodes that correspond to non-empty suffixes of y, become explicit nodes of the suffix tree, while the other nodes are implicit. We let L(v) denote the path-label from the root node to node v. We say that node v is path-labeled L(v), i.e., the concatenation of the edge labels along the path from the root node to v. Additionally, D(v) = |L(v)| is used to denote the word-depth of node v. A node v such that the path-label L(v) = y[i..n − 1], for some 0 ≤ i ≤ n − 1, is terminal and is also labeled with index i. Each factor of y is uniquely represented by either an explicit or an implicit node of T(y) called its locus. The suffix-link of a node v with path-label L(v) = aw is a pointer to the node path-labeled w, where a ∈ Σ is a single letter and w is a word. The suffix-link of v exists by construction if v is a non-root branching node of T(y).
The matching statistics of a word x[0..|x| − 1] with respect to a word y is an array MS_x[0..|x| − 1] such that MS_x[i] is the length of the longest prefix of x[i..|x| − 1] that is a factor of y [23]. T(y) can be constructed in O(n) time, and, given T(y), we can compute MS_x in O(|x|) time [23].
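Assuming the standard definition of matching statistics (MS_x[i] is the length of the longest prefix of x[i..|x| − 1] occurring in y), the array can be illustrated with a naive quadratic computation; the linear-time computation via T(y) cited above is what the paper actually uses:

```python
def matching_statistics(x, y):
    """Brute-force matching statistics of x with respect to y:
    ms[i] = length of the longest prefix of x[i:] that is a factor of y."""
    ms = []
    for i in range(len(x)):
        length = 0
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        ms.append(length)
    return ms

print(matching_statistics("aab", "abaab"))  # [3, 2, 1]
```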

Combinatorial Properties
For convenience, we consider the following setting. Let y1, y2 be two words over the alphabet Σ. Let ℓ be a positive integer and set M^ℓ_{y1} = M_{y1} ∩ Σ^{≤ℓ} and M^ℓ_{y2} = M_{y2} ∩ Σ^{≤ℓ}. We want to construct M^ℓ_{y1,y2} = M_{y1,y2} ∩ Σ^{≤ℓ}. Let x ∈ M^ℓ_{y1,y2}. We have two cases: either x ∈ M^ℓ_{y1} ∪ M^ℓ_{y2} (Case 1) or x ∉ M^ℓ_{y1} ∪ M^ℓ_{y2} (Case 2). The following auxiliary fact follows directly from the minimality property.

Fact 1 Word x is absent from word y if and only if x is a superword of a MAW of y.
For Case 1, we prove the following lemma.
Lemma 2 Let x ∈ M^ℓ_{y1} ∪ M^ℓ_{y2}. Then x ∈ M^ℓ_{y1,y2} if and only if x is a superword of a word in M^ℓ_{y2} (when x ∈ M^ℓ_{y1}) or a superword of a word in M^ℓ_{y1} (when x ∈ M^ℓ_{y2}).

Proof Let x ∈ M^ℓ_{y1} (the case x ∈ M^ℓ_{y2} is symmetric). Suppose first that x is a superword of a word in M^ℓ_{y2}, that is, there exists v ∈ M^ℓ_{y2} such that v is a factor of x. If v = x, then x ∈ M^ℓ_{y1} ∩ M^ℓ_{y2} and therefore, using the definition of MAW, x ∈ M^ℓ_{y1,y2}.
If v is a proper factor of x, then x is an absent word of y2 and again, by definition of MAW, x ∈ M^ℓ_{y1,y2}. Suppose now that x is not a superword of any word in M^ℓ_{y2}. Then x is not absent from y2 by Fact 1; hence x occurs in y2, and thus x cannot belong to M^ℓ_{y1,y2}.
It should be clear that the statement of Lemma 2 implies, in particular, that all words in M^ℓ_{y1} ∩ M^ℓ_{y2} belong to M^ℓ_{y1,y2}. Furthermore, Lemma 2 motivates us to introduce the reduced set of MAWs of y1 with respect to y2 as the set R_{y1} obtained from M^ℓ_{y1} after removing those words that are superwords of words in M^ℓ_{y2}. The set R_{y2} is defined analogously.
Example 1 Let y1 = abaab and y2 = bbaaab, with ℓ = 5, so that M^ℓ_{y1} = {aaa, aaba, bab, bb} and M^ℓ_{y2} = {aba, abb, bab, bbb, aaaa, baab}. The word bab is contained in M^ℓ_{y1} ∩ M^ℓ_{y2}, so it belongs to M^ℓ_{y1,y2}. The word aaba ∈ M^ℓ_{y1} is a superword of aba ∈ M^ℓ_{y2}, hence aaba ∈ M^ℓ_{y1,y2}. On the other hand, the words bbb, aaaa and abb of M^ℓ_{y2} are superwords of words in M^ℓ_{y1}, hence they belong to M^ℓ_{y1,y2}. The remaining MAWs are not superwords of MAWs of the other word. The reduced sets are therefore R_{y1} = {bb, aaa} and R_{y2} = {baab, aba}. In conclusion, we have for Case 1 that M^ℓ_{y1,y2} ∩ (M^ℓ_{y1} ∪ M^ℓ_{y2}) = (M^ℓ_{y1} \ R_{y1}) ∪ (M^ℓ_{y2} \ R_{y2}). We now investigate the set M^ℓ_{y1,y2} \ (M^ℓ_{y1} ∪ M^ℓ_{y2}) (Case 2).
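The reduced sets can be reproduced by brute force, assuming the words y1 = abaab and y2 = bbaaab of Example 2 and ℓ = 5 (helper names are ours):

```python
from itertools import product

SIGMA = "ab"

def factors(w):
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def maws(y, ell):
    """Minimal absent words of y of length at most ell, by brute force."""
    F = factors(y)
    out = {c for c in SIGMA if c not in F}
    for n in range(2, ell + 1):
        for t in product(SIGMA, repeat=n):
            x = "".join(t)
            if x not in F and x[:-1] in F and x[1:] in F:
                out.add(x)
    return out

def reduced(m_self, m_other):
    """Drop every word that is a superword of some word of the other set."""
    return {x for x in m_self if not any(v in x for v in m_other)}

m1, m2 = maws("abaab", 5), maws("bbaaab", 5)
r1, r2 = reduced(m1, m2), reduced(m2, m1)
print(sorted(r1))  # ['aaa', 'bb']
print(sorted(r2))  # ['aba', 'baab']
```

Note that a word of M^ℓ_{y1} ∩ M^ℓ_{y2} (such as bab) is a superword of itself, so it is removed from both reduced sets.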
Fact 2 Let x = aub ∈ M^ℓ_{y1,y2} \ (M^ℓ_{y1} ∪ M^ℓ_{y2}), with a, b ∈ Σ. Then au occurs in y1 but not in y2 and ub occurs in y2 but not in y1, or vice versa.
The rationale for generating the reduced sets should become clear with the next lemma.
Lemma 3 Let x = aub ∈ M^ℓ_{y1,y2} \ (M^ℓ_{y1} ∪ M^ℓ_{y2}), with a, b ∈ Σ. Then there exist x2 ∈ R_{y2} that is a prefix of au and x1 ∈ R_{y1} that is a suffix of ub, or vice versa.

Proof By Fact 2, au occurs in y1 but not in y2 and ub occurs in y2 but not in y1, or vice versa. Let us assume the first case holds (the other case is symmetric). Since au does not occur in y2, we have by Fact 1 that there is a MAW x2 ∈ M^ℓ_{y2} that is a factor of au. Since ub occurs in y2, x2 is not a factor of ub. Consequently, x2 is a prefix of au. Moreover, x2 occurs in y1 (being a factor of au), so every factor of x2 occurs in y1 and x2 cannot be a superword of a word in M^ℓ_{y1}; hence x2 ∈ R_{y2}.
Analogously, there is an x1 ∈ M^ℓ_{y1} that is a suffix of ub, and x1 ∈ R_{y1}. Furthermore, x1 and x2 cannot be factors of one another.
Inspect Fig. 1 in this regard.

Fig. 1 x2 occurs in y1 but not in y2; x1 occurs in y2 but not in y1; therefore aub does not occur in y1#y2. By construction, au occurs in y1 and ub occurs in y2; therefore aub is a Case 2 MAW.

Example 2 Let y1 = abaab, y2 = bbaaab and ℓ = 5.
We have that abaa occurs in y1 but not in y2, and baaa occurs in y2 but not in y1. Since abaa does not occur in y2, there is a MAW x2 ∈ R_{y2} that is a factor of abaa. Since baaa occurs in y2, x2 is not a factor of baaa. So x2 is a prefix of abaa, and this is aba. Analogously, there is a MAW x1 ∈ R_{y1} that is a suffix of baaa, and this is aaa. Hence x = abaaa ∈ M^ℓ_{y1,y2} \ (M^ℓ_{y1} ∪ M^ℓ_{y2}).
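Putting Example 2 together: the word abaaa (= aub with au = abaa and ub = baaa) is absent from both words while abaa occurs only in y1 and baaa only in y2, so it is a Case 2 MAW. A quick membership check:

```python
y1, y2 = "abaab", "bbaaab"
au, ub = "abaa", "baaa"
x = au + ub[-1]                     # aub = "abaaa"

assert au in y1 and au not in y2    # au occurs in y1 only
assert ub in y2 and ub not in y1    # ub occurs in y2 only
assert x not in y1 and x not in y2  # aub is absent from both words
assert x.startswith("aba") and x.endswith("aaa")  # x2 = aba, x1 = aaa
print(x)  # abaaa
```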
As a consequence of Lemma 3, in order to construct the set M^ℓ_{y1,y2} \ (M^ℓ_{y1} ∪ M^ℓ_{y2}), it suffices to look for words x = aub such that au starts with a word of one reduced set and ub ends with a word of the other. In order to construct the final set M^ℓ_{y1,…,yk}, we use Lemmas 2 and 3 incrementally. We summarize the whole approach in the following general theorem, which forms the theoretical basis of our technique. Its two cases are symmetric, thus only the proof of Case A is presented here.

Space-Efficient Algorithm
In this section, we describe how to transform the combinatorial properties of Section 3 to a space-efficient algorithm for computing the MAWs of a collection of words. Note that we do not analyze the time complexity of this algorithm.
In what follows, we say that word aub, where a and b are letters, is the overlap of the words au and ub.
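The overlap operation can be written as a small helper (the function name is ours); it merges au and ub into aub exactly when they agree on the shared word u:

```python
def overlap(au, ub):
    """Return aub if the longest proper suffix of au equals the longest
    proper prefix of ub (the shared word u); otherwise None."""
    if len(au) == len(ub) and au[1:] == ub[:-1]:
        return au + ub[-1]
    return None

print(overlap("abaa", "baaa"))  # abaaa
print(overlap("abaa", "abab"))  # None
```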
Suppose that we have only two words y1 and y2. We apply Lemma 2 to construct the reduced sets R_{y1} from M^ℓ_{y1} and R_{y2} from M^ℓ_{y2}. Recall that words in R_{y1} are factors of y2 and words in R_{y2} are factors of y1.
For every word in R_{yi}, we consider all its occurrences in yj and take all the factors obtained by extending these occurrences to the left up to length ℓ − 1. We then obtain the set ←R_{yi} of (ℓ − 1)-left extensions of words of R_{yi} in yj. Similarly, we define the set →R_{yi} of (ℓ − 1)-right extensions of words of R_{yi} in yj.
We take aaa from R_{y1}, search for it in y2, and extend it to the left to get baaa. The word bb cannot be extended further to the left, as it appears only at the beginning of y2. The correctness of this extension step is stated in Lemma 4, whose proof follows directly from Theorem 1.
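The (ℓ − 1)-left extensions can be enumerated naively (the function name is ours). With ℓ = 5, extending aaa inside y2 = bbaaab yields baaa, while bb admits no left extension:

```python
def left_extensions(x, yj, ell):
    """All factors wx of yj that end with an occurrence of x and have
    length at most ell - 1: the (ell-1)-left extensions of x in yj."""
    out = set()
    for i in range(len(yj) - len(x) + 1):
        if yj[i:i + len(x)] == x:
            lo = max(0, i - ((ell - 1) - len(x)))
            for s in range(lo, i + 1):
                out.add(yj[s:i + len(x)])
    return out

print(sorted(left_extensions("aaa", "bbaaab", 5)))  # ['aaa', 'baaa']
print(sorted(left_extensions("bb", "bbaaab", 5)))   # ['bb']
```

The symmetric right-extension function extends each occurrence to the right instead.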
As a consequence, in order to find the words in M^ℓ_{yi,yj} \ (M^ℓ_{yi} ∪ M^ℓ_{yj}), we take all the possible words that are obtained as the overlap of an (ℓ − 1)-right extension of a word in R_{yi} with an (ℓ − 1)-left extension of a word in R_{yj}.
It should now be clear how this approach can be generalized to any number of words. For the space-efficient algorithm, we will need to consider (ℓ − 1)-extensions of words in the reduced sets one at a time. For this, we define the set of (ℓ − 1)-extensions of a single reduced word. Formally, for a word x in R_{yi}, we define ←X = {wx ∈ Fact(yj) : |wx| ≤ ℓ − 1} and →X = {xw ∈ Fact(yj) : |xw| ≤ ℓ − 1}. Note that x ∈ ←X and x ∈ →X, and that ←R_{yi} = ∪_{x ∈ R_{yi}} ←X and →R_{yi} = ∪_{x ∈ R_{yi}} →X. For example, for the word x = aab ∈ R_{y3} in the previous example, we have ←X = {aab, aaab, baab}. We now describe our algorithm. Let T(x) be the suffix tree of word x and T(X) be the generalized suffix tree of the words in set X. Let us further define the following operation over T(x): occ(x, u) returns the starting positions of all occurrences of word u in word x. Let y^N_max denote the longest word in the collection {y1, …, yN}.
Let N = 1. We read y1 from memory, construct T(y1) [21], compute M^ℓ_{y1} [7], and construct T(M^ℓ_{y1}). We report M^ℓ_{y1} as our first output. The space used thus far is bounded by O(|y1| + ‖M^ℓ_{y1}‖). At the Nth step, we already have T(M^ℓ_{y1,…,yN−1}) in memory from the (N − 1)th step. We read yN from memory and construct T(yN). The incremental step, for all N ∈ [2, k], works as follows.
We read yN to construct T(yN). Then we consider one by one each pair (x1, x2) of R_{y1,…,yN−1} × R_{yN}. Using T(yN), we compute P_N = occ(yN, x1) and the set ←X of (ℓ − 1)-left extensions of x1 in yN. For all i ∈ [1, N − 1] we perform the following: 1. We read yi from memory and construct T(yi). 2. We compute P_i = occ(yi, x2) and the set →X of (ℓ − 1)-right extensions of x2 in yi. (a) We check the equality of the longest proper suffix of each →x2 ∈ →X and the longest proper prefix of each ←x1 ∈ ←X; each match yields a candidate Case 2 MAW as the overlap of →x2 and ←x1. We are now ready for step N + 1.
We arrive at the following result.

Theorem 2 All of M^ℓ_{y1}, …, M^ℓ_{y1,…,yk} can be computed using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y1, …, yk} and MaxOut = max{‖M^ℓ_{y1,…,yN}‖ : N ∈ [1, k]}.

Proof The space complexity follows from the above discussion. The correctness follows from Lemmas 2 and 4.
We obtain immediately the following corollary.

Time-Efficient Algorithm
In this section, we provide a time-efficient implementation of the algorithm presented in Section 4. Let us first introduce an algorithmic tool. In the weighted ancestor problem, introduced in [24], we consider a rooted tree T with an integer weight function μ defined on the nodes. We require that the weight of the root is zero and the weight of every non-root node is strictly larger than the weight of its parent. A weighted ancestor query, given a node v and an integer value w ≤ μ(v), asks for the highest ancestor u of v such that μ(u) ≥ w, i.e., such an ancestor u has the property that μ(u) ≥ w and μ(u) is the smallest possible. When T is the suffix tree of a word y of length n, we can locate the locus of any factor y[i..j] of y using a weighted ancestor query. We define the weight of a node of the suffix tree as the length of the word it represents. Thus a weighted ancestor query can be used for the terminal node labeled with i to create (if necessary) and mark the node that corresponds to y[i..j].
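The query semantics can be illustrated with a naive walk up the tree (this is only the definition, not the off-line batched algorithm of Theorem 3; node names and the toy tree are ours):

```python
def weighted_ancestor(parent, mu, v, w):
    """Highest (closest-to-root) ancestor u of v with mu[u] >= w,
    assuming weights strictly increase from the root (mu[root] = 0)
    and w <= mu[v]; naive O(depth) walk."""
    u = v
    while parent[u] is not None and mu[parent[u]] >= w:
        u = parent[u]
    return u

# A root-to-leaf path with weights 0 -> 2 -> 5
# (think of word-depths along a suffix-tree path).
parent = {"root": None, "a": "root", "b": "a"}
mu = {"root": 0, "a": 2, "b": 5}
print(weighted_ancestor(parent, mu, "b", 1))  # a
print(weighted_ancestor(parent, mu, "b", 3))  # b
```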

Theorem 3 ([25]) Given a collection Q of weighted ancestor queries on a weighted tree T on n nodes with integer weights up to n^{O(1)}, all the queries in Q can be answered off-line in O(n + |Q|) time.

The Algorithm
At the Nth step, we have in memory the set M^ℓ_{y1,…,yN−1}. Our time-efficient algorithm works as follows: 1. We read word yN from memory and compute M^ℓ_{yN} in time O(|yN|). We output the words in the previously described constant-space form ⟨i1, i2, α⟩ such that yN[i1..i2] · α ∈ M^ℓ_{yN}. 2. Here we compute Case 1 MAWs. We apply Lemma 2 to construct the reduced sets. We create set R_{y1,…,yN−1} explicitly, since it is a subset of M^ℓ_{y1,…,yN−1}. We create set R_{yN} implicitly: every element x ∈ R_{yN} is stored as a tuple ⟨i1, i2, α⟩ such that yN[i1..i2] · α = x; the corresponding loci are located with off-line weighted ancestor queries (Theorem 3). At this point, we have located the two nodes on Tx. We assign a pointer from the stored starting position g of au to the ending position f of ub (see Fig. 4), only if g is before # and f is after # (f can be computed using the stored starting position of ub and the length of ub). Conversely, we assign a pointer from the ending position f of ub to the stored starting position g of au, only if f is before # and g is after #. (e) Suppose au occurs in yi and ub in yN. We make use of the pointers as follows. Recall steps 3 and 4(a) and check whether au starts where a word r1 of R_{yN} starts and ub ends where a word r2 of R_{y1,…,yN−1} ends. If this is the case, the overlap of au and ub is reported as a Case 2 MAW. The total length of the words read over all steps is at most Σ_{N=1}^{k} Σ_{i=1}^{N} 2|yi| ≤ 2k(|y1| + · · · + |yk|).
Therefore, the time is bounded by O(kn + Σ_{N=1}^{k} ‖M^ℓ_{y1,…,yN}‖). The space is bounded by the maximum space used at a single step; namely, the length of the longest word in the collection plus the maximum total size of set elements across all output sets. Note that the total output size of the algorithm is the sum of the sizes of all its output sets, that is, Σ_{N=1}^{k} ‖M^ℓ_{y1,…,yN}‖, and MaxOut could come from any intermediate set. The correctness of the algorithm follows from Lemma 2 and Theorem 1.

Proof-of-Concept Experiments
In this section, we do not directly compare against the fastest internal [7] or external [18] memory implementations, because the former assumes that we have the required amount of internal memory, and the latter assumes that we have the required amount of external memory to construct and store the global data structures for a given input dataset. If the memory for constructing and storing the data structures is available, these linear-time algorithms are surely faster than the method proposed here. In what follows, we rather show that our output-sensitive technique offers a space-time tradeoff, which can be usefully exploited for specific values of ℓ, the maximal length of MAWs we wish to compute.
The time-efficient algorithm discussed in Section 5 (with the exception of storing and searching the reduced sets of words explicitly rather than in the constant-space form previously described) has been implemented in the C++ programming language. The correctness of our implementation has been confirmed against that of [7]. We have also implemented the algorithm discussed in Section 4, but as it was significantly slower, its results are omitted from here. As input datasets we used: the entire human genome (version hg38), which has an approximate size of 3.1GB; the entire mouse genome (version mm10), which has an approximate size of 2.6GB; and the entire chimp genome (version panTro6), which has an approximate size of 2.8GB. All datasets were downloaded from the UCSC Genome Browser [26]. The following experiments were conducted on a machine with an Intel Core i5-4690 CPU at 3.50 GHz and 128GB of memory running GNU/Linux. We ran the program by splitting the genomes into k = 2, 4, 6, 8, 10 blocks and setting ℓ = 10, 11, 12. In accordance with Theorem 4: graph (a) in all figures shows an increase of time as k and ℓ increase; and graph (b) in all figures shows a decrease in peak memory as k increases. Notice that the space to construct the block-wise data structures bounds the total space used for the specific ℓ values, and that is why the memory peak is essentially the same for the ℓ values used. This can specifically be seen for ℓ = 10, where all words of length 10 are present in all three genomes. The same datasets were used to run the fastest internal memory implementation for computing MAWs [7] on the same machine for ℓ = 12. Notice that the algorithm of [7] takes the same time and space irrespective of ℓ. It took only 2934 seconds to process the human genome but with a peak memory usage of 75.40GB (it took 2844 seconds to process the mouse genome with a peak memory usage of 66.54GB; and 2731 seconds to process the chimp genome with a peak memory usage of 69.49GB).
These results confirm our theoretical findings and justify our contribution.

Final Remarks
We presented a new technique for constructing antidictionaries of long texts in output-sensitive space. Let us conclude with the following remarks: 1. Any space-efficient algorithm designed for global data structures (such as [20]) can be directly applied to the k documents in our technique to further reduce the working space. 2. There is a connection between MAWs and other word regularities [9]. Our technique could potentially be applied to computing these regularities in output-sensitive space. 3. Our technique could serve as a basis for a new parallelization scheme for constructing antidictionaries (see also [27]), in which several documents are processed concurrently.
Funding Open access funding provided by Università degli Studi di Palermo within the CRUI-CARE Agreement.