Efficient Indexes for Jumbled Pattern Matching with Constant-Sized Alphabet
Abstract
We introduce efficient indexes for a problem in non-standard stringology: jumbled pattern matching. An index is a data structure constructed for a text of length n over an alphabet of size \(\sigma \) that can answer queries asking if the text contains a fragment which is jumbled (Abelian) equivalent to a pattern, specified by its so-called Parikh vector. We denote the length of the pattern by m. Moosa and Rahman (J Discrete Algorithms 10:5–9, 2012) gave an index for the case of binary alphabets with \(\mathcal {O}\left( \frac{n^2}{(\log n)^2}\right) \)-time construction in the word-RAM model. Several earlier papers stated as an open problem the existence of an efficient solution for larger alphabets. In this paper we develop an index for any constant-sized alphabet. The construction involves a tradeoff parameter, which in particular lets us achieve the following complexities: \(\mathcal {O}(n^{2-\delta })\) space and \(\mathcal {O}(m^{(2\sigma -1)\delta })\) query time for any \(0<\delta <1\), or \(\mathcal {O}\left( \frac{n^2 (\log \log n)^2}{\log n}\right) \) space and polylogarithmic, \(o(\log ^{2\sigma -1} m)\), query time. The construction time in both cases is subquadratic: \(\mathcal {O}\left( \frac{n^2 (\log \log n)^2}{\log n}\right) \) in the word-RAM model (using bit-parallelism). Our construction algorithms are randomized (Las Vegas, running time w.h.p.), which is due to the usage of perfect hashing. On the other hand, all queries are answered deterministically. A preliminary version of this work appeared at ESA 2013 (Kociumaka et al. in Algorithms, ESA 2013. LNCS, vol 8125. Springer, Berlin, pp. 625–636, 2013). Here we improve it in several ways. We achieve \(\mathcal {O}(n^2)\)-time construction of the index with \(\mathcal {O}(n^{2-\delta })\) space and \(\mathcal {O}(m^{(2\sigma -1)\delta })\) query time, which was not present in the preliminary version.
We also extend the index so that the position of the leftmost occurrence of the query pattern is provided at no additional cost in the complexity; this required rather non-trivial changes in the construction algorithm.
Keywords
Jumbled indexing · Jumbled pattern matching · Abelian equivalence · Histogram indexing
1 Introduction
1.1 The Binary Case
Most results related to indexes for jumbled pattern matching so far have been obtained for binary words. Cicalese et al. [11] proposed an index with \(\mathcal {O}(n)\) size and \(\mathcal {O}(1)\) query time and gave an \(\mathcal {O}(n^2)\)-time construction algorithm for the index. The key observation used in this index is that if a binary word contains two factors of length \(\ell \) containing i and j ones respectively, then it must contain a factor of length \(\ell \) with any intermediate number of ones. The index provides only a yes/no answer for a query pattern; additional \(\mathcal {O}(\log n)\) time can be used to restore a witness occurrence [12]. The construction time was improved independently by Burcsi et al. [6] (see also [7, 8]) and Moosa and Rahman [20] to \(\mathcal {O}\left( \frac{n^2}{\log {n}}\right) \), and then by Moosa and Rahman [21] to \(\mathcal {O}\left( \frac{n^2}{(\log {n})^2}\right) \). All these results work in the word-RAM model. For trees vertex-labeled with \(\{0,1\}\), an index with \(\mathcal {O}\left( \frac{n^2}{(\log {n})^2}\right) \) construction time, \(\mathcal {O}(n)\) size and \(\mathcal {O}(1)\) query time was given in [15]. Hermelin et al. [17] reduced binary jumbled indexing to the all-pairs shortest paths problem and used the latest results of Williams for the latter problem [23] to obtain a preprocessing time of \(\mathcal {O}\left( \frac{n^2}{2^{\Omega ((\log n{/}\log \log n)^{0.5})}}\right) \) for binary jumbled indexing on both words and trees (a similar reduction was shown by Bremner et al. [5]). The general problem of computing an index for jumbled pattern matching in graphs is known to be NP-complete [13, 19] but fixed-parameter tractable by the pattern size [13] (see also [4]).
1.2 Indexes for Larger Alphabets
For arbitrary alphabets, Amir et al. [1] presented an index with \(\mathcal {O}(n^{1+\varepsilon })\) space, \(\mathcal {O}(n^{1+\varepsilon }\log \sigma )\) construction time and \(\mathcal {O}(m^{\frac{1}{\varepsilon }}+\log \sigma )\) query time for any positive \(\varepsilon <1\). Nevertheless, this query time is o(n) only for \(m= o(n^\varepsilon )\). Jumbled pattern matching in a run-length encoded text over an arbitrary alphabet was considered in [9].
Amir et al. [2] presented hardness results for jumbled indexing over large alphabets. They showed that, under the 3SUM-hardness assumption, for \(\sigma = \omega (1)\) jumbled indexing requires \(\Omega (n^{2-\epsilon })\) preprocessing time or \(\Omega (n^{1-\delta })\) query time for every \(\epsilon ,\delta >0\). Furthermore, under the strong 3SUM-hardness assumption, for \(\sigma \ge 3\) jumbled indexing requires \(\Omega (n^{2-\epsilon _\sigma })\) preprocessing time or \(\Omega (n^{1-\delta _\sigma })\) query time, where \(\epsilon _\sigma ,\delta _\sigma <1\) are computable constants. Recall that the 3SUM problem asks if one can choose elements \(a\in A\), \(b\in B\), and \(c\in C\) from given integer sets A, B, C so that \(a+b = c\). It is believed that this problem cannot be solved in strongly subquadratic time; for precise formulations of the related hardness assumptions, see [2].
Several researchers (see, e.g., [8, 20, 21]) posed an open problem asking for a construction of an \(o(n^2)\) indexing scheme with o(n) query time for general alphabets. In particular, even for a ternary alphabet none was known, since the basic observation used to obtain a binary index is not applicable to any larger alphabet.
1.3 Our Results
We prove that the answer to the open problem asking for a subquadratic jumbled index with sublinear-time queries is positive for any constant-sized alphabet. We show an index of size \(\mathcal {O}\left( \frac{n^2}{L}\right) \) answering queries in \(\mathcal {O}(L^{2\sigma -1})\) time, where L is a tradeoff parameter which can attain any given value between 1 and n. For some choices of L we also improve the query time so that it depends on the pattern size m only. More precisely, we show an index of size \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log {n}}\right) \) which enables queries in \(\mathcal {O}\left( \left( \frac{\log m}{(\log \log m)^2}\right) ^{2\sigma -1}\right) \) time, and for any \(0 < \delta < 1\) an index of size \(\mathcal {O}(n^{2-\delta })\) with \(\mathcal {O}(m^{\delta (2\sigma -1)})\)-time queries. Both these variants take \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log {n}}\right) \) time to construct, and the query algorithm provides the leftmost occurrence of the pattern if it exists. Our index works in the word-RAM model with word size \(\Omega (\log n)\) [16], and the construction algorithm uses bit-parallelism. The construction algorithm is randomized (Las Vegas, running time w.h.p.) due to perfect hashing. On the other hand, all query algorithms are deterministic.
After the submission of this journal paper, Chan and Lewenstein [10] presented a breakthrough work where they improved the construction time of the binary index to \(\mathcal {O}(n^{1.859})\). They also obtained an index with strongly subquadratic construction and strongly sublinear queries for larger alphabets. Moreover, Chan and Lewenstein provide an extension of our idea of heavy and light factors which improves query time in the index.
1.4 Organization of the Paper
Section 2 is devoted mostly to combinatorial properties of Parikh vectors. In Sect. 3 we describe the basic version of the index together with the query algorithm. The size of the index is \(\mathcal {O}\left( \frac{n^2}{L}\right) \) and the queries are answered in \(\mathcal {O}(L^{2\sigma -1})\) time. We also present a naive \(\mathcal {O}(n^2)\)-time construction algorithm. In Sect. 4 we introduce variants of the index whose query time depends on the pattern size rather than the text size. Using the auxiliary tools based on bit-parallelism in the word-RAM model that we provide in Sect. 5, in Sect. 6 we develop a subquadratic-time construction algorithm for the index. Section 7 contains some concluding remarks.
2 Preliminaries
In this paper we assume that the alphabet \(\Sigma \) is \(\{1,2,\ldots ,\sigma \}\) for \(\sigma =\mathcal {O}(1)\). Let \(x\in \Sigma ^n\). By \(x_i\) (for \(i \in \{1,\ldots ,n\}\)) we denote the ith letter of x. A word of the form \(x_i \ldots x_j\), also denoted as \(x[i \ldots j]\), is called a factor of x. We say that the factor \(x[i\ldots j]\) occurs at position i. A factor of the form \(x[1\ldots i]\) is called a prefix of x. If \(i>j\), then \(x[i\ldots j]\) represents the empty word.
Example 2.1
\(\mathcal {P}(1\,2\,3\,1\,2\,3\,4\,3\,2\,1\,3\,4\,1\,2\,2\,3\,1\,3\,4\,3)= (5,\,5,\, 7,\, 3)\).
Example 2.2
\(1\,2\,2\,1 \approx 2\,2\,1\,1\), since \(\mathcal {P}(1\,2\,2\,1)=(2,2)=\mathcal {P}(2\,2\,1\,1)\).
For a fixed word x, we define \( Occ (p)\) as the set of the starting positions of all occurrences of factors of x with Parikh vector p. For the zero Parikh vector \(\bar{0}_\sigma \) we assume that \( Occ (\bar{0}_\sigma )=\{1,\ldots ,n+1\}\). If \( Occ (p)\ne \emptyset \), we say that p occurs in x (at each position \(i\in Occ (p)\)), or that p is an Abelian factor of x.
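For concreteness, the notions of Parikh vector, jumbled (Abelian) equivalence, and \( Occ (p)\) can be illustrated with a short brute-force sketch (the function names are ours; the efficient index of Sect. 3 avoids the scan performed by `occ`):

```python
from collections import Counter

def parikh(w, sigma):
    """Parikh vector of w over {1, ..., sigma}: letter multiplicities."""
    c = Counter(w)
    return tuple(c[a] for a in range(1, sigma + 1))

def occ(x, p, sigma):
    """Occ(p): 1-based starting positions of factors of x whose Parikh
    vector is p (brute force over all windows of length |p|)."""
    m = sum(p)  # the norm of p, i.e., the length of matching factors
    return {i + 1 for i in range(len(x) - m + 1)
            if parikh(x[i:i + m], sigma) == p}

# Examples 2.1 and 2.2 revisited:
assert parikh([1,2,3,1,2,3,4,3,2,1,3,4,1,2,2,3,1,3,4,3], 4) == (5, 5, 7, 3)
assert parikh([1,2,2,1], 2) == parikh([2,2,1,1], 2)  # 1221 ~ 2211
```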
Example 2.3
Lemma 2.4
 (a)
\(| Ext ^+_{= r}(p)|, | Ext ^-_{= r}(p)| \le 2r^{\sigma -1}\);
 (b)
\(| Ext ^+_{< r}(p)| \le r^\sigma \).
Proof
Let us also introduce an efficient tool for determining Parikh vectors of factors of a given word.
Lemma 2.5
A text x of length n can be preprocessed in \(\mathcal {O}(n)\) time so that for any \(1 \le i \le j \le n\), the Parikh vector \(\mathcal {P}(x[i \ldots j])\) can be computed in \(\mathcal {O}(1)\) time.
Proof
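A minimal sketch of the standard prefix-count approach behind Lemma 2.5 (class and method names are ours; the \(\mathcal {O}(1)\) query bound uses \(\sigma =\mathcal {O}(1)\)):

```python
class ParikhOracle:
    """Prefix counts: pref[i][a-1] = number of occurrences of letter a
    in the prefix x[1..i].  The Parikh vector of a factor is then the
    difference of two prefix vectors, so after O(n)-time preprocessing
    (for constant sigma) each query takes O(sigma) = O(1) time."""

    def __init__(self, x, sigma):
        self.sigma = sigma
        self.pref = [[0] * sigma]
        for a in x:
            row = self.pref[-1][:]
            row[a - 1] += 1
            self.pref.append(row)

    def parikh(self, i, j):
        """Parikh vector of x[i..j] (1-based, inclusive)."""
        return tuple(self.pref[j][a] - self.pref[i - 1][a]
                     for a in range(self.sigma))
```

Two factors are jumbled equivalent exactly when the oracle returns the same vector for both, which is the constant-time check used by the query algorithms below.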
3 Index with Sublinear-Time Queries
In this section we describe an index for jumbled pattern matching which has subquadratic size and allows sublinear-time queries. We also show a simple \(\mathcal {O}(n^2)\)-time construction of the index. Let us fix a tradeoff parameter \(L \in \{1,\ldots ,n\}\); the space of the index and the query time will depend on L.
3.1 A Sketch of the Algorithm
3.1.1 The “Light” Case
If p is a light L-factor, then we can afford to iterate through all its occurrences in x. To find the position i we use the following observation:
Observation 3.1
\(p \in Ext ^-_{= r}(q)\) and \(i \in Occ (p)\).
Hence, the query for q is answered by iterating through all possible generating L-factors \(p \in Ext ^-_{= r}(q)\), filtering out those which are not light L-factors, and then iterating through all elements \(i \in Occ (p)\). Lemma 2.4(a) limits the size of the set \( Ext ^-_{= r}(q)\), and the size of the set \( Occ (p)\) is bounded due to the fact that p is light.
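The light-case query just described can be sketched as follows. This is an unoptimized model, not the paper's data structure: all names are ours, `light_occ` stands for part (a) of the index, the generating L-factor of a pattern of norm m is assumed to have norm \(m - (m \bmod L)\), and a direct Parikh-vector comparison replaces the constant-time check of Lemma 2.5:

```python
from itertools import product

def parikh(w, sigma):
    v = [0] * sigma
    for a in w:
        v[a - 1] += 1
    return tuple(v)

def ext_minus(q, r):
    """Ext^-_{=r}(q): Parikh vectors p <= q componentwise with |q| - |p| = r."""
    for d in product(*(range(min(c, r) + 1) for c in q)):
        if sum(d) == r:
            yield tuple(c - dc for c, dc in zip(q, d))

def query_light(x, q, L, light_occ, sigma):
    """Light case: enumerate candidate generating L-factors p and verify
    each of their occurrences by a Parikh-vector comparison; returns the
    leftmost occurrence (1-based) of q, or None."""
    m = sum(q)
    r = m % L                          # norm difference |q| - |p|
    hits = []
    for p in ext_minus(q, r):
        for i in light_occ.get(p, ()):           # heavy/absent p: skip
            if i + m - 1 <= len(x) and parikh(x[i - 1:i - 1 + m], sigma) == q:
                hits.append(i)
    return min(hits) if hits else None

# toy instance in which every L-factor is treated as light
x, L, sigma = [1, 2, 2, 1, 1, 2, 1, 2], 2, 2
light_occ = {}
for i in range(1, len(x) + 1):
    for k in range(L, len(x) - i + 2, L):
        light_occ.setdefault(parikh(x[i - 1:i - 1 + k], sigma), []).append(i)
```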
3.1.2 The “Heavy” Case
For each heavy L-factor p we store all Abelian factors generated by it (this set is denoted as \(\mathcal {D}_L(p)\)), together with their leftmost occurrences generated by p. To answer a query for a Parikh vector q in this case, we simply return the precomputed answer.
We need to argue that the space used here is small enough. A single L-factor p generates at most \(| Ext ^+_{< L}(p)|\) different Abelian factors. Even though a heavy L-factor has many occurrences in x, the total number of occurrences of heavy L-factors is still \(\mathcal {O}\left( \frac{n^2}{L}\right) \). Hence, we can afford \(\mathcal {O}(\sum _{p\ \mathrm{heavy}} | Occ (p)|)\) space. If we pick the threshold value on the number of occurrences of light/heavy L-factors so that every heavy L-factor p satisfies the condition \(| Occ (p)| \ge | Ext ^+_{< L}(p)|\), then we can indeed afford to store all Abelian factors generated by each heavy L-factor.
In Sect. 3.2 we formally define the notions of light and heavy L-factors and the set \(\mathcal {D}_L\). Then in Sect. 3.3 we give a full description of the data structure and the query algorithm. A simple construction of the index (subject to improvement in the later sections) is shown in Sect. 3.4. Finally, in Sect. 3.5 we analyze the complexity of the index depending on the tradeoff parameter L.
3.2 Combinatorial Tools
Fact 3.2
\(\sum _{p\in \mathcal {F}_L} | Occ (p)|\le \frac{n^2}{L}+n+1\).
Proof
Example 3.3
Example 3.4
For the text x \(\, =\, \)2 2 2 1 1 1 2 1 2 1 1 2 1 2 2 1 1 2 1 2 1 from Example 3.3 (\(L=3\)) we have \(\mathcal {D}_L((2,1)) = \{(2,1), (3,1), (3,2), (2,2), (2,3)\}\) and the corresponding positions \( Pos _{(2,1)}(q)\) are: \( Pos _{(2,1)}((2,1)) = Pos _{(2,1)}((3,1)) = Pos _{(2,1)}((3,2)) = 3\), \( Pos _{(2,1)}((2,2))=6\), and \( Pos _{(2,1)}((2,3))=11\); see Fig. 3.
3.3 Data Structure and Queries
 (a)
a dictionary with keys \(p \in \mathcal {L}_L\) and values \( Occ (p)\),
 (b)
a dictionary with keys \(q \in \mathcal {D}_L\) and values \( Pos _{\mathcal {H}_L}(q)\),
 (c)
the data structure of Lemma 2.5 to retrieve the Parikh vectors of factors.
Lemma 3.5
The size of INDEX\(_L(x)\) is \(\mathcal {O}\bigl (\frac{n^2}{L}\bigr )\).
Proof
The size of part (a) of the index is \(\sum _{p\in \mathcal {L}_L} | Occ (p)|\). By Fact 3.2, this is at most \(\frac{n^2}{L} + n + 1 = \mathcal {O}\left( \frac{n^2}{L}\right) \).
The size of part (b) is \(|\mathcal {D}_L|\). To bound the size of this set we use the following claim.
Claim
For every \(p \in \mathcal {H}_L\), \(|\mathcal {D}_L(p)| < | Occ (p)|\).
Proof
Finally, the size of part (c) of the index is \(\mathcal {O}(n)\) due to Lemma 2.5. Hence, the whole index uses \(\mathcal {O}(n^2{/}L)\) space. \(\square \)
Lemma 3.6
The query time in INDEX\(_L(x)\) is \(\mathcal {O}(L^{2\sigma -1})\).
Proof
The query algorithm works as presented in the pseudocode above. Let us bound its running time. By Lemma 2.4(a), \(| Ext ^-_{= r}(q)| = \mathcal {O}(r^{\sigma -1}) = \mathcal {O}(L^{\sigma -1})\). All elements \(p \in Ext ^-_{= r}(q)\) can be listed in \(\mathcal {O}(L^{\sigma -1})\) time. For each \(p \in \mathcal {L}_L\), by definition, \(| Occ (p)| \le L^\sigma \). Thus, with the constant-time equivalence queries of Lemma 2.5, we obtain the desired \(\mathcal {O}(L^{2\sigma -1})\) query time. \(\square \)
3.4 Simple Construction of the Index
Lemma 3.7
INDEX\(_L(x)\) can be constructed in \(\mathcal {O}(n^2)\) time w.h.p.
Proof
As the first step, we construct \(\mathcal {F}_L\) together with \( Occ (p)\) for each \(p\in \mathcal {F}_L\). We store \(\mathcal {F}_L\) as a dictionary (a hash table with perfect hashing). This component can be constructed in \(\mathcal {O}\left( \frac{n^2}{L}\right) \) time using Lemma 2.5. Moreover, it allows us to compute \(\mathcal {L}_L\) and \(\mathcal {H}_L\) within the same time.
We construct \(\mathcal {D}_L\) using the algorithm from the pseudocode. With the aid of Lemma 2.5 it runs in \(\mathcal {O}(n^2)\) time. Randomization of the construction is due to perfect hashing used to implement the dictionaries in the index. \(\square \)
3.5 Complexity of the Index
The following theorem summarizes the results of the previous subsection.
Theorem 3.8
For any text of length n and any integer \(1 \le L \le n\) there exists an index for jumbled pattern matching of size \(\mathcal {O}\left( \frac{n^2}{L}\right) \) and with \(\mathcal {O}(L^{2\sigma -1})\) query time. The index can be constructed in \(\mathcal {O}(n^2)\) time w.h.p.
For some particular values of the tradeoff parameter L we obtain particularly useful indexes.
Corollary 3.9
For any text of length n one can construct in \(\mathcal {O}(n^2)\) time an index for jumbled pattern matching of size \(\mathcal {O}\left( \frac{n^2 (\log \log n)^2}{\log n}\right) \) which answers queries in \(\mathcal {O}\left( \left( \frac{\log n}{(\log \log n)^2}\right) ^{2\sigma -1}\right) \) time.
Proof
We take \(L = \left\lceil \frac{\log n}{(\log \log n)^2} \right\rceil \) and apply Theorem 3.8. \(\square \)
Corollary 3.10
For any text of length n and any \(0<\delta <1\) one can construct in \(\mathcal {O}(n^2)\) time an index for jumbled pattern matching of size \(\mathcal {O}(n^{2-\delta })\) which answers queries in \(\mathcal {O}(n^{(2\sigma -1)\delta })\) time.
Proof
We take \(L = \lceil n^{\delta } \rceil \) and apply Theorem 3.8. \(\square \)
In Sect. 6 we show that the data structure from Theorem 3.8 can be constructed in \(\mathcal {O}\left( \max \left( \frac{n^2}{L},\frac{n^2 (\log \log n)^2}{\log n}\right) \right) \) time. This yields \(\mathcal {O}\left( \frac{n^2 (\log \log n)^2}{\log n}\right) \)-time construction in the special cases of both corollaries. However, first we improve the query time.
4 Faster Queries for Small Patterns
While \(\mathcal {O}(n^{(2\sigma -1)\delta })\) is sublinear in n for small \(\delta \), it is still rather large and, especially for very small patterns, might be considered unsatisfactory. We modify the data structure to handle such patterns much more efficiently, in \(\mathcal {O}(m^{(2\sigma -1)\delta })\) time for patterns of norm m. We start with an auxiliary data structure.
Lemma 4.1
For any text of length n and any integers L, k such that \(1\le L \le k \le n\), there exists an index for jumbled pattern matching of size \(\mathcal {O}\left( \frac{nk}{L}\right) \) and with \(\mathcal {O}(L^{2\sigma -1})\) query time for patterns of norm at most k. The index can be constructed in \(\mathcal {O}(nk)\) time w.h.p.
Proof
We repeat the proof of Theorem 3.8, but we restrict our attention to L-factors of norm at most k. Let \(\mathcal {F}_{L,k}\) denote the set of these factors. Similarly to the case of \(\mathcal {F}_L\), we obtain \(|\mathcal {F}_{L,k}| \le \frac{n\cdot k}{L}\). The rest of the construction is the same as before. \(\square \)
Theorem 4.2
For any text of length n and any \(0<\delta <1\) there exists an index for jumbled pattern matching of size \(\mathcal {O}(n^{2-\delta })\) and with \(\mathcal {O}(m^{(2\sigma -1)\delta })\) query time, where m is the norm of the pattern. The index can be constructed in \(\mathcal {O}(n^2)\) time w.h.p.
Proof
A similar argument gives an improvement of Corollary 3.9.
Theorem 4.3
For any text of length n there exists an index for jumbled pattern matching of size \(\mathcal {O}\left( \frac{n^2 (\log \log n)^2}{\log n}\right) \) and with \(\mathcal {O}\left( \left( \frac{\log m}{(\log \log m)^2}\right) ^{2\sigma -1}\right) \) query time, where m is the norm of the pattern. The index can be constructed in \(\mathcal {O}(n^2)\) time w.h.p.
Proof
In Sect. 6 we show that the data structures of Theorems 4.2 and 4.3 can actually be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \) time.
5 Efficient Element Location in Packed Lists
In this section we consider an auxiliary problem of finding the first occurrence for each distinct value occurring in a given list. We focus on lists of length greater than the size of the universe of values and we aim at sublinear time in the length of the lists, which requires a suitable compact representation of these lists. The algorithm developed in this section is later used to obtain a subquadratic-time construction of our index for jumbled pattern matching.
Let w be a lower bound on the machine word size of the word-RAM machine and let \(U=\{0,\ldots ,N-1\}\) for \(N\le 2^w\) be the universe. Each element of the universe can be stored in binary using \(B=\left\lceil \log N \right\rceil \) bits, and therefore a single machine word can fit up to \(M=\lfloor \frac{w}{B}\rfloor \) elements one after the other. Such a sequence of up to M elements stored in a single word is called a short list, often denoted as \(\ell \) in this section. If the universe U is binary, i.e., if \(U=\{0,1\}\), we refer to short lists as short bitmasks.
Note that the short list does not store its length and thus in general we cannot, for example, tell the difference between a given list of length \(m<M\) and one of the valid lists of length M. Consequently, we need to know the length m to interpret the encoding. The following fact describes how short lists can be used to store lists of larger length m in roughly \(\frac{m}{M}\) machine words. The resulting data structure is called here a packed list.
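A short list is thus simply an integer whose base-\(2^B\) digits are the elements. A minimal sketch, with a word size of 64 bits as an assumption (function names are ours):

```python
W = 64                              # assumed machine word size

def pack(elems, B):
    """Pack up to M = W // B elements of B bits each into one integer;
    element j occupies bits j*B .. (j+1)*B - 1."""
    assert len(elems) <= W // B
    word = 0
    for j, e in enumerate(elems):
        word |= e << (j * B)
    return word

def unpack(word, m, B):
    """Decode a short list of m elements; m must be supplied separately,
    since the encoding itself does not determine the length."""
    mask = (1 << B) - 1
    return [(word >> (j * B)) & mask for j in range(m)]
```

Note how `unpack` takes the length m as a parameter, reflecting the remark above that the encoding alone cannot distinguish a shorter list from some list of length M.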
Fact 5.1
 Push

append a given short list at the end of \(\mathbf {L}\),
 Pop

return a short list of \(m^{\prime }\), \(m^{\prime }\le \min (m,M)\), first elements of \(\mathbf {L}\) and remove those elements from \(\mathbf {L}\),
 Length

return the length m of \(\mathbf {L}\).
Proof

concatenate two short lists of lengths m and \(m^{\prime }\) (with \(m+m^{\prime }\le M\)),

retrieve a short list of \(m^{\prime }\) heading or trailing elements of a given short list of length m (\(m^{\prime }\le m\le M\)).
The implementation of the Length query is trivial and so is updating the length m under Push and Pop operations.
To append a short list \(\ell \) of \(m^{\prime }\) elements, we extract the first \(\min (M-m_e,m^{\prime })\) elements of \(\ell \) and concatenate the resulting short list with the tail \(\ell _e\) of the queue. If \(m^{\prime }>M-m_e\), we also extract the remaining elements of \(\ell \) and append the resulting short list as the new tail of the queue. The length \(m_e\) needs to be updated accordingly (as does \(m_b\) if the queue had length at most one).
The implementation of the operation \( Pop \) is similar. First, we extract the first \(\min (m^{\prime },m_b)\) elements from the head \(\ell _b\) and then replace \(\ell _b\) with its remaining elements. If \(m^{\prime }< m_b\), we are done and we return the extracted short list. Otherwise, \(\ell _b\) becomes empty, so we drop it from the queue and remove the first \(m^{\prime }-m_b\) elements from the new head. The resulting list is formed by the concatenation of the two extracted short lists. \(\square \)
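The operations of Fact 5.1 can be modeled in plain Python as follows (a sketch, not the word-RAM implementation: each short list is kept as an integer paired with its length, class and method names are ours, and the word size is an assumption):

```python
from collections import deque

class PackedList:
    """Queue of B-bit elements stored as short lists (one integer per
    word, up to M = W // B elements each): Push appends a short list,
    Pop removes up to min(k, M) leading elements."""
    W = 64                          # assumed machine word size

    def __init__(self, B):
        self.B = B
        self.M = self.W // B
        self.q = deque()            # (word, count) short lists
        self.m = 0                  # total number of elements

    def length(self):
        return self.m

    def push(self, word, k):
        """Append a short list `word` holding k <= M elements: first fill
        the tail short list, then start a new one if needed."""
        if self.q:
            tw, tc = self.q[-1]
            take = min(self.M - tc, k)
            if take:
                lowmask = (1 << (take * self.B)) - 1
                self.q[-1] = (tw | (word & lowmask) << (tc * self.B), tc + take)
                word >>= take * self.B
                k -= take
                self.m += take
        if k:
            self.q.append((word, k))
            self.m += k

    def pop(self, k):
        """Remove and return (word, count) of the first min(k, M, m)
        elements, possibly spliced from two head short lists."""
        k = min(k, self.M, self.m)
        out, have = 0, 0
        while have < k:
            hw, hc = self.q.popleft()
            take = min(k - have, hc)
            out |= (hw & ((1 << (take * self.B)) - 1)) << (have * self.B)
            if take < hc:           # head only partially consumed
                self.q.appendleft((hw >> (take * self.B), hc - take))
            have += take
        self.m -= k
        return out, k
```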
For a packed list \(\mathbf {L}\) we define a bitmask \( FirstOcc _{\mathbf {L}}\) of first occurrences. Its length is the same as the length of \(\mathbf {L}\) and the ith bit of \( FirstOcc _{\mathbf {L}}\) is set if the value at the ith position in \(\mathbf {L}\) has no earlier occurrence in \(\mathbf {L}\). The bitmask is stored as a packed list over the universe \(U=\{0,1\}\).
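On an unpacked list the semantics of \( FirstOcc _{\mathbf {L}}\) is simply the following (the point of Lemma 5.3 below is to compute it much faster on the packed representation; the function name is ours):

```python
def first_occ_bitmask(values):
    """Bit i is set iff values[i] does not occur among values[:i],
    i.e., iff position i is the first occurrence of its value."""
    seen = set()
    bits = []
    for v in values:
        bits.append(int(v not in seen))
        seen.add(v)
    return bits
```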
Example 5.2
The following lemma is the main result of this section.
Lemma 5.3
After \(\mathcal {O}(2^{2w}w)\)time preprocessing, for any packed list \(\mathbf {L}\), the bitmask \( FirstOcc _{\mathbf {L}}\) can be computed in \(\mathcal {O}\left( 1+\frac{m(\log N)^2}{w}\right) \) time, where \(m= Length (\mathbf {L})\).
Proof
We apply a divide-and-conquer algorithm to solve the problem. We partition \(\mathbf {L}\) into sublists of elements not exceeding p (\(\mathbf {L}_{\le p}\)) and larger than p (\(\mathbf {L}_{>p}\)) for some pivot value p, recurse on \(\mathbf {L}_{\le p}\) and \(\mathbf {L}_{>p}\), and retrieve \( FirstOcc _{\mathbf {L}}\) from \( FirstOcc _{\mathbf {L}_{\le p}}\) and \( FirstOcc _{\mathbf {L}_{>p}}\). This approach is similar to efficient wavelet tree construction for sequences over a small universe; see [3, 22].
For a short list \(\ell \) and an integer \(p\in U\) we define an operation of partitioning \(\ell \) with p as a pivot. We denote \( Partition (\ell ,p)=(\ell _{\le p},\ell _{>p},b_{\ell ,p})\), where \(\ell _{\le p}\) and \(\ell _{>p}\) are both short lists, \(\ell _{\le p}\) is a sublist of \(\ell \) consisting of elements not exceeding p, while \(\ell _{>p}\) is the complement sublist of \(\ell \), and \(b_{\ell ,p}\) is a short bitmask of the same length as \(\ell \) in which zeroes correspond to elements of \(\ell \) not exceeding p and ones to the remaining elements. For the \( Partition (\ell ,p)\) queries we additionally assume that the length of \(\ell \) is given in the input, and the lengths of \(\ell _{\le p}\) and \(\ell _{>p}\) are returned as a part of the output. The answer to a single query can be naively computed in linear time with respect to the length of the list, i.e., in \(\mathcal {O}(M)=\mathcal {O}(w)\) time. Since the number of distinct pairs \((\ell , p)\) does not exceed \(\mathcal {O}(2^{w}\cdot 2^w)=\mathcal {O}(2^{2w})\), the answers to all \( Partition (\ell ,p)\) queries can be precomputed in \(\mathcal {O}(2^{2w}w)\) time.
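A naive reference computation of a single \( Partition (\ell ,p)\) query, i.e., of one entry of the table that the preprocessing above fills for all inputs (function names are ours):

```python
def partition_short(word, m, B, p):
    """Partition(l, p) on a packed short list of m B-bit elements:
    returns (l_le, m_le, l_gt, m_gt, b), where l_le / l_gt pack the
    elements <= p / > p (in order), m_le / m_gt are their lengths, and
    bit j of b is set iff element j is > p."""
    mask = (1 << B) - 1
    lo, hi, b = [], [], 0
    for j in range(m):
        e = (word >> (j * B)) & mask
        if e <= p:
            lo.append(e)
        else:
            hi.append(e)
            b |= 1 << j
    repack = lambda xs: sum(e << (j * B) for j, e in enumerate(xs))
    return repack(lo), len(lo), repack(hi), len(hi), b
```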
Observation 5.4
\(\textsc {LargePartition}(\mathbf {L},p)\) works in \(\mathcal {O}\left( 1+\frac{|\mathbf {L}| \log N}{w}\right) \) time.
Next, we show how to retrieve the desired \( FirstOcc _{\mathbf {L}}\) from \( FirstOcc _{\mathbf {L}_{\le p}}\) and \( FirstOcc _{\mathbf {L}_{> p}}\). Conceptually, \( FirstOcc _{\mathbf {L}}\) can be easily obtained using the auxiliary bitmask \(\mathbf {B}\): it suffices to transform \(\mathbf {B}\) so that the ith 0 in \(\mathbf {B}\) is replaced by the ith bit from \( FirstOcc _{\mathbf {L}_{\le p}}\), while the ith 1 in \(\mathbf {B}\) is replaced by the ith bit from \( FirstOcc _{\mathbf {L}_{> p}}\).
For an efficient implementation we use the following auxiliary operation \( Merge (b,f_0,f_1)\): given a short bitmask b (of length \(m\le w\)) with \(m_1\) set bits, and a pair of bitmasks \(f_0\) of length \(m-m_1\) and \(f_1\) of length \(m_1\), replace zeroes in b with consecutive bits of \(f_0\), and ones in b with consecutive bits of \(f_1\). Note that there are \(\mathcal {O}(2^{2w})\) possible inputs, so the answers can be precomputed in \(\mathcal {O}(2^{2w}w)\) time. We also precompute for each short bitmask b the number of set bits (denoted as \( PopCount (b)\)).
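Again, a naive reference computation of a single \( Merge (b,f_0,f_1)\) table entry (the function name is ours; bit j of a bitmask represents its jth element):

```python
def merge_bitmasks(b, m, f0, f1):
    """Merge(b, f0, f1): build a bitmask of length m in which each 0 of
    b is replaced by the next unused bit of f0 and each 1 of b by the
    next unused bit of f1."""
    out, i0, i1 = 0, 0, 0
    for j in range(m):
        if (b >> j) & 1:
            out |= ((f1 >> i1) & 1) << j
            i1 += 1
        else:
            out |= ((f0 >> i0) & 1) << j
            i0 += 1
    return out
```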
Observation 5.5
\(\textsc {LargeMerge}(\mathbf {L},p)\) works in \(\mathcal {O}\left( 1+\frac{|\mathbf {B}|}{w}\right) \) time.
We conclude the proof with the analysis of the running time. The preprocessing time is \(\mathcal {O}(2^{2w}w)\) as required. If the initial list is short, the procedure clearly runs in \(\mathcal {O}(1)\) time. In the discussion below we ignore this special case and assume the initial length m is greater than M.
A single call to the \(\textsc {LargeFirstOcc}\) procedure, excluding the recursive calls it makes, takes \(\mathcal {O}\left( 1+\frac{|\mathbf {L}|\log N}{w}\right) \) time. The \(\mathcal {O}(1)\) term may dominate only in the leaves of the recursion tree. Therefore, unless the root of the tree is a leaf, we charge the \(\mathcal {O}(1)\) time to the amortized running time of the parent call. The procedure makes at most two recursive calls, so the amortized running time becomes \(\mathcal {O}\left( \frac{|\mathbf {L}|\log N}{w}\right) \). The depth of the recursion is bounded by \(\lceil \log N\rceil \) since the interval \(\{r_b,\ldots ,r_e\}\) is halved at each step. Moreover, the lists in a single level of the recursion tree are of total length m. This gives \(\mathcal {O}\left( \frac{m\log N}{w}\right) \) amortized time per level and \(\mathcal {O}\left( \frac{m(\log N)^2}{w}\right) \) time in total. Accounting for the \(\mathcal {O}(1)\) time in the border case, we get the announced \(\mathcal {O}\left( 1+\frac{m(\log N)^2}{w}\right) \) time complexity. \(\square \)
Corollary 5.6
After \(\mathcal {O}(2^{2w}w)\)time preprocessing, given a packed list \(\mathbf {L}\) of length m we can compute, for each value \(j\in U\), the position of the first occurrence of j in \(\mathbf {L}\) (or \(\infty \) if no such position exists) in \(\mathcal {O}\left( N+\frac{m(\log N)^2}{w}\right) \) time.
Proof
We apply Lemma 5.3 to obtain \(\mathbf {F}= FirstOcc _{\mathbf {L}}\) in \(\mathcal {O}\left( 1+\frac{m(\log N)^2}{w}\right) \) time. Next, we initialize the output array in \(\mathcal {O}(N)\) time to \(\infty \)’s, and iterate through the set bits of \(\mathbf {F}\) in \(\mathcal {O}\left( N+\frac{m}{w}\right) \) time. Whenever we encounter a set bit at position j, we retrieve \(i=\mathbf {L}[j]\) and set the ith position of the output array to j. \(\square \)
6 Reducing Preprocessing Time
In Sect. 3 we presented an index of subquadratic size allowing for sublineartime queries. However, the construction time was quadratic. Here, we slightly improve this parameter.
Recall that the only bottleneck of the simple construction algorithm, developed in Sect. 3.4, is computing the set \(\mathcal {D}_L\) (of Abelian factors generated by heavy Lfactors) and the witness positions \( Pos _{\mathcal {H}_L}(q)\). Our improved solutions process each \(p\in \mathcal {H}_L\) separately, determining \(\mathcal {D}_L(p)\) and \( Pos _p(q)\) for all \(q\in \mathcal {D}_L(p)\).
First, in Theorem 6.1, we actually deal with small values of L using the bit-parallelism of the word-RAM model. Our approach is as follows: We assign each \(q\in Ext ^+_{< L}(p)\) a short integer identifier of \(\mathcal {O}(\sigma \log L)\) bits, and for each \(i\in Occ (p)\) we compute a short list representing those extensions \(q\in Ext ^+_{< L}(p)\) for which \(i\in Occ _p(q)\). Then, we concatenate these short lists into a single packed list, apply Corollary 5.6, and translate the first occurrences of each identifier in the packed list into the leftmost occurrences \( Pos _p(q)\) of each \(q\in Ext ^+_{< L}(p)\). Provided that \(L=\mathcal {O}\left( \frac{\log n}{(\log \log n)^2}\right) \), this gives a factor-L speedup.
For larger L, in Theorem 6.3, we use a two-level procedure, which introduces an auxiliary parameter \(\ell \approx \frac{\log n}{(\log \log n)^2}\) and generates \(\mathcal {D}_L(p)\) in \(\frac{L}{\ell }\) phases, using the same techniques as before to obtain a factor-\(\ell \) speedup.
Theorem 6.1
The index of Theorem 3.8 admits a construction algorithm running in \(\mathcal {O}\left( \frac{n^2}{L} + \frac{n^2(\log L)^2}{\log n}\right) \) time w.h.p.
Proof
First, observe that if \(\log L\ge {\sqrt{\log n}}\) the claimed construction time is quadratic since \(\frac{n^2(\log L)^2}{\log n}=\Omega (n^2)\). Thus, in the following we assume \(\log L < \sqrt{\log n}\), which in particular implies \(\log L=o(\log n)\).
We apply the results of Sect. 5 with \(w=\alpha \log n\) for some constant \(\alpha < \frac{1}{2}\), so that the preprocessing time is \(\mathcal {O}(2^{2w}w)=\mathcal {O}(n^{2\alpha }\log n)=o(n)\).
As described in the proof of Lemma 3.7, all parts of the preprocessing excluding the computation of \(\mathcal {D}_L\) work in \(\mathcal {O}\left( \frac{n^2}{L}\right) \) time. The missing component is constructed using Corollary 5.6.
Let us consider \( Ext ^+_{< L}(\bar{0}_\sigma )\), the set of Parikh vectors e satisfying \(|e|<L\). These Parikh vectors can be interpreted as lists of length \(\sigma \) over \(\{0,\ldots ,L-1\}\), and thus they can be represented as integers within \(U=\{0,\ldots ,N-1\}\) for \(N=L^\sigma \), which shall be the universe for the packed lists. Here, the entries of the Parikh vector correspond to digits in the base-L representation of its identifier.
Note that these integer identifiers fit into a machine word since \(L^\sigma = 2^{\sigma \log L}=2^{o(\log n)}\) while \(w=\Theta (\log n)\). For each position i, \(1\le i \le n\), we construct a packed list \(\mathbf {L}_i\) of length \(\min (L,n+2-i)\) whose jth element (0-based) is the identifier of \(\mathcal {P}(x[i \ldots i+j-1])\). A single list can be constructed in \(\mathcal {O}(L)\) time, which gives \(\mathcal {O}(nL)\) time in total. Each \(\mathbf {L}_i\) is stored in \(\mathcal {O}\left( 1+\frac{L\log N}{w}\right) \) space.
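The incremental construction of the lists \(\mathbf {L}_i\) can be sketched as follows (unpacked for readability, with the function name ours): appending a letter a to a factor increments the ath entry of its Parikh vector, i.e., adds \(L^{a-1}\) to the base-L identifier.

```python
def build_lists(x, L):
    """lists[i][j] = base-L identifier of P(x[i .. i+j-1]) for
    0 <= j < min(L, n+2-i); each row is built incrementally in O(L)
    time by adding L**(a-1) when letter a is appended."""
    n = len(x)
    lists = {}
    for i in range(1, n + 2):
        ident, row = 0, []
        for j in range(min(L, n + 2 - i)):
            row.append(ident)
            if i + j <= n:                  # extend by letter x[i+j]
                ident += L ** (x[i + j - 1] - 1)
        lists[i] = row
    return lists
```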
Now the sets \(\mathcal {D}_L(p)\) are computed separately for each \(p \in \mathcal {H}_L\). Note that along with any \(q\in \mathcal {D}_L(p)\) we need to find the leftmost occurrence \( Pos _p(q)\). Observe that for any \(i\in Occ _p(q)\) we have \(\mathcal {P}(x[i+|p| \ldots i+|p|+r-1])=q-p\), where \(r=|q|-|p|\in \{0,\ldots ,L-1\}\).
We consider all positions \(i\in Occ (p)\), and for each such position i we take the list \(\mathbf {L}_{i+|p|}\). To compute \(\mathcal {D}_L(p)\), it suffices to find all the distinct elements in these lists \(\mathbf {L}_{i+|p|}\) and add p to each of the corresponding Parikh vectors. For this, we concatenate the corresponding lists \(\mathbf {L}_{i+|p|}\) into a single packed list \(\mathbf {L}\) of length not exceeding \(| Occ (p)|\cdot L\), and run the algorithm of Corollary 5.6. For each identifier occurring in \(\mathbf {L}\), we retrieve its first position in \(\mathbf {L}\), which can be translated into the position \( Pos _p(q)\) of the corresponding occurrence of \(q\in \mathcal {D}_L(p)\). The latter is easy if we store \( Occ (p)\) as a sorted array and concatenate the lists \(\mathbf {L}_{i+|p|}\) in this order.
For small L we obtain a corollary stating that the construction time of our data structure is essentially optimal: it matches the space complexity of the index.
Corollary 6.2
If \(L = \mathcal {O}\left( \frac{\log n}{(\log \log n)^2}\right) \), then the index of Theorem 3.8 can be constructed in \(\mathcal {O}\left( \frac{n^2}{L}\right) \) time w.h.p. In particular, the index of Corollary 3.9 can be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \) time w.h.p.
Next, we generalize the algorithm so that it remains efficient for larger L.
Theorem 6.3
If \(L=\Omega \left( \frac{\log n}{(\log \log n)^2}\right) \), then the index of Theorem 3.8 can be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \) time w.h.p. In particular, the index of Corollary 3.10 can be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \) time w.h.p.
Proof
Let \(\ell =\lceil \frac{\log n}{(\log \log n)^2}\rceil \). If \(L < \ell ^{\sigma +1}\), then already Theorem 6.1 gives the desired complexity bound. Thus, in the following we assume that \(L\ge \ell ^{\sigma +1}\).
Again, we shall concentrate on computing sets \(\mathcal {D}_L(p)\) for each \(p\in \mathcal {H}_L\) and the witness occurrences \( Pos _p(q)\) for each \(q\in \mathcal {D}_L(p)\). As in the proof of Theorem 6.1, we assign integer identifiers to short Abelian factors, and precompute lists \(\mathbf {L}_i\) for each position i. This time, however, we perform these operations for \(\ell \) instead of L. Consequently, we set \(N= \ell ^\sigma \) and the lists \(\mathbf {L}_i\) take \(\mathcal {O}(n\ell )\) time to construct.
For each such list of length at most \(\ell ^\sigma \) we naively scan the occurrences and the respective extensions. This takes \(\mathcal {O}(| Occ _p(q)|\cdot \ell ) = \mathcal {O}(\ell ^{\sigma +1})\) time per list, which is at most \(\mathcal {O}(| Ext ^+_{= (k-1)\ell }(p)|\cdot \ell ^{\sigma +1}) = \mathcal {O}(L^{\sigma -1}\ell ^{\sigma +1})\) for the whole phase.
Finally, we observe that the construction of the index of Lemma 4.1 can also be improved. We only need to make sure that the \(\mathcal {O}(nL)\) term (hidden in the construction of Theorem 6.1) or the \(\mathcal {O}(n\ell )\) term (in the proof of Theorem 6.3) does not dominate the running time. For this, it suffices to allow for an additional \(\mathcal {O}(n\log ^{\mathcal {O}(1)}n)\) term in the running time, since we use the approach of Theorem 6.1 only for \(L < \log ^{\sigma +1} n\), and in Theorem 6.3 we have \(\ell <\log ^{\sigma +1}n\).
Lemma 6.4
The index of Lemma 4.1 can be constructed in \(\mathcal {O}\left( n\log ^{\sigma +1}n + \max \left( \frac{nk}{L},\frac{nk(\log \log n)^2}{\log n}\right) \right) \) time w.h.p.
Corollary 6.5
The indexes of Theorems 4.2 and 4.3 can be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \) time w.h.p.
Proof
In the proofs of Theorems 4.2 and 4.3, we use only \(\mathcal {O}(\log n)\) instances of the data structure of Lemma 4.1, so the additional \(\mathcal {O}(n\log ^{\mathcal {O}(1)}n)\) terms are dominated by the claimed construction time. The \(\mathcal {O}\left( \frac{nk}{L}\right) \) terms sum up to the size of the data structure, which is also dominated. Finally, it suffices to note that \(\sum \nolimits _{k\in K} k = \mathcal {O}(n)\), so the \(\mathcal {O}\left( \frac{nk(\log \log n)^2}{\log n}\right) \) terms sum up to \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log n}\right) \), as desired. \(\square \)
7 Conclusions
We presented several versions of an index for jumbled pattern matching in a text over a constant-sized alphabet. The index admits a size versus query time trade-off, which in particular gives a data structure of size \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log {n}}\right) \) with \(\mathcal {O}\left( \left( \frac{\log m}{(\log \log m)^2}\right) ^{2\sigma -1}\right) \) query time, and a solution of size \(\mathcal {O}(n^{2-\delta })\) with \(\mathcal {O}(m^{(2\sigma -1)\delta })\) query time for any \(0 < \delta < 1\). Thus the index is able to provide polylogarithmic query time with subquadratic space, or strongly sublinear query time with strongly subquadratic space. Both versions of the index can be constructed in \(\mathcal {O}\left( \frac{n^2(\log \log n)^2}{\log {n}}\right) \) time with high probability in the word-RAM model. Moreover, the query algorithm computes the leftmost occurrence of the query pattern if it exists.
Recall that for a constant alphabet of size \(\sigma \ge 3\), it is shown in [2] that, under the strong 3SUM-hardness assumption, jumbled indexing requires \(\Omega (n^{2-\epsilon _\sigma })\) preprocessing time or \(\Omega (n^{1-\delta _\sigma })\) query time, where \(\epsilon _\sigma ,\delta _\sigma <1\) are computable constants. This leaves room for improving the construction time of an index, and it does not apply to the space versus query time trade-off.
Acknowledgments
The authors would like to thank several researchers present at the Stringmasters 2013 workshop in Verona for introducing the problem and comments on the preliminary solution: Péter Burcsi, Ferdinando Cicalese, Gabriele Fici, Travis Gagie, Arnaud Lefebvre and Zsuzsanna Lipták. We are especially grateful to Ferdinando Cicalese and Travis Gagie for very valuable remarks. The authors thank anonymous reviewers for numerous comments that helped significantly improve the presentation of the article. Tomasz Kociumaka is supported by Polish budget funds for science in 2013–2017 as a research project under the ‘Diamond Grant’ program, Grant No. 0179/DIA/2013/42. Jakub Radoszewski is supported by the Polish Ministry of Science and Higher Education under the ‘Iuventus Plus’ program in 2015–2016, Grant No. 0392/IP3/2015/73. He also receives financial support of Foundation for Polish Science. Wojciech Rytter is supported by the Polish National Science Center, Grant No. 2014/13/B/ST6/00770.
References
 1. Amir, A., Butman, A., Porat, E.: On the relationship between histogram indexing and block-mass indexing. Philos. Trans. A Math. Phys. Eng. Sci. 372(2016), 20130132 (2014). doi:10.1098/rsta.2013.0132
 2. Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) Automata, Languages, and Programming (ICALP 2014), Part I. LNCS, vol. 8572, pp. 114–125. Springer, Berlin (2014)
 3. Babenko, M., Gawrychowski, P., Kociumaka, T., Starikovskaya, T.: Wavelet trees meet suffix trees. In: Indyk, P. (ed.) 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2015), pp. 572–591. SIAM (2015)
 4. Björklund, A., Kaski, P., Kowalik, Ł.: Constrained multilinear detection and generalized graph motifs. Algorithmica 74(2), 947–967 (2016)
 5. Bremner, D., Chan, T.M., Demaine, E.D., Erickson, J., Hurtado, F., Iacono, J., Langerman, S., Pătraşcu, M., Taslakian, P.: Necklaces, convolutions, and X + Y. Algorithmica 69(2), 294–314 (2014)
 6. Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: On table arrangements, scrabble freaks, and jumbled pattern matching. In: Boldi, P., Gargano, L. (eds.) Fun with Algorithms (FUN 2010). LNCS, vol. 6099, pp. 89–101. Springer, Berlin (2010)
 7. Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci. 23(2), 357–374 (2012)
 8. Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: On approximate jumbled pattern matching in strings. Theory Comput. Syst. 50(1), 35–51 (2012)
 9. Butman, A., Eres, R., Landau, G.M.: Scaled and permuted string matching. Inf. Process. Lett. 92(6), 293–297 (2004)
 10. Chan, T.M., Lewenstein, M.: Clustered integer 3SUM via additive combinatorics. In: Servedio, R.A., Rubinfeld, R. (eds.) 47th Annual ACM Symposium on Theory of Computing (STOC 2015), pp. 31–40. ACM (2015)
 11. Cicalese, F., Fici, G., Lipták, Z.: Searching for jumbled patterns in strings. In: Holub, J., Žďárek, J. (eds.) Prague Stringology Conference 2009. Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague Stringology Club (2009)
 12. Cicalese, F., Gagie, T., Giaquinta, E., Laber, E.S., Lipták, Z., Rizzi, R., Tomescu, A.I.: Indexes for jumbled pattern matching in strings, trees and graphs. In: Kurland, O., Lewenstein, M., Porat, E. (eds.) String Processing and Information Retrieval (SPIRE 2013). LNCS, vol. 8214, pp. 56–63. Springer, Berlin (2013)
 13. Fellows, M.R., Fertin, G., Hermelin, D., Vialette, S.: Upper and lower bounds for finding connected motifs in vertex-colored graphs. J. Comput. Syst. Sci. 77(4), 799–811 (2011)
 14. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)
 15. Gagie, T., Hermelin, D., Landau, G.M., Weimann, O.: Binary jumbled pattern matching on trees and tree-like structures. Algorithmica 73(3), 571–588 (2015)
 16. Hagerup, T.: Sorting and searching on the word RAM. In: Morvan, M., Meinel, C., Krob, D. (eds.) Symposium on Theoretical Aspects of Computer Science (STACS 1998). LNCS, vol. 1373, pp. 366–398. Springer, Berlin (1998)
 17. Hermelin, D., Landau, G.M., Rabinovich, Y., Weimann, O.: Binary jumbled pattern matching via all-pairs shortest paths. arXiv:1401.2065 (2014)
 18. Kociumaka, T., Radoszewski, J., Rytter, W.: Efficient indexes for jumbled pattern matching with constant-sized alphabet. In: Bodlaender, H.L., Italiano, G.F. (eds.) Algorithms (ESA 2013). LNCS, vol. 8125, pp. 625–636. Springer, Berlin (2013)
 19. Lacroix, V., Fernandes, C.G., Sagot, M.: Motif search in graphs: application to metabolic networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 3(4), 360–368 (2006)
 20. Moosa, T.M., Rahman, M.S.: Indexing permutations for binary strings. Inf. Process. Lett. 110(18–19), 795–798 (2010)
 21. Moosa, T.M., Rahman, M.S.: Sub-quadratic time and linear space data structures for permutation matching in binary strings. J. Discret. Algorithms 10, 5–9 (2012)
 22. Munro, J.I., Nekrich, Y., Vitter, J.S.: Fast construction of wavelet trees. In: de Moura, E.S., Crochemore, M. (eds.) String Processing and Information Retrieval (SPIRE 2014). LNCS, vol. 8799, pp. 101–110. Springer, Berlin (2014). Accepted to Theoretical Computer Science, doi:10.1016/j.tcs.2015.11.011
 23. Williams, R.: Faster all-pairs shortest paths via circuit complexity. In: Shmoys, D.B. (ed.) 46th Annual ACM Symposium on Theory of Computing (STOC 2014), pp. 664–673. ACM (2014)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.