Experimental setup
Document collections
We performed extensive experiments with both real and synthetic collections.Footnote 7 Most of our document collections were relatively small, around 100 MB in size, as some of the implementations (Navarro et al. 2014b) use 32-bit libraries. We also used larger versions of some collections, up to 1 GB in size, to see how the collection size affects the results. In general, collection size is more important in top-k document retrieval. Increasing the number of documents generally increases the \({\textsf{df}}/k\) ratio, and thus makes brute-force solutions based on document listing less appealing. In document listing, the size of the documents is more important than collection size, as a large \(occ /{\textsf{df}}\) ratio makes brute-force solutions based on pattern matching less appealing.
The performance of various solutions depends both on the repetitiveness of the collection and the type of the repetitiveness. Hence we used a fair number of real and synthetic collections with different characteristics for our experiments. We describe them next, and summarize their statistics in Table 2.
Table 2 Statistics for document collections (small, medium, and large variants)
A note on collection size The index structures evaluated in this paper should be understood as promising algorithmic ideas. In most implementations, the construction algorithms do not scale up for collections larger than a couple of gigabytes. This is often intentional. In this line of research, being able to easily evaluate variations of the fundamental idea is more important than the speed or memory usage of construction. As a result, many of the construction algorithms build an explicit suffix tree for the collection and store various kinds of additional information in the nodes. Better construction algorithms can be designed once the most promising ideas have been identified. See “Appendix 2” for further discussion on index construction.
Real collections We use various document collections from real-life repetitive scenarios. Some collections come in small, medium, and large variants. Page and Revision are repetitive collections generated from a Finnish-language Wikipedia archive with full version history. There are 60 (small), 190 (medium), or 280 (large) pages with a total of 8834, 31,208, or 65,565 revisions. In Page, all the revisions of a page form a single document, while each revision becomes a separate document in Revision. Enwiki is a non-repetitive collection of 7000, 44,000, or 90,000 pages from a snapshot of the English-language Wikipedia. Influenza is a repetitive collection containing 100,000 or 227,356 sequences from influenza virus genomes (we only have small and large variants). Swissprot is a non-repetitive collection of 143,244 protein sequences used in many document retrieval papers (e.g., Navarro et al. 2014b). As the full collection is only 54 MB, only the small version of Swissprot exists. Wiki is a repetitive collection similar to Revision. It is generated by sampling all revisions of 1% of pages from the English-language versions of Wikibooks, Wikinews, Wikiquote, and Wikivoyage.
Synthetic collections To explore the effect of collection repetitiveness on document retrieval performance in more detail, we generated three types of synthetic collections, using files from the Pizza & Chili corpus.Footnote 8
DNA is similar to Influenza. Each collection has d = 1, 10, 100, or 1000 base documents, 100,000/d variants of each base document, and mutation rate p = 0.001, 0.003, 0.01, 0.03, or 0.1. We take a prefix of length 1000 from the Pizza & Chili DNA file and generate the base documents by mutating the prefix at probability 10p under the same model as in Fig. 5. We then generate the variants in the same way with mutation rate p. Concat and Version are similar to Page and Revision, respectively. We read d = 10, 100, or 1000 base documents of length 10,000 from the Pizza & Chili English file, and generate 10,000/d variants of each base document with mutation rates 0.001, 0.003, 0.01, 0.03, and 0.1, as above. Each variant becomes a separate document in Version, while all variants of the same base document are concatenated into a single document in Concat.
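To make the generation procedure concrete, the following sketch shows one way to produce such a collection. It assumes the simplest point-mutation model, where each position is independently replaced by a uniformly random symbol with the given probability; the actual model of Fig. 5 may differ in its details, so this is only an illustrative approximation.

```cpp
// A minimal sketch of the synthetic collection generator, under the
// assumption that a "mutation" replaces a symbol with a uniformly random
// symbol of the alphabet; the exact model of Fig. 5 may differ.
#include <random>
#include <string>
#include <vector>

std::string mutate(const std::string& source, double rate,
                   const std::string& alphabet, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<size_t> pick(0, alphabet.size() - 1);
    std::string result = source;
    for (char& c : result) {
        if (coin(rng) < rate) { c = alphabet[pick(rng)]; }
    }
    return result;
}

// d base documents are generated from the source prefix with mutation rate
// 10p, and n_variants / d variants of each base document with rate p.
std::vector<std::string> generate_collection(const std::string& prefix,
                                             size_t d, size_t n_variants,
                                             double p,
                                             const std::string& alphabet) {
    std::mt19937 rng(2016);  // arbitrary seed
    std::vector<std::string> documents;
    for (size_t i = 0; i < d; i++) {
        std::string base = mutate(prefix, 10 * p, alphabet, rng);
        for (size_t j = 0; j < n_variants / d; j++) {
            documents.push_back(mutate(base, p, alphabet, rng));
        }
    }
    return documents;
}
```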
Queries
Real collections For Page and Revision, we downloaded a list of Finnish words from the Institute for the Languages of Finland, and chose all words of length ≥5 that occur in the collection. For Enwiki, we used search terms from an MSN query log with stopwords filtered out. We generated 20,000 patterns according to term frequencies, and selected those that occur in the collection. For Influenza, we extracted 100,000 random substrings of length 7, filtered out duplicates, and kept the 1000 patterns with the largest \(occ /{\textsf{df}}\) ratios. For Swissprot, we extracted 200,000 random substrings of length 5, filtered out duplicates, and kept the 10,000 patterns with the largest \(occ /{\textsf{df}}\) ratios. For Wiki, we used the TREC 2006 Terabyte Track efficiency queriesFootnote 9 consisting of 411,394 terms in 100,000 queries.
Synthetic collections We generated the patterns for DNA with a similar process as for Influenza and Swissprot. We extracted 100,000 substrings of length 7, filtered out duplicates, and chose the 1000 with the largest \(occ /{\textsf{df}}\) ratios. For Concat and Version, patterns were generated from the MSN query log in the same way as for Enwiki.
Test environment
We used two separate systems for the experiments. For document listing and document counting, our test environment had two 2.40 GHz quad-core Intel Xeon E5620 processors and 96 GB memory. Only one core was used for the queries. The operating system was Ubuntu 12.04 with Linux kernel 3.2.0. All code was written in C++. We used g++ version 4.6.3 for the document listing experiments and version 4.8.1 for the document counting experiments.
For the top-k retrieval and \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) experiments, we used another system with two 16-core AMD Opteron 6378 processors and 256 GB memory. We used only a single core for the single-term queries and up to 32 cores for the multi-term queries. The operating system was Ubuntu 12.04 with Linux kernel 3.2.0. All code was written in C++ and compiled with g++ version 4.9.2.
We executed the query benchmarks in the following way:
1. Load the RLCSA with the desired sample period for the current collection into memory.
2. Load the query patterns corresponding to the collection into memory and execute \({\textsf{find}}\) queries in the RLCSA. Store the resulting lexicographic ranges [ℓ..r] in vector V.
3. Load the index to be benchmarked into memory.
4. Iterate through vector V once using a single thread and execute the desired query for each range [ℓ..r]. Measure the total wall clock time for executing the queries.
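In simplified C++, the benchmark driver looks roughly as follows; the RLCSA and Index classes below are stand-ins for the actual implementations, not their real interfaces.

```cpp
// A sketch of the benchmark driver for steps 1-4. The RLCSA and Index
// classes are simplified stand-ins; the real class and method names differ.
#include <chrono>
#include <string>
#include <utility>
#include <vector>

struct RLCSA {
    RLCSA(const std::string& /*collection*/, size_t /*sample_period*/) {}
    // Backward search in the real implementation; here a dummy empty range.
    std::pair<size_t, size_t> find(const std::string& /*pattern*/) const {
        return {1, 0};
    }
};

struct Index {
    explicit Index(const std::string& /*file_name*/) {}
    void list(std::pair<size_t, size_t> /*range*/) const {}  // query to benchmark
};

double average_query_time(const std::string& collection, size_t sample_period,
                          const std::vector<std::string>& patterns,
                          const std::string& index_file) {
    RLCSA csa(collection, sample_period);                  // step 1
    std::vector<std::pair<size_t, size_t>> ranges;         // step 2
    for (const std::string& p : patterns) { ranges.push_back(csa.find(p)); }
    Index index(index_file);                               // step 3
    auto start = std::chrono::steady_clock::now();         // step 4
    for (const auto& range : ranges) { index.list(range); }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / patterns.size();
}
```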
We divided the measured time by the number of patterns, and listed the average time per query in milliseconds or microseconds and the size of the index structure in bits per symbol. There were certain exceptions:
- LZ and Grammar do not use a \({\textsf{CSA}}\). With them, we iterated through the vector of patterns as in step 4, once the index and the patterns had been loaded into memory. The average time required to get the range [ℓ..r] in \({\textsf{CSA}}\)-based indexes (4–6 μs, depending on the collection) was negligible compared to the average query times of LZ (at least 170 μs) and Grammar (at least 760 μs).
- We used the existing benchmark code with SURF. The code first loads the index into memory and then iterates through the pattern file by reading one line at a time. To reduce the overhead from reading the patterns, we cached them by using cat > /dev/null. Because SURF queries were based on the pattern instead of the corresponding range [ℓ..r], we executed \({\textsf{find}}\) queries first and subtracted the time used for them from the subsequent top-k queries.
- In our \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) index, we parallelized step 4 using the OpenMP parallel for construct (see the sketch after this list).
- We used the existing benchmark code with Terrier. We cached the queries as with SURF, set trec.querying.outputformat to NullOutputFormat, and set the logging level to off.
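The parallelization mentioned above amounts to distributing the iterations of the query loop over the available threads. A minimal sketch with illustrative names (compile with -fopenmp):

```cpp
// A minimal sketch of parallelizing step 4 with OpenMP: the iterations of
// the query loop are split among the available threads. The query callback
// is a placeholder for the actual multi-term query.
#include <omp.h>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Range = std::pair<size_t, size_t>;

void run_queries(const std::vector<Range>& ranges,
                 const std::function<void(Range)>& query) {
    // OpenMP splits the iterations of the loop among the available threads.
    #pragma omp parallel for
    for (size_t i = 0; i < ranges.size(); i++) {
        query(ranges[i]);   // each query writes only to thread-local state
    }
}
```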
Document listing
We compare our new proposals from Sects. 3.3 and 4.1 to the existing document listing solutions. We also aim to determine when these sophisticated approaches are better than brute-force solutions based on pattern matching.
Indexes
Brute force (Brute) These algorithms simply sort the document identifiers in the range \({\textsf{DA}}[\ell ..r]\) and report each of them once. Brute-D stores \({\textsf{DA}}\) in \(n \lg d\) bits, while Brute-L retrieves the range \({\textsf{SA}}[\ell ..r]\) with the \({\textsf{locate}}\) functionality of the \({\textsf{CSA}}\) and uses bitvector B to convert it to \({\textsf{DA}}[\ell ..r]\).
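A sketch of the core of Brute follows: Brute-D reads the range directly from the stored document array, while Brute-L first extracts it with \({\textsf{locate}}\) and bitvector B (not shown).

```cpp
// The core of brute-force document listing: sort a copy of DA[l..r] and
// report each distinct document identifier once. Brute-L would first
// materialize the range via locate() and bitvector B, which is omitted here.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> brute_list(const std::vector<uint32_t>& da,
                                 size_t l, size_t r) {
    std::vector<uint32_t> range(da.begin() + l, da.begin() + r + 1);
    std::sort(range.begin(), range.end());
    range.erase(std::unique(range.begin(), range.end()), range.end());
    return range;
}
```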
Sadakane (Sada) This family of algorithms is based on the improvements of Sadakane (2007) to the algorithm of Muthukrishnan (2002). Sada-L is the original algorithm, while Sada-D uses an explicit document array \({\textsf{DA}}\) instead of retrieving the document identifiers with \({\textsf{locate}}\).
ILCP (ILCP) This is our proposal in Sect. 3.3. The algorithms are the same as those of Sadakane (2007), but they run on the run-length encoded \({\textsf{ILCP}}\) array. As for Sada, ILCP-L obtains the document identifiers using \({\textsf{locate}}\) on the \({\textsf{CSA}}\), whereas ILCP-D stores array \({\textsf{DA}}\) explicitly.
Wavelet tree (WT) This index stores the document array in a wavelet tree (Sect. 2.2) to efficiently find the distinct elements in \({\textsf{DA}}[\ell ..r]\) (Välimäki and Mäkinen 2007). The best known implementation of this idea (Navarro et al. 2014b) uses plain, entropy-compressed, and grammar-compressed bitvectors in the wavelet tree, depending on the level. Our WT implementation uses a heuristic similar to the original WT-alpha (Navarro et al. 2014b), multiplying the size of the plain bitvector by 0.81 and the size of the entropy-compressed bitvector by 0.9, before choosing the smallest one for each level of the tree. These constants were determined by experimental tuning.
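The per-level choice can be summarized as follows; the size arguments stand for the measured encoded sizes of the three alternatives on one wavelet tree level, and the constants are the tuned ones mentioned above.

```cpp
// A sketch of the per-level encoding heuristic in our WT implementation:
// the measured sizes of the plain and entropy-compressed bitvectors are
// scaled down by the tuned constants (in effect biasing the choice towards
// the faster encodings) before picking the smallest alternative.
enum class Encoding { Plain, Entropy, Grammar };

Encoding choose_encoding(double plain_size, double entropy_size,
                         double grammar_size) {
    double plain = 0.81 * plain_size;
    double entropy = 0.90 * entropy_size;
    if (plain <= entropy && plain <= grammar_size) { return Encoding::Plain; }
    if (entropy <= grammar_size) { return Encoding::Entropy; }
    return Encoding::Grammar;
}
```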
Precomputed document lists (PDL) This is our proposal in Sect. 4.1. Our implementation resorts to Brute-L to handle the short regions that the index does not cover. The variant PDL-BC compresses sets of equal documents using a Web graph compressor (Hernández and Navarro 2014). PDL-RP uses Re-Pair compression (Larsson and Moffat 2000) as implemented by NavarroFootnote 10 and stores the dictionary in plain form. We use block size b = 256 and storing factor β = 16, which have proved to be good general-purpose parameter values.
Grammar-based (Grammar) This index (Claude and Munro 2013) is an adaptation of a grammar-compressed self-index (Claude and Navarro 2012) to document listing. Conceptually similar to PDL, Grammar uses Re-Pair to parse the collection. For each nonterminal symbol in the grammar, it stores the set of identifiers of the documents whose encoding contains the symbol. A second round of Re-Pair is used to compress the sets. Unlike most of the other solutions, Grammar is an independent index and needs no \({\textsf{CSA}}\) to operate.
Lempel-Ziv (LZ) This index (Ferrada and Navarro 2013) is an adaptation of a pattern-matching index based on LZ78 parsing (Navarro 2004) to document listing. Like Grammar, LZ does not need a \({\textsf{CSA}}\).
We implemented Brute, Sada, \({\textsf{ILCP}}\), and the PDL variants ourselvesFootnote 11 and modified existing implementations of WT, Grammar, and LZ for our purposes. We always used the RLCSA (Mäkinen et al. 2010) as the \({\textsf{CSA}}\), as it performs well on repetitive collections. The \({\textsf{locate}}\) support in RLCSA includes optimizations for long query ranges and repetitive collections, which is important for Brute-L and ILCP-L. We used suffix array sample periods 8, 16, 32, 64, 128 for non-repetitive collections and 32, 64, 128, 256, 512 for repetitive ones.
When a document listing solution uses a \({\textsf{CSA}}\), we start the queries from the lexicographic range [ℓ..r] instead of the pattern P. This allows us to see the performance differences between the fastest solutions better. The average time required for obtaining the ranges was 4–6 μs per pattern, depending on the collection, which is negligible compared to the average time used by Grammar (at least 760 μs) and LZ (at least 170 μs).
Results
Real collections Figures 6 and 7 contain the results for document listing with small and large real collections, respectively. For most of the indexes, the time/space trade-off is given by the RLCSA sample period. The trade-off of LZ comes from a parameter specific to that structure involving RMQs (Ferrada and Navarro 2013). Grammar has no trade-off.
Brute-L always uses the least amount of space, but it is also the slowest solution. In collections with many short documents (i.e., all except Page), we have \(occ /{\textsf{df}}< 4\) on the average. The additional effort made by Sada-L and ILCP-L to report each document only once does not pay off, and the space used by the RMQ structure is better spent on increasing the number of suffix array samples for Brute-L. The difference is, however, very noticeable on Page, where the documents are large and there are hundreds of occurrences of the pattern in each document. ILCP-L uses less space than Sada-L when the collection is repetitive and contains many similar documents (i.e., on Revision and Influenza); otherwise Sada-L is slightly smaller.
The two PDL alternatives usually achieve similar performance, but in some cases PDL-BC uses much less space. PDL-BC, in turn, can use significantly more space than Brute-L, Sada-L, and ILCP-L, but is always orders of magnitude faster. The document sets of versioned collections such as Page and Revision are very compressible, making the collections very suitable for PDL. On the other hand, grammar-based compression cannot reduce the size of the stored document sets enough when the collections are non-repetitive. Repetitive but unstructured collections like Influenza represent an interesting special case. When the number of revisions of each base document is much larger than the block size b, each leaf block stores an essentially random subset of the revisions, which cannot be compressed very well.
Among the other indexes, Sada-D and ILCP-D can be significantly faster than PDL-BC, but they also use much more space. Of the non-\({\textsf{CSA}}\)-based indexes, Grammar reaches the Pareto-optimal curve on Revision and Influenza, while being too slow or too large on the other collections. We did not build Grammar for the large version of Page, as it would have taken several months.
In general, we can recommend PDL-BC as a medium-space alternative for document listing. When less space is available, we can use ILCP-L, which offers robust time and space guarantees. If the documents are small, we can even use Brute-L. Further, we can use fast document counting to compare \({\textsf{df}}\) with occ = r − ℓ + 1, and choose between ILCP-L and Brute-L according to the results.
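The last recommendation amounts to a cheap runtime dispatch, sketched below; the cutoff of 4 is only an assumption motivated by the occ/df observation above, not a tuned value.

```cpp
// A sketch of dispatching between Brute-L and ILCP-L at query time, given a
// fast document counting structure. The cutoff is an assumption based on the
// observation that occ/df below roughly 4 favors the brute-force approach.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Lister = std::function<std::vector<uint32_t>(size_t, size_t)>;
using Counter = std::function<size_t(size_t, size_t)>;

std::vector<uint32_t> list_documents(size_t l, size_t r,
                                     const Lister& brute_l,
                                     const Lister& ilcp_l,
                                     const Counter& count_df) {
    size_t occ = r - l + 1;
    size_t df = count_df(l, r);      // fast document counting
    const double cutoff = 4.0;       // assumed threshold, not tuned
    if (occ < cutoff * df) {
        return brute_l(l, r);        // few occurrences per document
    }
    return ilcp_l(l, r);             // many occurrences per document
}
```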
Synthetic collections Figures 8 and 9 show our document listing results with synthetic collections. Due to the large number of collections, the results for a given collection type and number of base documents are combined in a single plot, showing the fastest algorithm for a given amount of space and mutation rate. Solid lines connect measurements that are the fastest for their size, while dashed lines are rough interpolations.
The plots were simplified in two ways. Algorithms providing a marginal and/or inconsistent improvement in speed in a very narrow region (mainly Sada-L and ILCP-L) were left out. When PDL-BC and PDL-RP had a very similar performance, only one of them was chosen for the plot.
On DNA, Grammar was a good solution for small mutation rates, while LZ was good with larger mutation rates. With more space available, PDL-BC became the fastest algorithm. Brute-D and ILCP-D were often slightly faster than PDL, when there was enough space available to store the document array. On Concat and Version, PDL was usually a good mid-range solution, with PDL-RP being usually smaller than PDL-BC. The exceptions were the collections with 10 base documents, where the number of variants (1000) was clearly larger than the block size (256). With no other structure in the collection, PDL was unable to find a good grammar to compress the sets. At the large end of the size scale, algorithms using an explicit document array \({\textsf{DA}}\) were usually the fastest choices.
Top-k retrieval
Indexes
We compare the following top-k retrieval algorithms. Many of them share names with the corresponding document listing structures described in Sect. 7.2.1.
Brute force (Brute) These algorithms correspond to the document listing algorithms Brute-D and Brute-L. To perform top-k retrieval, we not only collect the distinct document identifiers after sorting \({\textsf{DA}}[\ell ..r]\), we also record the number of times each one appears. The k identifiers appearing most frequently are then reported.
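The counting step can be sketched as follows, taking the extracted range DA[ℓ..r] as input:

```cpp
// A sketch of the brute-force top-k step: sort the range DA[l..r], count the
// frequency of each distinct document identifier, and report the k most
// frequent ones.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<uint32_t, size_t>>            // (document, frequency)
brute_topk(std::vector<uint32_t> range, size_t k) { // range is a copy of DA[l..r]
    std::sort(range.begin(), range.end());
    std::vector<std::pair<uint32_t, size_t>> freq;
    for (size_t i = 0; i < range.size(); ) {
        size_t j = i;
        while (j < range.size() && range[j] == range[i]) { j++; }
        freq.emplace_back(range[i], j - i);          // one entry per document
        i = j;
    }
    k = std::min(k, freq.size());
    std::partial_sort(freq.begin(), freq.begin() + k, freq.end(),
        [](const std::pair<uint32_t, size_t>& a,
           const std::pair<uint32_t, size_t>& b) { return a.second > b.second; });
    freq.resize(k);
    return freq;
}
```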
Precomputed document lists (PDL) We use the variant of PDL-RP modified for top-k retrieval, as described in Sect. 4.2. PDL-b denotes PDL with block size b and with document sets for all suffix tree nodes above the leaf blocks, while PDL-b+F is the same with term frequencies. PDL-b-β is PDL with block size b and storing factor β.
Large and fast (SURF) This index (Gog and Navarro 2015b) is based on a conceptual idea by Navarro and Nekrich (2012), and improves upon a previous implementation (Konow and Navarro 2013). It can answer top-k queries quickly if the pattern occurs at least twice in each reported document. If documents with just one occurrence are needed, SURF uses a variant of Sada-L to find them.
We implemented the Brute and PDL variants ourselvesFootnote 12 and used the existing implementation of SURF.Footnote 13 While WT (Navarro et al. 2014b) also supports top-k queries, the 32-bit implementation cannot index the large versions of the document collections used in the experiments. As with document listing, we subtracted the time required for finding the lexicographic ranges [ℓ..r] using a \({\textsf{CSA}}\) from the measured query times. SURF uses a \({\textsf{CSA}}\) from the SDSL library (Gog et al. 2014), while the rest of the indexes use RLCSA.
Results
Figure 10 contains the results for top-k retrieval using the large versions of the real collections. We left Page out of the results, as the number of documents (280) was too low for meaningful top-k queries. For most of the indexes, the time/space trade-off is given by the RLCSA sample period, while the results for SURF are for the three variants presented in the paper.
The three collections proved to be very different. With Revision, the PDL variants were both fast and space-efficient. When storing factor β was not set, the total query times were dominated by rare patterns, for which PDL had to resort to using Brute-L. This also made block size b an important time/space trade-off. When the storing factor was set, the index became smaller and slower and the trade-offs became less significant. SURF was larger and faster than Brute-D with k = 10 but became slow with k = 100.
On Enwiki, the variants of PDL with storing factor β set had a performance similar to Brute-D. SURF was faster with roughly the same space usage. PDL with no storing factor was much larger than the other solutions. However, its time performance became competitive for k = 100, as it was almost unaffected by the number of documents requested.
The third collection, Influenza, was the most surprising of the three. PDL with storing factor β set was between Brute-L and Brute-D in both time and space. We could not build PDL without the storing factor, as the document sets were too large for the Re-Pair compressor. The construction of SURF also failed with this dataset.
Document counting
Indexes
We use two fast document listing algorithms as baseline document counting methods (see Sect. 7.2.1): Brute-D sorts the query range \({\textsf{DA}}[\ell ..r]\) to count the number of distinct document identifiers, and PDL-RP returns the length of the list of documents obtained. Both indexes use the RLCSA with suffix array sample period set to 32 on non-repetitive datasets, and to 128 on repetitive datasets.
We also consider a number of encodings of Sadakane’s document counting structure (see Sect. 5). The following ones encode the bitvector H′ directly in a number of ways:
- Sada uses a plain bitvector representation.
- Sada-RR uses a run-length encoded bitvector as supplied in the RLCSA implementation. It uses δ-codes to represent run lengths and packs them into blocks of 32 bytes of encoded data. Each block stores how many bits and 1s precede it.
- Sada-RS uses a run-length encoded bitvector, represented with a sparse bitmap (Okanohara and Sadakane 2007) marking the beginnings of the 0-runs and another marking the beginnings of the 1-runs (see the sketch after this list).
- Sada-RD uses run-length encoding with δ-codes to represent the lengths. Each block in the bitvector contains the encoding of 128 1-bits, while three sparse bitmaps are used to mark the number of bits, 1-bits, and starting positions of block encodings.
- Sada-Gr uses a grammar-compressed bitvector (Navarro and Ordóñez 2014).
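To illustrate the Sada-RS representation, the sketch below extracts the run structure of a bitvector; in the actual encoding, the two position lists are stored as sparse bitmaps with rank/select support, to which operations on H′ then reduce.

```cpp
// A conceptual sketch of the Sada-RS layout: the bitvector is viewed as a
// sequence of maximal runs, and the starting positions of the 0-runs and the
// 1-runs are collected separately. In the real structure these position
// lists are sparse bitmaps (Okanohara and Sadakane 2007), not plain vectors.
#include <cstddef>
#include <utility>
#include <vector>

std::pair<std::vector<size_t>, std::vector<size_t>>
run_starts(const std::vector<bool>& bv) {
    std::vector<size_t> zero_starts, one_starts;
    for (size_t i = 0; i < bv.size(); i++) {
        if (i == 0 || bv[i] != bv[i - 1]) {          // a new run begins at i
            (bv[i] ? one_starts : zero_starts).push_back(i);
        }
    }
    return {zero_starts, one_starts};
}
```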
The following encodings use filters in addition to bitvector H′:
- Sada-P-G uses Sada for H′ and a gap-encoded bitvector for the filter bitvector F. The gap-encoded bitvector is also provided in the RLCSA implementation. It differs from the run-length encoded bitvector by only encoding runs of 0-bits.
- Sada-P-RR uses Sada for H′ and Sada-RR for F.
- Sada-RR-G uses Sada-RR for H′ and a gap-encoded bitvector for F.
- Sada-RR-RR uses Sada-RR for both H′ and F.
- Sada-S uses sparse bitmaps for both H′ and the sparse filter F_S.
- Sada-S-S is Sada-S with an additional sparse bitmap for the 1-filter F_1.
- Sada-RS-S uses Sada-RS for H′ and a sparse bitmap for F_1.
- Sada-RD-S uses Sada-RD for H′ and a sparse bitmap for F_1.
Finally, ILCP implements the technique described in Sect. 3.4, using the same encoding as in Sada-RS to represent the bitvectors in the wavelet tree.
Our implementations of the above methods can be found online.Footnote 14
Results
Due to the use of 32-bit variables in some of the implementations, we could not build all structures for the large real collections. Hence we used the medium versions of Page, Revision, and Enwiki, the large version of Influenza, and the only version of Swissprot for the benchmarks. We started the queries from precomputed lexicographic ranges [ℓ..r] in order to emphasize the differences between the fastest variants. For the same reason, we also left out of the plots the size of the RLCSA and the possible document retrieval structures. Finally, as it was almost always the fastest method, we scaled the plots to leave out anything much larger than plain Sada. The results can be seen in Fig. 11. Table 5 in “Appendix 1” lists the results in further detail.
On Page, the filtered methods Sada-P-RR and Sada-RR-RR are clearly the best choices, being only slightly larger than the baselines and orders of magnitude faster. Plain Sada is much faster than those, but it takes much more space than all the other indexes. Only Sada-Gr compresses the structure better, but it is almost as slow as the baselines.
On Revision, there were many small encodings with similar performance. Among those, Sada-RS-S is the fastest. Sada-S is somewhat larger and faster. As on Page, plain Sada is even faster, but it takes much more space.
The situation changes on the non-repetitive Enwiki. Only Sada-RD-S, Sada-RS-S, and Sada-Gr can compress the bitvector clearly below 1 bit per symbol, and Sada-Gr is much slower than the other two. At around 1 bit per symbol, Sada-S is again the fastest option. Plain Sada requires twice as much space as Sada-S, but is also twice as fast.
Influenza and Swissprot contain, respectively, RNA and protein sequences, making each individual document quite random. Such collections are easy cases for Sadakane’s method, and many encodings compress the bitvector very well. In both cases, Sada-S was the fastest small encoding. On Influenza, the small encodings fit in CPU cache, making them often faster than plain Sada.
Different compression techniques succeed with different collections, for different reasons, which complicates a simple recommendation for a best option. Plain Sada is always fast, while Sada-S is usually smaller without sacrificing too much performance. When more space-efficient solutions are required, the right choice depends on the type of the collection. Our ILCP-based structure, ILCP, also outperforms Sada in space on most collections, but it is always significantly larger and slower than compressed variants of Sada.
The multi-term \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) index
We implement our multi-term index as follows. We use RLCSA as the \({\textsf{CSA}}\), PDL-256+F for single-term top-k retrieval, and Sada-S for document counting. We could have integrated the document counts into the PDL structure, but a separate counting structure makes the index more flexible. Additionally, encoding the number of redundant documents in each internal node of the suffix tree (Sada) often takes less space than encoding the total number of documents in each node of the sampled suffix tree (PDL). We use the basic \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) scoring scheme.
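For concreteness, the scoring can be sketched as follows: term frequencies come from the PDL-256+F lists and document frequencies from the counting structure. The formula below, tf · log(d / df), is one common form of the basic scheme; the precise variant used in our implementation is not spelled out here, so treat it as an assumption.

```cpp
// A sketch of basic tf-idf scoring for a multi-term query: per-term (document,
// frequency) lists come from PDL-256+F and df(t) from the counting structure.
// The formula tf * log(d / df) is one common form of the basic scheme and is
// an assumption about details not given in the text.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using TermHits = std::vector<std::pair<uint32_t, uint32_t>>;  // (doc, tf) pairs

std::map<uint32_t, double> score_documents(const std::vector<TermHits>& terms,
                                           const std::vector<size_t>& df,
                                           size_t n_documents) {
    std::map<uint32_t, double> scores;
    for (size_t t = 0; t < terms.size(); t++) {
        double idf = std::log(static_cast<double>(n_documents) / df[t]);
        for (const auto& hit : terms[t]) {
            scores[hit.first] += hit.second * idf;   // accumulate tf * idf
        }
    }
    // For conjunctive queries, documents not containing all terms would be
    // dropped; for ranked queries, the k highest-scoring documents are kept.
    return scores;
}
```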
We tested the resulting performance on the 1432 MB Wiki collection. RLCSA took 0.73 bps with sample period 128 (the sample period did not have a significant impact on query performance), PDL-256+F took 3.37 bps, and Sada-S took 0.13 bps, for a total of 4.23 bps (757 MB). Out of the total of 100,000 queries in the query set, there were matches for 31,417 conjunctive queries and 97,774 disjunctive queries.
The results can be seen in Table 3. When using a single query thread, the index can process 136–229 queries per second (around 4–7 ms per query), depending on the query type and the value of k. Disjunctive queries are faster than conjunctive queries, while larger values of k do not increase query times significantly. Note that our ranked disjunctive query algorithm preempts the processing of the lists of the patterns, whereas in the conjunctive ones we are forced to expand the full document lists for all the patterns; this is why the former are faster. The speedup from using 32 threads is around 18x.
Table 3 Ranked multi-term queries on the Wiki collection
Since our multi-term index offers a functionality similar to basic inverted index queries, it seems sensible to compare it to an inverted index designed for natural language texts. For this purpose, we indexed the Wiki collection using Terrier (Macdonald et al. 2012) version 4.1 with the default settings. See Table 4 for a comparison between the two indexes.
Note that the similarity in the functionality is only superficial: our index can find any text substring, whereas the inverted index can only look for indexed words and phrases. Thus our index has an index point per symbol, whereas Terrier has an index point per word (in addition, inverted indexes usually discard words deemed uninteresting, like stopwords). PDL also chooses frequent strings and builds their lists of documents, but since it has many more index points, its posting lists are 200 times longer than those of Terrier, and the number of lists is 300 times larger. Thanks to the compression of its lists, however, PDL uses only 8 times more space than Terrier. On the other hand, both indexes have similar query performance. When logging and output were set to a minimum, Terrier could process 231 top-10 queries and 228 top-100 queries per second under the \({\textsf{tf}}{\textsf{-}}{\textsf{idf}}\) scoring model using a single query thread.
Table 4 Our index (PDL) and an inverted index (Terrier) on the Wiki collection