Application-Oriented Succinct Data Structures for Big Data

A data structure is called succinct if its asymptotic space requirement matches the original data size. The development of succinct data structures is an important factor in dealing with explosively growing big data. Moreover, a wider variety of big data has recently been produced in various fields, and there is a substantial need for the development of more application-specific succinct data structures. In this study, we review recently proposed application-oriented succinct data structures motivated by big data applications in three different fields: privacy-preserving computation in cryptography, genome assembly in bioinformatics, and work space reduction for compressed communications.


Introduction
The amount of data being generated in various fields is increasing rapidly. For example, the amount of DNA data obtained by next-generation sequencers (NGSs) doubles almost every year [19], which is much faster than the pace of Moore's law [27,53]. Hence, more sophisticated algorithms and data structures are highly desired for big data.
A data structure is said to be succinct if its additional space requirement is sublinear, i.e., o(n), where n is the size of target data. In the last two decades, significant work has been done toward the development of succinct versions of various data structures for manipulating big data. In this paper, we introduce recent work on succinct data structures driven by actual big data applications. This work was supported by JST CREST Grant number JPMJCR1402, JSPS KAKENHI Grant numbers 17H01693, and 17K20023JST.

One of the most fundamental tasks for big data is to search for a substring in a text database. Traditionally, we can search for a substring in O(n) time using the Knuth-Morris-Pratt (KMP) algorithm with O(m) working space, where n is the text size and m is the query size [35]. However, when there are multiple online queries, the KMP algorithm requires O(k ⋅ n) time, where k is the number of queries, which may be inefficient for large n. The Aho-Corasick algorithm [1] can find all occ occurrences in O(n + k ⋅ m + occ) time with O(k ⋅ m) working space, but it cannot process queries online. For this purpose, the suffix tree was proposed [22,51,73,75]. It enables O(m log σ + occ)-time queries after building a Θ(n)-size suffix tree in O(n) time, where σ is the alphabet size. However, the suffix tree requires a large space. Later, the suffix array was proposed, which is much smaller than the suffix tree [34,36,46,55]. The FM index, based on the Burrows-Wheeler transform [13], was then proposed [24]. It requires o(n) space in addition to the original text size, which is much smaller than the suffix array. Now, the FM index is known as one of the most famous succinct data structures used in various fields. For example, the FM index is now a necessary component in various large genome projects in molecular biology [40,45].
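As an illustration of how an FM index answers substring queries, the following is a minimal sketch of backward search over the Burrows-Wheeler transform. It is a toy under stated assumptions, not a succinct implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the names build_fm_index and count_occurrences are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM index (C table and occurrence table) for text.

    Assumption: '$' does not occur in text and sorts before every
    other character, so it can serve as the end-of-text sentinel.
    """
    text += "$"
    # Naive suffix array; real implementations use linear-time construction.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix (text[-1] is '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count occurrences of pattern by backward search."""
    C, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count_occurrences("abra", index))  # 2
```

Each pattern character costs two table lookups, so the query time does not depend on the text size once the index is built. A succinct FM index replaces the explicit occurrence table with rank structures that take o(n) additional space.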
There are various indexing problems for various targets other than just texts and many succinct data structures have been proposed for them. For example, the suffix tree of a tree is a data structure for indexing paths on a tree [37,65], for which XBW was proposed as a succinct counterpart [23]. Another type of succinct data structure was proposed for parameterized indexing [26], where we need to index strings of parameterized characters [3,66].
Succinct data structures are not restricted to indexing. The topology of a tree can be represented in a very small space if we use balanced parentheses (BP) representation or level-order unary degree sequence (LOUDS) representation [33]. Various succinct data structures have been proposed for them [6,62] and they can be applied to various data in tree form, e.g., XML data.
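To make the two encodings concrete, the following sketch computes the BP string and the LOUDS bit sequence of a small ordered tree. The dictionary-based tree and the function names are hypothetical choices for illustration, and the LOUDS variant shown is the common one that prepends a super-root encoded as "10". Navigation, which requires rank/select structures on top of these bits, is not shown.

```python
from collections import deque


def balanced_parentheses(children, root):
    """BP: '(' when a node is entered in depth-first order, ')' when left."""
    inner = "".join(balanced_parentheses(children, c)
                    for c in children.get(root, []))
    return "(" + inner + ")"


def louds_bits(children, root):
    """LOUDS: visit nodes in level order and write each degree d as 1^d 0."""
    bits = [1, 0]  # super-root with a single child (the real root)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        bits.extend([1] * len(kids))
        bits.append(0)
        queue.extend(kids)
    return bits


# A 4-node tree: a has children b and c, and b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
print(balanced_parentheses(tree, "a"))  # ((())())
print(louds_bits(tree, "a"))            # [1, 0, 1, 1, 0, 1, 0, 0, 0]
```

For a tree with n nodes, BP uses 2n bits and this LOUDS variant uses 2n + 1 bits, compared with the pointer-based representation that needs a number of bits proportional to n log n.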
Nowadays, there is an increasing demand for succinct data structures in various applications. In this paper, we introduce three examples of application-driven succinct data structure research. The first example is research motivated by privacy-preserving computation in cryptography. Privacy is an important keyword in big data systems, but many current technologies for privacy cannot be easily implemented for big data applications owing to excessively large computation time and/or working space. In Sect. 2, we introduce the research on succinct data structures for oblivious RAMs (ORAMs), which are used for hiding access patterns. Another example is a topic related to the genome assembly problem, which is an important bioinformatics topic in current molecular biology. In Sect. 3, we introduce the research on de Bruijn graphs, which are used for genome assembly. The third example is compressed communication, where the data are compressed and decompressed in real time. In Sect. 4, we introduce research on succinct techniques for online compression.

Succinct Oblivious RAM
Currently, there is more emphasis on privacy while dealing with big data. There are various encryption techniques useful for hiding the content of data, but recent studies have revealed that adversaries might learn various things from access patterns [14,31], implying that there are cases where the encryption of data is not sufficient.
In the literature, two different situations where ORAMs are used are considered. In one situation, we store data on an insecure cloud database (i.e., accessed positions can be viewed by the adversary) and we access data from a secure client PC. Here, we need to hide our access patterns, i.e., any information related to the order of accessed positions, except for the number of accesses. In the other situation, we store data on an insecure RAM and we access data from a secure CPU. Here too, the access patterns on the RAM need to be hidden. In the following, we use the terms database and client instead of cloud database/RAM and client PC/CPU.
An ORAM is a database simulation where we need to read/write data of n blocks of B bits without leaking the access patterns. Here, we assume that data in blocks are probabilistically encrypted, i.e., the adversary cannot know the content of the block and moreover, the adversary cannot know whether two blocks have the same content.
There are several metrics to evaluate the performance of ORAM (Table 1). The most important metric is bandwidth blowup, which is the complexity of the number of actual accesses per one simulated access on the ORAM; for example, Table 1 lists O(log² n) for Onodera et al. [57] and O(log n ⋅ poly(log log n)) for Patel et al. [58]. It should be noted that a lower bound of Ω(log n) for the bandwidth blowup is known [11,28,41]. Until recently, the best known ORAM was Balanced ORAM (B-ORAM) proposed by Kushilevitz et al. [38], whose bandwidth blowup is O(log² n / log log n). It should be noted that the B-ORAM has a large constant hidden in its bandwidth blowup complexity [15]. Recently, PanORAMa proposed by Patel et al. [58] achieved O(log n ⋅ poly(log log n)) bandwidth blowup. More recently, OptORAMa with tight bandwidth blowup was proposed by Asharov et al. [2]. It should be noted that PanORAMa and OptORAMa are complicated algorithms and rather difficult to implement. In contrast, Path ORAM proposed by Stefanov et al. [71] is said to be one of the most practical ORAMs, with a reasonable bandwidth blowup complexity of O(log² n) [15]. In addition, it is easy to implement.
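To give a feel for why a tree-based ORAM such as Path ORAM is considered easy to implement, the following is a heavily simplified sketch of its access procedure. Several points are assumptions made for brevity and are not taken from [71]: the position map is kept entirely on the client instead of being stored recursively, blocks are not encrypted, the stash is unbounded, and the class and attribute names are hypothetical.

```python
import random


class PathORAM:
    """Toy non-recursive Path ORAM over a heap-ordered binary tree."""

    def __init__(self, n_blocks, bucket_size=4, seed=None):
        self.rng = random.Random(seed)
        self.Z = bucket_size
        self.height = max(1, (n_blocks - 1).bit_length())
        self.n_leaves = 1 << self.height
        # Node 1 is the root; leaves are n_leaves .. 2 * n_leaves - 1.
        self.tree = {node: {} for node in range(1, 2 * self.n_leaves)}
        # Client-side state: position map and stash.
        self.pos = {a: self.rng.randrange(self.n_leaves)
                    for a in range(n_blocks)}
        self.stash = {}

    def _path(self, leaf):
        """Nodes from the given leaf up to the root (leaf first)."""
        node, path = self.n_leaves + leaf, []
        while node >= 1:
            path.append(node)
            node //= 2
        return path

    def access(self, addr, new_value=None):
        """Read block addr, or overwrite it if new_value is given."""
        leaf = self.pos[addr]
        # Remap the block to a fresh random leaf before touching the tree.
        self.pos[addr] = self.rng.randrange(self.n_leaves)
        path = self._path(leaf)
        # Read every bucket on the old path into the stash.
        for node in path:
            self.stash.update(self.tree[node])
            self.tree[node] = {}
        if new_value is not None:
            self.stash[addr] = new_value
        value = self.stash.get(addr)
        # Write back greedily, deepest bucket first: a block may go into
        # a bucket only if that bucket lies on the block's assigned path.
        for depth, node in enumerate(path):
            fits = [a for a in self.stash
                    if (self.n_leaves + self.pos[a]) >> depth == node]
            for a in fits[:self.Z]:
                self.tree[node][a] = self.stash.pop(a)
        return value


oram = PathORAM(16, seed=1)
oram.access(3, "secret")
print(oram.access(3))  # secret
```

Every access reads and rewrites exactly one root-to-leaf path of height + 1 buckets, and the leaf is chosen uniformly at random independently of the address, which is what hides the access pattern. In this sketch a single access touches O(log n) buckets; storing the position map recursively in smaller ORAMs is what adds a further logarithmic factor.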
There are two other important metrics, i.e., client storage size and database storage overhead. Client storage size is the data size that a client is allowed to have. Most ORAMs assume a client storage size of at most O(polylog n), though there are practical ORAMs that allow larger theoretical complexities [70].
Because we need to store n simulated blocks of data, we require server storage of at least n blocks (of size B) for any type of ORAMs. Database storage overhead is the additional storage size (in number of blocks) required by ORAM in addition to the unavoidable n blocks.
In any known ORAM in the literature, each block on the database has additional metadata containing its original address of O(log n) bits for verifying its correctness. This implies that all known ORAMs have a database storage overhead of Ω(n ⋅ log n / B) blocks. In other words, we cannot design succinct ORAMs under the assumption that B = O(log n) while using the current metadata strategy. So far, all the ORAMs that have been proposed assume Ω(log n)-size blocks, and many ORAMs assume blocks of size ω(log n) [57,71,74].
An ORAM is said to be succinct if its database storage overhead size is o(n). Unfortunately, most known ORAMs require Ω(n) database storage overhead. Some ORAMs have even larger storage overhead [29,64]. As of today, only two succinct ORAMs are known under the assumption that B = ω(log n). Until very recently, the SR-ORAM, the first ORAM, proposed by Goldreich [28], was the only succinct ORAM. SR-ORAM achieves a database storage overhead of Θ((n ⋅ log n / B) + √n) blocks. However, its bandwidth blowup of O(√n log n) is almost impractical. Recently, the Succinct ORAM, which is a variation of the Path ORAM [71], was proposed by Onodera et al. [57]. It is the first and the only known succinct ORAM with reasonable O(polylog n) bandwidth blowup. Practically, the Succinct ORAM requires only two or three times larger storage overhead, while ordinary tree-based ORAMs like Path ORAM require almost ten times the storage overhead [57].
There are several open problems. One question is whether we can design a succinct ORAM with tighter bandwidth blowup. Existing ORAMs with o(log² n) bandwidth blowup are hash-based, and ordinary hashes by nature require Ω(n) empty slots to keep ORAMs secure and efficient. A succinct hash data structure for a very restricted class of keywords is known [61], but it does not seem to be applicable to ORAMs.
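The effect of the block size on the metadata overhead can be checked with a short calculation. The sketch below assumes that each block carries exactly ceil(log2 n) bits of address metadata; the function name and the example sizes are hypothetical.

```python
def metadata_overhead_blocks(n, block_bits):
    """Storage overhead, in blocks, of per-block address metadata.

    Assumes each of the n blocks stores a ceil(log2 n)-bit address.
    """
    address_bits = (n - 1).bit_length()
    return n * address_bits / block_bits


n = 2 ** 30
# Blocks of 4096 bits: the metadata is a small fraction of n blocks.
print(metadata_overhead_blocks(n, 4096) / n)
# Blocks of only log2(n) = 30 bits: the metadata alone costs n blocks.
print(metadata_overhead_blocks(n, 30) / n)
```

With B = O(log n), the ratio of overhead to n stays bounded below by a constant, so the overhead cannot be o(n). It vanishes asymptotically only when B grows faster than log n.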

The Review of Socionetwork Strategies (2019) 13:227-236
Another question is about the existence of succinct ORAMs with better client storage size, keeping the O(polylog n) bandwidth blowup. It should be noted that the SR-ORAM requires optimal O(1) client storage, but its bandwidth blowup is large.
Another question that arises is whether a succinct ORAM for B = O(log n) can be designed. As already stated above, all known ORAMs, including the above two succinct ORAMs (the SR-ORAM and the Succinct ORAM), maintain O(log n)-bit metadata for each block. We need a totally different approach for manipulating metadata to achieve it.

Succinct de Bruijn Graph
A tremendous amount of DNA data are obtained in genome science these days with the advent of next-generation sequencers (NGSs). Although NGSs can sequence DNA rapidly with low cost, the current NGSs can read only short sequences and cannot read the entire genome. Therefore, we need to estimate the entire genome sequence with some computational algorithms, considering information from its short substrings called reads. This computational task is called genome assembly and many algorithms have been proposed for it [21,56,67].
The de Bruijn graph is a graph data structure used in many recent genome assembly algorithms. The original concept was proposed by de Bruijn for graph theory [12], and Pevzner et al. used a variation of the graph for genome assembly [60].
We can enumerate all k-mers that occur in the target genome with an NGS if we assume an ideal case where the sequenced reads cover the entire genome with no errors (i.e., any k-mer substring of the entire genome appears as a substring of some read, and each read has no errors). Then, a simplified genome assembly problem can be considered as follows.
Ideal k-mer Assembly Problem: Given a set M of k-mers, find the shortest string which contains the k-mers in M but no other k-mers as its substrings.
We can solve this problem by computing an Eulerian path (i.e., a path that uses every edge of the graph exactly once, and hence one of the shortest paths that uses all edges) on the de Bruijn graph of M, in the case where each k-mer in M occurs exactly once in the shortest string [60]. It should be noted that we can compute an Eulerian path on a general graph in time linear in the graph size.
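The reduction to an Eulerian path can be sketched as follows. The code assumes the ideal setting described above, namely that the input is a non-empty set of distinct k-mers whose de Bruijn graph actually has an Eulerian path; it uses Hierholzer's algorithm, and the function name assemble is hypothetical.

```python
from collections import defaultdict


def assemble(kmers):
    """Spell a string from an Eulerian path of the de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to
    its suffix. Assumes an Eulerian path exists.
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one;
    # otherwise the graph has an Eulerian cycle and any node will do.
    start = next((u for u, b in balance.items() if b == 1),
                 next(iter(graph)))
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Consecutive nodes overlap in k-2 characters, so append one each.
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(assemble(kmers))
```

The output has length |M| + k - 1 and contains exactly the k-mers of M. Note that the Eulerian path need not be unique: for this input, both ATGGCGTGCA and ATGCGTGGCA are valid answers, which is one reason real assemblers need additional information to resolve repeats.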

Most recent genome assemblers use the de Bruijn graph for a set of all k-mers on all reads (or those on all screened confident reads) obtained by NGSs. To store a de Bruijn graph naively, we require space of Θ(n ⋅ (k + log n)) bits, where n is the number of edges in the de Bruijn graph, as each edge requires Θ(k) bits for storing its label and Θ(log n) bits for storing a pointer to its next node. This becomes prohibitively large for very large genomes. For example, more than 4.3 TB of memory was required for storing distributed de Bruijn graphs to assemble the 20 Gbp genome of white spruce [7]. In cases of metagenome analyses, we often need to deal with even larger data [30].
Several approaches have been proposed to reduce the memory requirement of de Bruijn graphs. The first compact representation was proposed by Conway et al. [18], where they represented the graph with a compressed bit encoding. They succeeded in storing a de Bruijn graph (k = 27) with 12,292,819,311 edges in 40.8 GB of space, i.e., 28.5 bits per edge, which is a significant improvement over the naive implementation.
Ye et al. proposed another approach that uses a sparse sampled graph instead of the entire de Bruijn graph [77]. Pell et al. [59] proposed another heuristic approach based on the Bloom filter [8]. They utilized the Bloom filter to heuristically represent de Bruijn graphs, where the represented graph is not exactly the same as the original de Bruijn graph. Chikhi et al. improved this Bloom filter-based approach to represent de Bruijn graphs without any errors [16,17]. They succeeded in storing a de Bruijn graph in O(n log k) bits, which is a theoretical improvement over the naive implementation. This strategy is used in the assembler ABySS [32].
A succinct data structure for de Bruijn graphs, called the succinct de Bruijn graph, was proposed by Bowe et al. [10]. They extended the XBW data structure [23] so that it can represent de Bruijn graphs. This data structure can store a de Bruijn graph in just 4n + o(n) bits, which is usually less than 5 bits per edge, and is independent of k. The succinct de Bruijn graph is used in the assembler MegaHIT [43,44]. There are several extensions of the succinct de Bruijn graph, e.g., succinct data structures for variable-order de Bruijn graphs [5,9], dynamic de Bruijn graphs [4], colored de Bruijn graphs [54], and genome graphs [69].
However, this implementation raises some questions. One question is about the existence of distributed succinct representations of de Bruijn graphs. As related work, there is an attempt to store non-succinct de Bruijn graphs in distributed space [68]. Note also that there is an attempt to build the FM index in parallel in shared memory [39].
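The idea of representing a de Bruijn graph heuristically with a Bloom filter can be sketched as follows: the k-mers are inserted into the filter, and the graph is navigated by querying the four possible extensions of a k-mer. This is only an illustration of the principle, not the implementation of [59]; the class, the SHA-256-based hash functions, and the parameter values are assumptions.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a counter.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))


def successors(bloom, kmer):
    """K-mers reported as following kmer; may include false positives."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in bloom]


bloom = BloomFilter(n_bits=1024, n_hashes=3)
for kmer in ["ATG", "TGG", "TGC", "GGC"]:
    bloom.add(kmer)
print(successors(bloom, "ATG"))
```

A Bloom filter never reports an inserted k-mer as absent, so every true edge is found, but it may report k-mers that were never inserted. This is why the represented graph is a supergraph of the original de Bruijn graph rather than an exact copy, and why an error-free representation needs extra information about the false positives.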
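The space figures quoted above can be compared with a short calculation. Two assumptions are made here and are not stated in the source: GB is read as 2^30 bytes, which is the reading under which the reported 28.5 bits per edge is reproduced, and the naive estimate uses 2 bits per nucleotide for the label plus a ceil(log2 n)-bit pointer. The o(n) term of the succinct representation is ignored.

```python
GIB = 2 ** 30  # assumption: "GB" in the quoted figure means 2^30 bytes


def bits_per_edge(total_bytes, n_edges):
    return total_bytes * 8 / n_edges


n_edges = 12_292_819_311
k = 27

# Compressed bit encoding of Conway et al.: 40.8 GB in total.
conway = bits_per_edge(40.8 * GIB, n_edges)

# Naive estimate (assumption): 2k bits of label plus one pointer per edge.
naive = 2 * k + (n_edges - 1).bit_length()

# Succinct de Bruijn graph: 4 bits per edge, lower-order term ignored.
succinct_gib = 4 * n_edges / 8 / GIB

print(f"naive    : about {naive} bits per edge")
print(f"Conway   : about {conway:.1f} bits per edge")
print(f"succinct : about {succinct_gib:.1f} GiB for the same graph")
```

Under these assumptions the same graph would fit in under 6 GiB with the succinct representation, and, unlike the naive estimate, this figure does not change with k.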
A class of graphs (called Wheeler graphs) that can be succinctly indexed by BWT-related techniques is discussed in [25]. Another question is whether we can expand the class of succinctly representable graphs.

Succinct Schemes for Compressed Communications
There is a need to transmit a tremendous amount of data between different sites. We can reduce communications by just compressing data, but we need to pay costs for compressing/decompressing data before/after communication. It becomes a problem in IoT communications where we only have small hardware with very small working space. Hence, we need to design efficient online algorithms for compressing/decompressing data with small restricted working space.
There are various kinds of compression algorithms [63]. Among them, there is a group of compression algorithms that utilize the inference of context-free grammars (CFGs). Famous examples are LZ78 [78] and Re-pair [52]. They are called grammar compression algorithms. It should be noted that most of them infer CFGs heuristically, because the inference of the smallest CFG is known to be NP-hard [42]. Many grammar compression algorithms are known to achieve a very high compression ratio, but unfortunately, many of them require a large working space. In addition, some of these algorithms have large latency and cannot be applied to compressed communication. Thus, we need to design online grammar compression algorithms with less working space to achieve compressed communication for IoT applications.
Re-pair [52] is a fast grammar compression algorithm, but it requires O(N) working space, where N is the input size. Masaki et al. [50] reduced the working space to O(n), where n is the grammar size.
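The principle of Re-pair, repeatedly replacing the most frequent pair of adjacent symbols with a new nonterminal, can be sketched as follows. This naive version recounts all pairs in every round and therefore runs in quadratic time; it does not reproduce the efficient data structures of [52] or the space reduction of [50], and the function names are hypothetical.

```python
from collections import Counter


def repair(text):
    """Naive Re-pair: returns the final sequence and the grammar rules.

    Terminals are 1-character strings; nonterminals are integers.
    """
    seq, rules = list(text), {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so no further rule pays off
        symbol = len(rules)
        rules[symbol] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Derive the string generated by a terminal or nonterminal."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


text = "abracadabra abracadabra"
seq, rules = repair(text)
print(len(text), len(seq), len(rules))
print("".join(expand(s, rules) for s in seq) == text)
```

The whole input must be available before the first pair can be chosen, which illustrates why Re-pair is an offline algorithm and why online alternatives are needed for compressed communication.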
FOLCA [49] is the first O(log n log* n)-approximable online grammar compression algorithm with succinct working space based on edit-sensitive parsing (ESP). It uses only n log(n + σ) + 2n + o(n) bits of working space, while a naive implementation requires 2n log(n + σ) bits, where n is the grammar size and σ is the alphabet size. This was achieved by improving the compression algorithm called LCA [48]. More recently, Takabatake et al. improved it by proposing SOLCA [72], where the same approximability was achieved with optimal O(N log log n) computation time for an input of size N, using n log(n + σ) + o(n log(n + σ)) bits of working space.
Research on implementation of online grammar compression algorithms on small-sized hardware is ongoing. Yamagiwa et al. implemented FPGA-based lossless compression hardware [47,76]. They utilized a compact-memory grammar compression algorithm called LCA-DLT, which is another variation of LCA [48].
There are also open problems. One problem is the existence of online succinct self-indices, which could help with abuse detection on compressed communications. Another problem is the development of compressed broadcasting/uploading algorithms for various network structure settings.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.