1 Introduction

In this work we address the problem of performing an error-tolerant prefix search on a given set of string keys. Let \(\Sigma\) be an alphabet. A string q is a sequence of symbols from \(\Sigma\). We use |q| to denote the length of q, q[i] to denote the i-th symbol of q, starting at 1, and q[i..j] to denote the sub-string of q starting at position i and ending at position j. Let p and q be two distinct strings composed of symbols in \(\Sigma\). We say that p is a prefix of q if \(p{=}q[1..|p|]\). When p is a prefix of q, we say that there is an exact prefix match between p and q.

When searching while allowing errors, we need a metric to measure the distance between two compared strings p and q. Here we adopt the well-known edit distance, where the number of errors, or distance, is given by the minimum number of insertions, removals, or substitutions of symbols needed to transform p into q. The maximum number of errors accepted in a match is a parameter of the search. When the edit distance between p and q equals \(\tau\), we say that p matches q with \(\tau\) errors. If p matches any prefix of a string q with \(\tau\) errors, we say that there is an error-tolerant prefix match with \(\tau\) errors between p and q.

Given the above concepts, we can now state the problem addressed here. Let p be a string and let \(Q{=}\{q_1,\ldots ,q_n\}\) be a set of strings to search for. The error-tolerant prefix search problem addressed here is to find all strings \(q_i \in Q\) such that there is an error-tolerant prefix match between p and \(q_i\) with at most \(\tau\) errors.

There are a variety of practical applications where an error-tolerant prefix search can be useful, such as query autocompletion and spelling correction. Our study is motivated by the application of error-tolerant query autocompletion, which is an essential and ubiquitous feature in the interaction between the user and the input interface of modern search engines (Smith et al., 2017; Wang & Lin, 2020; Tahery & Farzi, 2020; Krishnan et al., 2020; Kang et al., 2021). It can be seen as a specialized instance of the more general query suggestion problem, where suggestions need to be selected instantly based on the prefix queries generated as the user types (Chen et al., 2020). Figure 1a shows an example where the user has typed the prefix query “note”, and the system suggests possible queries that match it.

Error-tolerant query autocompletion can also be seen as a mechanism to help users spell difficult queries or correct typos while writing a prefix query. An example of a search system that allows error-tolerant query autocompletion is shown in Fig. 1b, where the user receives the suggestions “notebook dell”, “notebook samsung”, “notebook gamer”, “notebook acer” and “notebook lenovo”, all of which match the misspelled prefix query “notebok”.

Fig. 1

Examples of query autocompletion using (a) exact prefix search and (b) error-tolerant prefix search

When searching in a system that allows query autocompletion, users may submit prefix queries containing typos, which can result in unsatisfactory or even empty query suggestion results in an exact search system. Because of this, recent works have proposed error-tolerant prefix search algorithms for such applications (Chaudhuri & Kaushik, 2009; Ji et al., 2009; Li et al., 2011; Xiao et al., 2013; Deng et al., 2016; Zhou et al., 2016; Qin et al., 2019). In this application, the searched set of strings can be large. For instance, one of the datasets adopted in our experiments contains more than 23 million strings.

Solutions that adopt indexes based on tries (Fredkin, 1960) are among the most successful ones. Despite their popularity and effectiveness, tries may require more memory space than the searched string set itself. To reduce storage costs while maintaining good performance, Heinz et al. (2002) proposed a data structure called the burst trie. Here we propose and study the use of burst tries to implement error-tolerant prefix search. We show that this approach is a competitive alternative for performing error-tolerant prefix search on large sets of strings, since it reduces the memory used for query processing when compared to full tries, while achieving similar query processing times. Furthermore, the approach can be easily adopted by a large set of trie-based error-tolerant prefix search methods.

We study three different heuristics for bursting containers when creating burst tries. The first heuristic, which we call Minimum Container Depth (MCD), limits the minimum depth of containers in the burst trie, while the second limits the maximum number of elements in each container. The second heuristic was proposed by Heinz et al. (2002) and is referred to here as Maximum Container Keys (MCK). As a third heuristic, we study the combination of MCD and MCK. We present experiments showing that the studied alternatives produce a considerable reduction in the memory used for processing error-tolerant prefix search, while keeping the time performance close to that achieved by the full trie.

As a complementary study, we also investigate alternative ways to build the tries used to perform error-tolerant prefix search, proposing a simple but effective way to organize trie nodes in memory when building the index. Existing algorithms usually insert nodes into the tree one key at a time, a strategy we call DFS building. Here we experiment with another index building strategy, which we call BFS building, where nodes are inserted level by level rather than key by key. It requires the trie keys to be known in advance, since nodes are inserted one level at a time, and the keys need to be sorted before insertion. We argue and experimentally show that this strategy, when applicable, can considerably reduce query processing times. The gain is achieved because the BFS index building strategy favors the breadth-first search (BFS) over trie nodes adopted by many of the previously proposed trie-based error-tolerant prefix search algorithms (Chaudhuri & Kaushik, 2009; Ji et al., 2009; Li et al., 2011; Deng et al., 2016; Zhou et al., 2016). Importantly, this performance improvement is achieved without requiring any changes to the query processing algorithm. In our experiments, using the BFS building strategy when processing prefix queries with \(\tau =3\) was more than twice as fast as using the DFS building strategy.

Our contributions can be summarized as follows:

  • We discuss and evaluate the application of burst tries in error-tolerant prefix search tasks.

  • We investigate the impact of building the trie using a BFS index building strategy as an alternative to the more intuitive DFS strategy for trie nodes allocation. Although it requires the keys to be sorted, we show that BFS, when applicable, can reduce query processing times.

  • We present experiments to verify the impact of using our ideas in practical scenarios applied to two distinct datasets adopted in previous studies and to a real-case dataset extracted from an online search service.

The rest of this paper is organized as follows. Sect. 2 presents the problem background, reviews the related work, and presents some definitions necessary to understand our proposed strategies and methods, focusing on trie-based error-tolerant prefix search algorithms in the literature, including a brief description of the state-of-the-art BEVA algorithm. Sect. 3 presents basic concepts about tries and search operations on them. Sect. 4 discusses how to use burst tries as indexes to perform error-tolerant prefix search. Sect. 5 discusses practical implementation issues, especially the BFS and DFS trie building strategies. Sect. 6 presents experiments performed with the alternative burst trie implementations studied here and compares their performance with representative baseline data structures. Finally, in Sect. 7, we present our conclusions and possible directions for future research.

2 Background and related work

Krishnan et al. (2017) define a set of query autocompletion modes based on how the characters already typed by the user are matched against the dataset of query suggestions. Each mode may result in different completions being presented. The taxonomy includes mode 1, where a prefix match is performed between the characters already typed by the user and each complete query suggestion in the dataset. In mode 2, a prefix match is performed word by word: the system keeps a set of matches for each word already typed by the user, and the final suggestions are selected based on these sets. Mode 2 may, for instance, show only results that match all words already typed, or use a ranking function to select the best results. Mode 3 also parses the query into words, but allows matching the words at any position, so it does not restrict the results to prefix matches.

Matching while allowing errors could be applied to extend all three modes (Krishnan et al., 2017). The choice of a mode is a design decision, since each mode brings both positive and negative aspects to the solution. As an example, Krishnan et al. (2017) describe that mode 3 allows finding a match between the query “gam rone” and the suggestion “game of thrones”. At first glance this seems useful, but it might not be a good match, and the decision depends on the user’s interest. For instance, the Google search engine gives “gamerone” or “gam ronex” as suggestions for the string “gam rone”, and these might be better than “game of thrones”. This discussion shows that all modes might be useful and interesting, and suggests that query autocompletion systems might, for instance, be implemented as a combination of distinct modes.

The discussion presented here about how to implement error-tolerant prefix search using compact trie representations is useful for modes that perform prefix search, especially modes 1 and 2 when allowing errors. We stress that the error-tolerant prefix search is just a small part of a query autocompletion system. This is especially true when processing the search using mode 2, where the prefix search is performed over a smaller set of strings, the vocabulary containing the distinct words found in the dataset of suggestions, and where each word of the vocabulary is associated with an inverted list. In mode 2 the processing of the inverted lists not only may take more space than the vocabulary, but is also more expensive; see the work of Gog et al. (2020) for an example of a query autocompletion system that adopts mode 2.

Regardless of the matching mode adopted, autocompletion systems usually do not show all matches to their users, which raises the need for a ranking to select the top results. The ranking can be computed, for instance, based on features such as the frequencies of suggestions in the documents indexed by the system, click counts on the suggestions, the number of errors in match modes 1 and 2, information about the user typing the query, and so on. Furthermore, when computing the ranking and the top results, the methods could apply pruning strategies to accelerate the computation of results. We discuss neither ranking strategies nor pruning strategies here, leaving them as future work.

Query autocompletion has been frequently studied in the literature. Grabski and Scheffer (2004) studied the query autocompletion problem and proposed a retrieval model to select, among the sentences that might complete the prefix query already typed, the ones to be shown to users. Bast and Weber (2006) (see also Bast et al. (2008)) proposed the Hyb data structure, a method to perform autocompletion in mode 2, processing queries word by word. Bast et al. (2021) show how to achieve autocompletion for SPARQL queries on very large knowledge bases. They do not mention error-tolerant prefix search algorithms in their work, but their system could benefit from the data structures studied here for fast error-tolerant prefix search.

Nandi and Jagadish (2007) also studied the query autocompletion problem, but at the level of multi-word phrases (mode 1) instead of completing single words. They introduced a data structure named FussyTree to select autocompletion phrases for a given prefix, along with the concept of a significant phrase, which is used to demarcate frequent phrase boundaries among the possible suggestions. They did not implement an error-tolerant prefix search.

Besides efforts to improve the efficiency of query autocompletion methods, much attention in the literature has also been devoted to improving the quality of results. Smith et al. (2017) carried out a detailed user study showing the value of query autocompletion in producing shorter sessions and higher retrieval performance. Tahery and Farzi (2020) investigated the impact of customizing features related to time, location, context, and demographics in this application. Kang et al. (2021) studied the problem of generating suggestions for query autocompletion, proposing a framework that employs a subword-level n-gram language model to generate suggestions for prefixes not seen in the past. Cai and de Rijke (2016) proposed a learning-to-rank approach where features derived from homologous queries and semantically related terms are adopted to improve ranking quality. Cai and de Rijke (2016) also presented a detailed survey about query autocompletion in information retrieval.

Turning to the literature on error-tolerant prefix search methods applied to query autocompletion, several approaches (Chaudhuri & Kaushik, 2009; Ji et al., 2009; Li et al., 2011; Deng et al., 2016; Zhou et al., 2016; Qin et al., 2019) adopt tries (Fredkin, 1960), or variations of them, as the search indexing structure. Typically, these methods traverse the trie using breadth-first search (BFS) and produce a list of results for each character typed by a user submitting a prefix query. These methods maintain a set of active nodes, trie nodes that match the current prefix within a given error limit \(\tau\). Several algorithms proposed in the literature use this approach and differ from each other in the strategy used to maintain the set of active nodes.

Chaudhuri and Kaushik (2009) and Ji et al. (2009) proposed trie-based solutions that incrementally maintain a set of active nodes associated with the trie nodes. The methods process the matches using the trie as an automaton, activating or deactivating its nodes while processing the matches. For instance, when applying this method to query autocompletion, each character typed by the user is processed as an input that updates the list of active nodes. After updating the active node list for an already typed prefix, the result can be reported by taking all the leaf nodes reachable from the active nodes in the trie, and the list of active nodes can be used to update the results when a new symbol is added to the prefix query as the user continues to type. While both methods use the same general strategy, Chaudhuri and Kaushik (2009) propose to partition all possible queries of a certain length into a limited number of equivalence classes (via reduction of the alphabet size) and to precompute the resulting active nodes for all these classes. This pre-computation step allows the autocompletion to start quickly and reduces the cost of maintaining the list of active nodes.

The number of active nodes can be extremely high when performing error-tolerant prefix search, which can slow down the search process. Subsequent research on the topic focused on reducing this number without affecting the final set of results. Li et al. (2011) proposed ICPAN, an alternative trie-based method that reduces the number of active nodes maintained by the method of Ji et al. (2009). It reduces memory consumption and query response time by considering only the subset of active nodes whose last characters are neither substituted nor deleted.

In another effort to reduce the cost of computing active nodes, Deng et al. (2016) proposed META, which also supports top-k query matches. They designed a compact tree index to maintain the active nodes, avoiding the redundant computations that occur in previous methods.

Zhou et al. (2016) proposed BEVA, another trie-based method, with an even more efficient evaluation strategy for the active nodes, which speeds up query processing by entirely eliminating ancestor–descendant relationships among active nodes. The key idea is to store the edit vector values of each active node, which allows storing only the minimal set of active nodes required to perform the edit distance computation, the so-called boundary active nodes. BEVA is the algorithm adopted in our study of how to perform efficient error-tolerant prefix search on burst tries, and we thus detail it further in the next section.

Hu et al. (2018) proposed a trie-based method that combines location-aware and error-tolerant query autocompletion. Wang and Lin (2020) extended the ICPAN (Li et al., 2011) method and proposed AutoEL to support error-tolerant location-aware query autocompletion. The error-tolerant feature is enabled by applying the edit distance to evaluate the textual similarity between a given query and the underlying data, while the location-aware feature is achieved by choosing the k-nearest neighbors. Like ICPAN, AutoEL is a trie-based method and can take advantage of the ideas we propose in this paper.

The trie is a fast data structure and a good alternative for building error-tolerant query autocompletion systems, but it is also space-intensive. To reduce storage costs while keeping the good performance of tries, Heinz et al. (2002) proposed a data structure referred to as the burst trie. Burst tries are collections of small data structures, called containers, that are accessed via a conventional trie, called the access trie. Searching involves using the initial characters of a query string to identify a particular container, then using the remainder of the query string to find a record in the container. Heinz et al. (2002) experimented with alternative data structures to store the information in each container and reported that a binary search tree was a competitive alternative.

Several studies in the literature have shown that taking the cache hierarchy into account may largely improve the performance of algorithms that deal with tries and burst tries. Acharya et al. (1999) present cache-efficient algorithms for trie search. They use different data structures (partitioned array, B-tree, hash table, vectors) to represent different nodes in a trie, and they adapt to changes in the fanout at a node by dynamically switching the data structure used to represent it.

Inspired by the success of previous work that explored the cache hierarchy to improve the performance of tries, we include in our contributions a discussion about how to build tries and burst tries in a cache-friendly way designed specifically for error-tolerant prefix search. We discuss the application of burst tries as a possible data structure for processing error-tolerant prefix search. Burst tries were originally developed to provide fast exact dictionary matches; here we discuss alternative burst heuristics and container storage data structures for applying burst tries as indexes for error-tolerant prefix search.

Part of our study focused on finding efficient ways of implementing tries and burst tries. Issues in the efficient implementation of tries have been studied since they were first proposed (Fredkin, 1960). Morrison (1968) proposed the Practical Algorithm To Retrieve Information Coded In Alphanumeric, or Patricia trie. In summary, a Patricia trie is a trie where the symbols are represented in bits, making it a binary tree, and where the nodes represent only the positions where the keys differ from each other. As a result, Patricia tries considerably reduce storage costs, at the price of increasing the computational cost of searching the data structure when compared to a conventional trie.

McCreight (1976) introduced the compact version of the trie that we name here the compact prefix tree (CPT), also known as the prefix tree or compact suffix tree (Clark, 1998). The compact prefix tree reduces the storage requirement of a regular trie by removing degree-one nodes: nodes containing just one child have that child collapsed into them. Edge labels of a compact prefix tree thus represent a sequence of characters, while edge labels in the trie represent a single character. Notice that this change increases the storage cost of each node but, on the other hand, substantially reduces the number of nodes of the compact prefix tree compared to the trie. In this paper, we present experiments comparing implementations of error-tolerant prefix search with both tries and compact prefix trees.

When a trie or any of its variants is used to index the distinct suffixes of the indexed strings, it can be called a suffix tree. These structures usually index all possible suffixes of each indexed string, becoming space-expensive, so compact versions are even more important when creating suffix trees. Manber and Myers (1993) introduced a representation of a suffix tree that stores all the suffixes in an array, referred to as a suffix array; the same data structure was also created in parallel research (Gonnet et al., 1992). It is a sorted array of all the suffixes of a string. Abouelhoda et al. (2004) present a detailed discussion about how to use suffix arrays as a substitute for suffix trees in several applications.

Several works in the literature have discussed how to use suffix arrays for performing error-tolerant string search. The algorithms usually break the search string into consecutive, non-overlapping sub-strings named n-grams. Exact matches between the n-grams of the search string and the indexed suffixes are used to detect matches with errors between the whole searched string and text positions (Navarro et al., 2000, 2005).

Darragh et al. (1993) proposed the Bonsai trie, a trie representation where the nodes are maintained in a compact global structure, a hash table, that stores all the nodes of the trie. This allows a reduction in the space required to store each trie node. Darragh et al. (1993) discuss alternative trie implementations and compare them to their implementation using a global hash table. Here we adopt the idea of creating a global data structure to both reduce storage costs and accelerate access to trie nodes.

Besides compact representations, other efficient implementations of tries are discussed in several application contexts in the literature, including name lookup in networks (Ghasemi et al., 2018; Xie et al., 2017), general database and dictionary search (Bender et al., 2002; Binna et al., 2018) and bioinformatics (Holley et al., 2016), among others. However, we have not found specific related work discussing efficient trie building for optimizing query autocompletion tasks. As we show here, we can considerably speed up the query autocompletion search by taking specific characteristics of this application into account when building the trie.

Other recent work has also proposed compact and efficient trie variations, but none of them addresses error-tolerant prefix search. Belazzougui et al. (2010) and Jansson et al. (2015) presented compact trie variations that produce fast and compact data structures allowing fast exact prefix match in dynamic environments, with special attention to tries adopted in efficient implementations of online Lempel–Ziv text factorization (Ziv & Lempel, 1977).

Kanda et al. (2020) use a technique called path decomposition to construct cache-friendly tries that are compact and fast. Path decomposition compresses the trie by modifying its structure by first choosing a root-to-leaf path in the original trie and then associating this path with a root of a new trie. They describe how to perform an exact string search in their structure, while we are interested in performing error-tolerant prefix search.

3 Tries as indexes for string search

Tries are search trees in which the keys are usually strings over a predefined alphabet \(\Sigma\), where each character of a string is stored as a label on an edge. In a trie, each path from the root to a leaf represents a string. Consider an example dataset containing the strings {“autobus$”, “autonomy$”, “book$”, “auto_off$”, “cat_dog$”, “cattail$”, “cattle$”, “cat_food$”}, with ‘$’ used to indicate the end of a string and ‘_’ representing a blank space, as illustrated in Table 1.

Table 1 Sample dataset

An example of a trie containing these strings can be seen in Fig. 2. The node numbers in the figure are shown for illustrative purposes only and are not part of the structure. The trie starts with a root node which, having no incoming edge, represents the empty string. Each inserted string has a unique path representing it in the trie. For instance, the string “cattle” is represented by the path containing the nodes numbered 3, 6, 9, 13, 21 and 28 in Fig. 2.

Fig. 2

A trie containing the 8 strings of our sample dataset

3.1 Exact prefix search in tries

When searching for an exact prefix match in a trie, we start from the root node and follow the path defined by the searched prefix. We say that a node representing a match at each step of the prefix search is an active node. For instance, when searching for the prefix “ca” using the trie presented in Fig. 2, we start at node 0, an active node that represents a match between the empty prefix and the dataset. We then go to node 3 after processing ‘c’, and to node 6 after processing ‘a’, finishing the search at node 6. We say that node 6 is activated by the prefix query “ca”, and we may keep this information when performing an incremental search, so that when the user adds a new letter to the prefix, the search can be incrementally continued from the active node 6. Notice that the search can be performed incrementally by storing the previous active nodes of user searches, or restarted from the beginning if the previous active node list is not available.
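This procedure can be sketched in a few lines of C++; this is a minimal illustrative sketch, assuming a simple pointer-based node layout with one labeled edge per child (our actual node layout differs, as discussed in Sect. 5):

#include <map>
#include <string>

// Minimal pointer-based trie node: one labeled edge per child.
struct TrieNode {
    std::map<char, TrieNode*> children;
};

// Exact prefix search: follow the path labeled by `prefix` from the root.
// Returns the node activated by the prefix, or nullptr if there is no match.
TrieNode* exact_prefix_search(TrieNode* root, const std::string& prefix) {
    TrieNode* active = root;                 // the root matches the empty prefix
    for (char c : prefix) {
        auto it = active->children.find(c);  // follow the edge labeled c
        if (it == active->children.end()) return nullptr;
        active = it->second;
    }
    return active;                           // node activated by the full prefix
}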

Fetching is the task of processing the list of active nodes to obtain the list of strings that result from the search (Zhou et al., 2016). Fetching traverses the trie to find all the leaves that can be reached from the active nodes. Notice that fetching may be a costly operation in the search process when there are large numbers of matches or active nodes for the prefix query. A sketch is shown below.
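Continuing the sketch above, fetching can be implemented as a depth-first collection of all leaves reachable from an active node; again a minimal sketch, reusing the TrieNode type defined earlier:

#include <vector>

// Collect all strings represented by the leaves below `node`. `path`
// accumulates the symbols on the way down from the root.
void fetch(const TrieNode* node, std::string& path,
           std::vector<std::string>& results) {
    if (node->children.empty()) {        // a leaf: `path` spells a complete key
        results.push_back(path);
        return;
    }
    for (const auto& [symbol, child] : node->children) {
        path.push_back(symbol);
        fetch(child, path, results);
        path.pop_back();
    }
}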

3.2 Error-tolerant prefix search in tries

To better understand recent error-tolerant prefix search methods, we first present a well-known method to compute the edit distance between two strings p and q (of lengths n and m, respectively). As explained in Zhou et al. (2016), this is a dynamic programming algorithm that fills in a matrix M of size \((n + 1) \cdot (m + 1)\). Each cell M[ij] stores the edit distance between the prefixes of lengths i and j of the two strings, respectively. The cell values can be computed in one pass in row-wise or column-wise order based on the following recurrence equation:

$$\begin{aligned} M[i, j] = \min (M[i - 1, j - 1] + \delta (p[i], q[j]),\; M[i - 1, j] + 1,\; M[i, j - 1] + 1), \end{aligned}$$

where \(\delta (x, y) = 0\) if \(x = y\), and 1 otherwise. The boundary conditions are \(M[0, j] = j\) and \(M[i, 0] = i\). In the example shown in Table 2, the distance between the words “mid” and “main” is obtained by taking the value of the last cell of the matrix, \(M[n, m]\). The time complexity of this computation is \(O(n \cdot m)\).

Table 2 Dynamic programming matrix when computing edit distance between “main” and “mid”
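As an illustration, the recurrence above can be implemented with two rolling rows; a minimal C++ sketch (not the implementation used in our experiments):

#include <algorithm>
#include <string>
#include <vector>

// Edit distance by dynamic programming. Rows are indexed by prefixes of p
// and columns by prefixes of q, following the recurrence above; only the
// previous row is needed at any time.
int edit_distance(const std::string& p, const std::string& q) {
    const int n = p.size(), m = q.size();
    std::vector<int> prev(m + 1), curr(m + 1);
    for (int j = 0; j <= m; ++j) prev[j] = j;           // boundary M[0, j] = j
    for (int i = 1; i <= n; ++i) {
        curr[0] = i;                                    // boundary M[i, 0] = i
        for (int j = 1; j <= m; ++j) {
            const int subst = prev[j - 1] + (p[i - 1] == q[j - 1] ? 0 : 1);
            curr[j] = std::min({subst, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[m];                                     // last cell M[n, m]
}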

Ukkonen and Wood (1993) observed that the edit distance can be computed by representing only the elements on the k-diagonals of the matrix, where \(k \in [-\tau , \tau ]\) and \(\tau\) is the maximum edit distance allowed. These values are stored in the so-called edit distance vectors. Given two strings of sizes m and n, the time complexity to compute the edit distance then becomes \(O(\tau \cdot \min (n, m))\).
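A sketch of this banded computation follows: cells outside the diagonal band are never computed, and all values are capped at \(\tau + 1\), so any returned value greater than \(\tau\) simply means “more than \(\tau\) errors”. This is again an illustrative sketch, not our implementation:

#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Banded edit distance: only cells whose diagonal k = j - i lies in
// [-tau, tau] are computed, touching O(tau) cells per row.
int banded_edit_distance(const std::string& p, const std::string& q, int tau) {
    const int n = p.size(), m = q.size(), INF = tau + 1;
    if (std::abs(n - m) > tau) return INF;   // the band cannot reach cell (n, m)
    std::vector<int> prev(m + 1, INF), curr(m + 1, INF);
    for (int j = 0; j <= std::min(m, tau); ++j) prev[j] = j;
    for (int i = 1; i <= n; ++i) {
        curr.assign(m + 1, INF);
        if (i <= tau) curr[0] = i;
        const int lo = std::max(1, i - tau), hi = std::min(m, i + tau);
        for (int j = lo; j <= hi; ++j) {
            int best = prev[j - 1] + (p[i - 1] == q[j - 1] ? 0 : 1);
            best = std::min(best, prev[j] + 1);      // deletion
            best = std::min(best, curr[j - 1] + 1);  // insertion
            curr[j] = std::min(best, INF);           // cap all values at tau + 1
        }
        std::swap(prev, curr);
    }
    return prev[m];
}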

Recently proposed error-tolerant prefix search methods explore this general idea of computing the edit distance. However, as computing the edit distance between each pair of strings individually would be too expensive to provide real-time search results, they usually adopt an index to simultaneously compute the edit distance between the typed prefix query and the whole set of strings in the dataset.

In our study, we adopt BEVA (Zhou et al., 2016), one of such methods, as the basic algorithm for performing error-tolerant prefix search on burst tries.

The algorithm starts with the root node as the only active node, and it remains the only active node until the prefix query p has more than \(\tau\) characters typed by the user (\(|p|>\tau\)). The method then computes and stores the new set of boundary active nodes after each character typed. The current list of boundary active nodes becomes inactive whenever a new character is added to the prefix query p, and a scan of each of their children in the trie is performed to compute their respective edit vector values. Each child is then classified according to its edit vector value as follows:

  • terminal - when the node is inactive and has no chance of activating other nodes.

  • inactive - when the node does not represent a match, but its edit vector value indicates that one of its children has a chance of being active.

  • active - when the node is inserted in the new list of boundary active nodes for the prefix query.

Nodes classified as inactive have their children recursively scanned, repeating the process until either active or terminal nodes are found on all paths derived from them. As matches can be found for paths of lengths from \(|p|-\tau\) to \(|p|+\tau\), the recursive process may continue for up to \(2\tau +1\) levels in the trie. After this computation, the updated list of active nodes can be used both to compute the answer to the currently typed prefix and as the seed to compute the new list of active nodes when the user types the next character.

BEVA also features a way of quickly updating the edit vector values by using a data structure named the edit vector automaton (EVA). EVA precomputes all possible valid values of the edit vectors and all possible transitions between them for each possible input scenario. As a result, it can be used to quickly update the edit vectors of active nodes when traversing the trie to compute error-tolerant query autocompletion results.
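Putting these pieces together, the per-character update can be summarized as follows. This is only a high-level sketch: EvaState, eva_transition, is_active and is_terminal are hypothetical stand-ins for the edit vector automaton, declared here but left abstract, and the real BEVA implementation differs in its details:

#include <utility>
#include <vector>

struct EvaState { /* edit vector value; details omitted in this sketch */ };
struct Node { std::vector<std::pair<char, Node*>> children; };

// Hypothetical EVA interface (see Zhou et al., 2016 for the real design).
EvaState eva_transition(const EvaState& s, char edge_label, char typed);
bool is_active(const EvaState& s);     // node enters the boundary set
bool is_terminal(const EvaState& s);   // node cannot activate descendants

struct ActiveNode { Node* node; EvaState state; };

// Recursively scan the children of a now-inactive node, collecting the new
// boundary active nodes after the user types character `c`.
void scan(Node* node, const EvaState& state, char c,
          std::vector<ActiveNode>& next) {
    for (auto& [label, child] : node->children) {
        EvaState s = eva_transition(state, label, c);
        if (is_active(s)) next.push_back({child, s});
        else if (!is_terminal(s)) scan(child, s, c, next);  // keep descending
    }
}

// One BEVA step: replace the boundary active node list for prefix p + c.
std::vector<ActiveNode> step(const std::vector<ActiveNode>& current, char c) {
    std::vector<ActiveNode> next;
    for (const ActiveNode& a : current) scan(a.node, a.state, c, next);
    return next;
}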

Like the other algorithms that search tries allowing errors, BEVA performs a breadth-first search (BFS) traversal to find the list of nodes representing matches between the searched prefix and the dataset, limiting the results to a given maximum number of errors \(\tau\). The root node is the only node activated in BEVA for prefixes of length smaller than or equal to \(\tau\). When processing symbols at positions greater than \(\tau\), for each symbol processed the algorithm takes the list of current active nodes and checks for descendant nodes that match the prefix extended with the new symbol, creating a new list of active nodes from them. The list of current active nodes is then replaced by the new list. After processing all the symbols of the searched prefix, the final list of active nodes is used to fetch the strings from the dataset that match the query.

To better illustrate how BEVA traverses a trie, consider a search in our sample dataset using the trie presented in Fig. 2, allowing 1 error, for the prefix query “cut”. The set of active nodes obtained when applying BEVA is presented in Table 3. When starting the search, the root node is active for the first symbol ‘c’, since we allow 1 error in our example. When processing the letter ‘u’, the algorithm takes the list of current active nodes, containing only node 0 of Fig. 2, and checks for descendant nodes that match the prefix after processing ‘u’. Notice that after processing ‘u’ the root node is no longer active. Nodes 3 and 4, which represent matches with keys starting with ‘c’ and “au”, are activated, indicating that there is a match between all keys found in their respective subtrees and the prefix query “cu”. When processing the letter ‘t’, the algorithm checks the descendants of nodes 3 and 4 to see which nodes will be activated, obtaining only nodes 7 and 9, which indicate a match between “cut” and all keys starting with “aut”, represented by node 7, and “cat”, represented by node 9. The step-by-step query processing is illustrated in Table 3.

Table 3 Query processing in BEVA method to prefix query “cut” in our sample dataset

Notice that BEVA updates the list of active nodes for each symbol processed from the prefix query, performing a BFS-style traversal of the trie to do so. Other works in the literature have proposed trie-based error-tolerant prefix search methods that traverse the trie in BFS order, including Ji et al. (2009); Li et al. (2011); Deng et al. (2016); Hu et al. (2018); Wang and Lin (2020). The differences among these methods lie in the number of active nodes maintained at each step. BEVA activates the fewest nodes among them, and Zhou et al. (2016) show that the set of active nodes maintained by BEVA at each step is minimal.

4 Error-tolerant prefix search using burst tries

Here we discuss alternatives to implement error-tolerant prefix search with burst tries. The idea is to view each burst trie container as a tree rooted at the node pointing to it in the access trie. This tree representation is virtual: it does not need to be the actual data structure used to store the keys in the containers. With this representation, the search is performed by initially traversing the access trie and continuing the processing in the virtual tree whenever it reaches a container. This simple representation has the advantage that tree-based search methods, such as BEVA, can be easily adapted to run over burst tries, saving space when compared to a full trie representation. On the other hand, the representation contains redundant nodes when compared to a full trie, which can slow down query processing. We discuss this trade-off in the experimental section and show that the proposed strategy leads to competitive methods, with a marginal loss in time performance and a reduction in memory usage when compared to full tries.

4.1 Burst heuristics studied

In our study, we investigate three burst heuristics used when creating the burst tries for error-tolerant prefix search:

minimum container depth (MCD): The first heuristic studied establishes a minimum depth for containers in the burst trie. The idea is that by keeping longer paths in the access trie, we can speed up query processing. The exception, of course, are the keys that contain fewer symbols than the MCD parameter, which are stored in containers at a depth determined by their sizes. Notice that, when used alone, this heuristic produces burst tries with all containers placed at the given minimum depth. When combined with other heuristics, the complementary heuristic can, however, create containers at depths greater than the MCD. Figure 3 shows how the sample dataset presented in Table 1 is represented in a burst trie with the minimum container depth set to 3. Notice that in this case the number of elements in each container is not limited. The MCD value should be tuned to be large enough to allow the search to process a large portion of queries by traversing only access trie nodes, speeding up query processing; on the other hand, it must be small enough to allow a significant reduction in the total amount of memory used by the resulting burst trie. In the experimental section, we investigate this trade-off between memory usage and time performance.

Regarding query processing costs, notice that MCD does not limit the maximum number of strings in a container. Given this unbounded association, the worst case occurs when all strings in the dataset are concentrated in a single container. Consider, for instance, a situation where all keys in the dataset share the same first 30 symbols, a very unusual situation, and the searched prefix also has length 30 and matches all keys. In this case, for any chosen MCD value less than 30, the number of elements in a single container would equal the number of keys in the dataset, and the number of virtual nodes visited in the search would be O(n), where n is the number of keys in the dataset. This is the worst-case scenario for MCD. In practice, the keys inserted are expected to differ from each other, and the number of nodes and virtual nodes processed by BEVA when using MCD tends to be close to the number processed when using the full trie.

Fig. 3

Example of a burst trie with the minimum container depth (MCD) set to 3

Maximum container keys (MCK): The second heuristic studied limits the number of keys in each container; it was originally proposed by Heinz et al. (2002). Here we study how the trie-based algorithms behave when varying the maximum number of keys allowed in each container (MCK). The lower the threshold value, the closer the burst trie is to a full trie, reducing the differences in query processing times between the full trie and the burst trie. On the other hand, a lower MCK also brings the burst trie’s storage costs closer to those required for a full trie. Figure 4 shows an example of a burst trie for the sample dataset with the MCK value set to 3. Notice that the access trie in this case contains 4 nodes.

One of the motivations for adopting the MCK heuristic is to reduce the worst-case computational cost when compared to MCD, since MCK better controls the redundancy added when searching the virtual trees of the containers. Given an MCK value \(\alpha\), and assuming that BEVA activates O(A) nodes when searching for a prefix in the full trie created for a dataset, it would activate at most \(O(A \cdot \alpha )\) nodes in the burst trie built with the MCK heuristic for the same dataset. As the parameter \(\alpha\) can be considered a constant, the asymptotic limit on the number of active nodes remains O(A), the same computational complexity obtained when running BEVA over the full trie.

Fig. 4

Example of a burst trie limiting containers to 3 elements

Combining MCD and MCK (MCD+MCK): The third heuristic combines MCD and MCK to produce a new burst criterion. In this case, containers may only occur at or below a specified minimum depth, and the maximum number of elements in each container is also limited. An example using the two heuristics combined is shown in Fig. 5, and a sketch of the combined burst test appears right after the figure.

Fig. 5

Example of a burst trie with MCD set to 3 and MCK also set to 3
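The decision of whether a container must be burst can be expressed as a single predicate evaluated during insertion; a minimal sketch combining the three heuristics, where `depth` is the container’s depth in the access trie, `num_keys` is its current number of keys, and non-positive parameter values disable the corresponding constraint (hypothetical names):

// Burst predicate combining the MCD and MCK heuristics. Setting mcd <= 0
// disables the depth constraint and mck <= 0 disables the key limit, so the
// same test covers MCD, MCK and MCD+MCK.
bool should_burst(int depth, int num_keys, int mcd, int mck) {
    if (mcd > 0 && depth < mcd) return true;     // MCD: container is too shallow
    if (mck > 0 && num_keys > mck) return true;  // MCK: container holds too many keys
    return false;
}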

4.2 Viewing containers as trees

The crucial observation that allows us to adapt the trie-based error-tolerant prefix search algorithms presented here to burst tries is to see the content of each container C as a tree connected to the burst trie. This view has a root node that connects the container elements to the access trie; the edge between this root node and the access trie is labeled with the same symbol as the edge that connects the container to the access trie. The view also includes a path in the tree for each string of C. Given a string q from C, a node at level 1 of the tree is connected to the root of the container by an edge labeled with q[1]. Similarly, a node at level 2 of the tree is connected to the node at level 1 by an edge labeled with q[2]. Generalizing this idea, each node at level j of our tree view, \(j > 1\), is connected to the node at level \(j - 1\) by an edge labeled with q[j].

We refer to this tree view as a virtual tree, as we do not actually build or store it when building the burst trie. We only use this view when performing error-tolerant prefix search, and its nodes are represented on demand when activated by the search algorithm. For the same reason, we also name the nodes in the proposed view virtual nodes. Notice that if multiple strings in the container start with the same symbol, we view a separate virtual node for each of them from the first equal symbol onwards, adding redundancy to our view. Also, any data structure can be used to store the elements of the container, so the burst trie does not need to be modified to use BEVA and allow error-tolerant prefix search. In the following section we present a detailed discussion about how we implemented our tries and burst tries.
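Since a virtual node corresponds to a position inside one container string, it can be represented without any allocation; a minimal sketch, assuming containers store plain strings (hypothetical names):

#include <cstddef>
#include <string>

// A virtual node is just a (string, position) pair inside a container:
// `pos` symbols of `key` have already been consumed along the path below
// the container root. Nothing is allocated for the virtual tree itself.
struct VirtualNode {
    const std::string* key;  // string stored in the container
    std::size_t pos;         // depth of this node below the container root
};

// Label of the edge leading to this node's (single) child in the view.
inline char next_label(const VirtualNode& v) { return (*v.key)[v.pos]; }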

Figure 6 shows an example that illustrates this view for the burst trie presented in Fig. 4. In the example, only nodes 0 to 3 are real nodes of the access trie. In burst tries, the container root is inserted as a special leaf node that connects the container to the access trie, and this root node represents the entire container in the burst trie.

Looking at Fig. 6, node 4 represents the root node of the container, which stores the suffixes {“uto_off$”, “utobus$”, “utonomy$”}. Node 6 is connected to the root by an edge labeled with ‘u’, the first position of the string “uto_off$”, and node 10 is connected to node 6 by an edge labeled with ‘t’. The same procedure is adopted for each of the remaining letters of “uto_off$”, and also to view the remaining content of the container as a tree.

When processing queries, BEVA maintains and updates a list of so-called active nodes after processing each letter of the searched prefix, implemented as a list of pointers to real trie nodes. When using BEVA with a burst trie, these are pointers to nodes of the access trie until the search reaches the root of a container, which is also an allocated node of the burst trie. For virtual nodes, we instead point to positions of the strings in the container, so no extra space is used to store the virtual tree during the operation.

For example, when processing the prefix “aut” using BEVA with exact match, we start with node 0 activated, that is, BEVA keeps a pointer to node 0 in the list of active nodes. After processing the letter ‘a’, the leftmost container root node is activated and a pointer to node 4 is inserted into the list of active nodes. When processing the letter ‘u’, the first letter in the container, we activate all virtual nodes connected to node 4; this is done by requesting all the strings that start with ‘u’ in the container. We insert pointers to the first position of “uto_off$”, “utobus$” and “utonomy$” into the list of active nodes. These pointers are then used by BEVA to continue processing the query: they can be used to check whether or not these nodes activate their children, since we can obtain the next letter of each string just by adding 1 to each pointer. Thus, the pointers are used to traverse the virtual tree without allocating extra space to represent its nodes.

In summary, the only change is that, instead of pointing only to real trie nodes (memory addresses of trie nodes), BEVA active nodes can now point to real nodes as well as virtual nodes, the latter being positions of elements (strings) in a container.
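This dual nature of active node entries can be captured with a tagged union; a sketch reusing the hypothetical VirtualNode above:

#include <variant>

struct TrieNode;  // a real node of the access trie (defined elsewhere)

// An active node entry points either at a real access trie node or at a
// position inside a container string (a virtual node).
using ActiveRef = std::variant<TrieNode*, VirtualNode>;

inline bool is_virtual(const ActiveRef& r) {
    return std::holds_alternative<VirtualNode>(r);
}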

Fig. 6

Example of a burst trie limiting containers to 3 elements and representing the content in the containers as virtual trees

As another example, now showing how to process queries that allow errors, Table 4 presents the nodes activated in the search for the prefix “cut” allowing 1 error. At the beginning, only node 0 is active. After processing “c”, node 0 is still the only active node. After processing “cu”, nodes 6, 7, 8 and 1 are activated by a search using BEVA, and the results include all strings from the dataset in their subtrees. After adding the letter ‘t’ to form the prefix “cut”, nodes 10, 11, 12 and 3 are activated. This example illustrates how we adapt trie-based algorithms to perform search on burst tries. We can see that the search now activates more nodes, since the virtual nodes add some redundancy to the tree. As we show in the experiments, this redundancy has little effect on the query processing time, while the use of burst tries can significantly reduce the space needed to store the index. We reinforce that in this example the virtual nodes are never allocated and no extra tree is built to represent the elements of the container when processing a prefix query with BEVA.

Table 4 Query processing using the BEVA method with MCD-MCK to prefix query “cut” in a burst trie containing virtual nodes

5 Efficiency issues

In this section we discuss alternatives to efficiently implement the error-tolerant prefix search methods using the tries and trie variations discussed in the previous sections. First, we consider a way to reduce the cost of fetching results when processing queries. The proposed optimization requires the dataset to be static, meaning that no insertions or removals are allowed without a complete index rebuild. Second, we discuss alternative trie building solutions that become available when the dataset is static.

5.1 Reducing fetching costs

An alternative way to reduce the fetching cost is to assume that the string dataset has been previously sorted, which is acceptable for static datasets. In this case, the strings represented by each node are stored in consecutive positions of the dataset, and we can associate with each node the initial and final positions of its elements in the dataset, their occurrence interval (or simply range), avoiding a traversal of the subtree when fetching the results. This idea has already been adopted by other works in the literature dealing with tries in scenarios where the dataset is static; see for instance Pibiri and Venturini (2017), and also the work of Gog et al. (2020), which studies the use of this idea when implementing a query autocompletion system. Figure 7 shows the trie containing range information for our sample dataset.
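With ranges in place, fetching no longer traverses subtrees; a minimal sketch, assuming the dataset is a sorted array of strings and that each node stores a closed interval of dataset positions (hypothetical layout):

#include <string>
#include <vector>

// Node augmented with the interval of sorted dataset positions covered by
// its subtree.
struct RangedNode {
    int begin;  // first dataset position represented below this node
    int end;    // last dataset position represented below this node
};

// Fetching with ranges: instead of traversing subtrees, copy out the range
// of every active node directly from the sorted dataset.
std::vector<std::string> fetch_ranges(
        const std::vector<const RangedNode*>& active,
        const std::vector<std::string>& dataset) {
    std::vector<std::string> results;
    for (const RangedNode* node : active)
        for (int i = node->begin; i <= node->end; ++i)
            results.push_back(dataset[i]);
    return results;
}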

Fig. 7

Burst trie with MCD set to 3 and MCK also set to 3, with the range information for each node and container in the burst trie for our sample dataset

Storing the dataset in sorted lexicographical order restricts the insertion and removal of trie keys and can be prohibitive for some applications. Here we assume that this is not a severe restriction for autocompletion systems, our target application, especially because building the burst trie is a fast process that can be executed periodically. For instance, the index building process takes only about one minute for the datasets adopted in our experiments.

In addition to reducing fetching times, the range information can also make the burst trie representation even more compact: instead of storing the strings in the container, we can represent them by storing only the range of keys in the container, and use this range to directly access the container elements in the dataset. For instance, when reaching the leftmost container in Fig. 7, we find the range \(1{-}3\), which means that the first three strings of the dataset are stored in the container.

5.2 BFS index building

Fig. 8

Memory organization of nodes of distinct levels in a trie when building the index using the DFS and the BFS approaches. For illustrative purposes, nodes were labeled according to their levels in the trie

We can also lay out the trie, or burst trie, nodes in a cache-friendly arrangement by considering the dataset static and pre-sorted in lexicographical order. In this section, we study alternative strategies for inserting keys into the trees used as indexes by error-tolerant prefix search methods to achieve this goal.

Like other tree data structures, tries can be traversed using depth-first search (DFS) or breadth-first search (BFS). Error-tolerant prefix search algorithms that use tries, such as BEVA (Zhou et al., 2016) and ICPAN (Li et al., 2011), perform BFS, as they need to find a new set of answers for each symbol typed by the user. Zhou et al. (2016) discussed an alternative implementation that uses DFS to traverse the trie and speed up situations where a user types a prefix query very fast, e.g., when the prefix query is pasted into the search box, but this is not the most common case. Furthermore, their experiments showed that even in these specific cases, the DFS variant of BEVA was only slightly faster. We therefore assume that BFS is the default search strategy, considering a user typing one character at a time.

On the other hand, for a given dataset, the order in which keys are inserted into the trie determines the physical position of their nodes in memory, which in turn can impact search performance due to caching effects in the memory hierarchy. When building a trie, the most natural way of inserting keys is to create all the nodes necessary to represent a key at the moment that key is inserted. We call this approach DFS index building, because the nodes are inserted in an order that resembles a DFS.

The DFS approach has an important side effect: nodes at the same depth are laid out noncontiguously, which can slow down BFS query processing due to the effects of caching in the memory hierarchy. This phenomenon is illustrated in Fig. 8, where we compare how the trie nodes are created when inserting keys using the DFS approach and using the BFS approach. In this figure, we can see that the most natural way of inserting keys into a trie, which resembles a DFS, tends to spread the nodes of equal depth across the memory used by the data structure. While this behavior is obvious, it is usually not considered a problem, since a search for an exact key in a trie is also performed in DFS order.

However, error-tolerant prefix search algorithms access the trie nodes one level at a time, which means that this access will not be contiguous unless the index building approach also creates the nodes in that order. The distance, in terms of memory allocation, between nodes at the same depth may increase with the number of nodes inserted in the trie. As a result, nodes at the same depth may span different levels of the cache system, and the BFS used for query processing is likely to produce a high rate of cache misses. Algorithms based on the DFS approach are thus likely to create a data structure that is not cache-friendly for error-tolerant prefix search applications, that is, one that does not take advantage of the cache system.

To address this issue, we present an alternative approach to building index trees for error-tolerant prefix search that inserts all keys in parallel. We call this approach BFS index building, and its goal is to reduce the distance, in memory, between nodes at the same depth. In BFS building we start by inserting the first character of all keys in the dataset, then the second character, and so on. As a result, the nodes at each depth become contiguous in memory, creating a more cache-friendly data structure. At the end of the process, the positions of the trie nodes in memory are sorted by their depths, as illustrated in the lowest vector of Fig. 8. We show in the experiments section that this simple procedure has a great impact on the time performance of the query autocompletion task.
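The following is a minimal, illustrative sketch of BFS building over a toy trie whose nodes live in one growing array, so that allocation order equals array order; it is not the exact layout used in our implementation:

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Toy trie keeping all nodes in a single growing array: the array order is
// exactly the allocation order, which BFS building makes depth-sorted.
struct Trie {
    struct Node { std::array<int, 256> child; };
    std::vector<Node> nodes;
    Trie() { nodes.push_back(Node{}); nodes[0].child.fill(-1); }
    int get_or_add_child(int parent, unsigned char label) {
        if (nodes[parent].child[label] < 0) {
            nodes[parent].child[label] = static_cast<int>(nodes.size());
            nodes.push_back(Node{});
            nodes.back().child.fill(-1);
        }
        return nodes[parent].child[label];
    }
};

// BFS index building over a lexicographically sorted key list. frontier[i]
// holds the node already created for keys[i] up to depth d, so pass d
// extends every key by one symbol: all nodes of depth d + 1 are allocated
// contiguously, after every node of depth d.
void build_bfs(Trie& trie, const std::vector<std::string>& keys) {
    std::vector<int> frontier(keys.size(), 0);  // all keys start at the root
    bool grew = true;
    for (std::size_t d = 0; grew; ++d) {
        grew = false;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            if (d >= keys[i].size()) continue;
            frontier[i] = trie.get_or_add_child(
                frontier[i], static_cast<unsigned char>(keys[i][d]));
            grew = true;
        }
    }
}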

6 Experiments

This section presents the experiments we carried out to evaluate the performance of query autocompletion systems using BEVA as the error-tolerant prefix search algorithm and burst tries with the three burst heuristics studied, as well as to compare them with different trie data structures in the context of error-tolerant prefix search. We executed all the experiments on a machine with an Intel Xeon E5-4617 processor (2.90 GHz) and 64 GB of RAM. The machine cache sizes are as follows: L1d of 32 KB, L1i of 32 KB, L2 of 256 KB, and L3 of 15,360 KB. The operating system is Ubuntu 18.04.1 LTS. The algorithms were implemented in C++ and compiled using gcc 7.4.0.

6.1 Experiments setup

6.1.1 Datasets

We present experiments to evaluate the performance of the studied data structures and algorithms using three distinct datasets. Most of the experiments are reported using a query autocompletion suggestion dataset extracted from JusBrasil, a Brazilian law-tech company that provides a vertical search service for its users. We also report results using two synthetic datasets adopted in previous research articles, DBLP and UMBC. The JusBrasil dataset contains 23,374,740 suggestion items and 648,264 logs of prefix queries submitted to the company's autocompletion system; the logs contain the prefixes typed by users before issuing queries to the JusBrasil search engine. The query autocompletion system of JusBrasil receives a query whenever the user types a symbol of the prefix given in the search box. Table 6 presents details of the JusBrasil dataset.

Table 5 presents two examples of prefix queries and suggestions belonging to the JusBrasil and DBLP datasets. The first example for the JusBrasil dataset is “indubio pro”, a prefix in which the words “in” and “dubio” are misspelled as a single word, an error involving a single edit. This is quite a common error in the query autocompletion log of JusBrasil. The prefix matches, with one error, “in dubio pro reu”, one of the suggestions most clicked by users. In the second prefix example, the word “viajem” in “viajem durante” is misspelled by the user, and we show the correct suggestion, which matches the prefix with just 1 error.

Table 5 Examples of prefix queries and suggestions of JusBrasil and DBLP datasets

The suggestions contain the possible complete sentences available for query autocompletion, while the prefix queries contain only the prefixes typed by the users when interacting with the search box. For some queries, the user types only a small prefix of the intended query, receives suggestions from the JusBrasil query autocompletion system, and then clicks on one of the suggested options. For others, the query autocompletion suggestions are not selected by the user for any of the typed prefixes. Notice that the system changes the suggestion set for each new letter typed by the user in the search box. The user may also submit a query directly to the JusBrasil search engine without selecting any of the suggestions, as in other search systems available on the Web. The prefix queries described in Table 6 always contain the longest (that is, the last) prefix typed by the user before selecting a suggestion or submitting a query directly to the search engine. The average prefix size is about 18.8 characters, while the average query suggestion size is 27.0 characters. The suggestions are extracted both from query logs and from the law-tech dataset, including names of people and companies found in the dataset, as well as topics suggested by specialists in the area.

Table 6 Statistics about query suggestions and prefix queries typed by users in the JusBrasil dataset

Table 7 shows the importance of performing an error-tolerant prefix search for query autocompletion in the JusBrasil dataset. It presents the percentage of matches between the query suggestions selected (clicked) by users when typing a prefix query in the JusBrasil system, as we vary the number of errors allowed in the match. This analysis is possible because JusBrasil already allows errors in its query autocompletion system.

As shown in Table 7, a query suggestion system that does not allow errors would include in its result sets only 72.05% of the suggestions clicked by JusBrasil users. This match percentage increases with the number of errors allowed: the increase is large from 0 to 1 error and from 1 to 2 errors, while from 2 to 3 errors the match percentage does not increase as much. We conclude that allowing errors has a large impact on the capacity of the system to suggest correct queries. Of course, even with exact match, the autocompletion system needs to apply a ranking function to select the best results for the users. The ranking functions, the possible features adopted in the ranking, and the pruning strategies that may be adopted for selecting the best prefixes are out of the scope of this research; however, the methods described here form the basis for implementing query autocompletion systems.

Table 7 Hit rate (%) for relevant queries in the JusBrasil dataset as we vary the number of errors allowed in the search

Although some large query logs are publicly available, it is hard to find good public datasets with real query logs for experimenting with query autocompletion applications, containing, for instance, the size of the prefixes typed by users before submitting a query and the actual errors submitted to the system that could be fixed by error-tolerant query autocompletion engines. Studies in the literature that address prefix matching in this scenario usually adopt public datasets that contain no logs of prefix queries and create a set of queries for the experiments. To avoid presenting experiments with only the JusBrasil dataset, the only dataset with real query logs available to us, we also included two datasets adopted in previous work, creating queries with the same strategy described there for comparison purposes.

The datasets chosen are DBLP and UMBC, both previously adopted in articles that studied error-tolerant prefix search methods. We classify them as synthetic, since they contain no query logs and were not extracted from a real-world query autocompletion service. Table 5 presents examples of two prefix queries and suggestions from DBLP, which illustrate how we generate queries for the synthetic datasets. We may remove, add or substitute symbols at any position of the synthetic prefix queries. For instance, the prefix query “infprmation reti” was generated from the query suggestion “information retrieval model for crime investigation”: the character ‘o’ was substituted by the character ‘p’ (in fact, it could be any character) and the last character ‘r’ was deleted from “retri”.
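As an illustration of this generation procedure, the sketch below corrupts a suggestion with a given number of random edit operations. The alphabet and the uniform sampling of positions and operations are our assumptions; the original procedure of Chaudhuri and Kaushik (2009) does not fix these details here.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # assumed symbol set for synthetic errors

def corrupt(suggestion: str, n_errors: int, rng: random.Random) -> str:
    """Apply n_errors random insertions, deletions or substitutions."""
    s = list(suggestion)
    for _ in range(n_errors):
        op = rng.choice(("insert", "delete", "substitute"))
        pos = rng.randrange(len(s))
        if op == "insert":
            s.insert(pos, rng.choice(ALPHABET))
        elif op == "delete" and len(s) > 1:
            del s[pos]
        elif op == "substitute":
            s[pos] = rng.choice(ALPHABET)
    return "".join(s)

# Prefix queries of distinct sizes are then cut from the corrupted string, e.g.:
rng = random.Random(42)
noisy = corrupt("information retrieval model for crime investigation", 2, rng)
prefix_query = noisy[:16]
```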

DBLP contains about 4.3 million computer science publication records; for the experiments, we adopted only the title of each publication. DBLP was adopted in the experiments presented by Chaudhuri and Kaushik (2009); Ji et al. (2009); Li et al. (2011); Xiao et al. (2013); Qin et al. (2019). UMBC,Footnote 7 the UMBC WebBase Corpus, is a collection of English paragraphs with over three billion words, processed from the February 2007 crawl of the Stanford WebBase project. UMBC was adopted in the experiments presented by Zhou et al. (2016); Qin et al. (2019). Table 8 presents detailed statistics about the two datasets. In all cases, we removed duplicated items from the datasets.

Table 8 General statistics about the synthetic datasets adopted in the experiments

While these two synthetic datasets are not real-world query autocompletion collections, we use them in complementary experiments since they were previously adopted in the literature. For each synthetic dataset, we created 1,000 queries following the procedure of Chaudhuri and Kaushik (2009), which extracts 1,000 items from the dataset to serve as the base for the prefix queries and randomly introduces errors into them. This approach yields an experimental environment close to the ones adopted in previous work. For each edit distance threshold tested, we generate a set of queries including the randomly generated errors, and experiments with distinct prefix sizes are performed by extracting prefixes from these queries.

Queries from these datasets were not extracted from a query log, nor do they simulate any particular distribution of errors in a query log. While this methodology does not guarantee a reproduction of user behavior when interacting with a real autocompletion service, the inclusion of these two datasets is important because they have been used in experiments in previous articles.

6.2 Indexing alternatives studied

We have fully implemented the BEVA (Zhou et al., 2016) method, which is among the best algorithms proposed for use with tries.Footnote 8 BEVA was adopted as the search algorithm in the comparison among the studied data structures, including the variations of burst tries, the full trie and the CPT.

We experimented with the data structures studied here in two distinct scenarios. The first one considers that the dataset can be sorted in advance and that no insertions or deletions of keys are performed between index rebuilding tasks. We name this scenario static, and use on it all the optimizations we studied for static index building, including the range representation and the BFS index building described in Sect. 5. We also consider that the dataset is lexicographically sorted, so we can represent items by ranges of positions. The second scenario considers that insertions and deletions of items are allowed, and that the dataset may change before a complete index rebuilding. We name this scenario dynamic. It does not allow the range representation or the BFS index building described in Sect. 5, so the fetching step was implemented as a traversal of the subtrees of the nodes where the matches occur to find all suggestions that match the query. In the dynamic scenario we adopted linked lists to represent the elements in the container.
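The difference between the two scenarios shows up mainly in the fetching step. Below is a minimal sketch of the two strategies under our assumptions about the node layout: in the static scenario each trie node covers a contiguous range of the lexicographically sorted suggestion array, while in the dynamic scenario fetching traverses the subtree and collects keys from the containers it finds.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    begin: int = 0                 # static: first covered position in the sorted array
    end: int = 0                   # static: one past the last covered position
    keys: list = field(default_factory=list)      # dynamic: keys stored in this container
    children: list = field(default_factory=list)  # child nodes

def fetch_static(node: Node, suggestions: list, limit: int = 10_000) -> list:
    # all suggestions under `node` are contiguous in the sorted array: just slice
    return suggestions[node.begin:min(node.end, node.begin + limit)]

def fetch_dynamic(node: Node, limit: int = 10_000) -> list:
    # no ranges available: traverse the whole subtree below the matching node
    results, stack = [], [node]
    while stack and len(results) < limit:
        n = stack.pop()
        results.extend(n.keys)
        stack.extend(n.children)
    return results[:limit]
```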

We experimented with three heuristics for the burst process: the minimum container depth (MCD); the maximum container keys (MCK); and their combination (MCD-MCK). Besides the burst trie methods, we also experimented with BEVA over full tries and compact prefix tries (CPT). The CPT method was included in the experiments as a well-known compact trie representation, also adopted in previous articles that implement error-tolerant prefix search algorithms (Zhou et al., 2016).
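Under our reading of the heuristic names, a container is burst when it sits too shallow in the trie (MCD), when it holds too many keys (MCK), or when either condition holds (MCD-MCK). The sketch below is only meant to fix this intuition; the default thresholds are taken from the parameters later selected for JusBrasil.

```python
def should_burst(container_depth: int, n_keys: int, heuristic: str,
                 mcd: int = 8, mck: int = 120) -> bool:
    """Decide whether a burst trie container should be burst into trie nodes."""
    if heuristic == "MCD":          # containers must not sit above a minimum depth
        return container_depth < mcd
    if heuristic == "MCK":          # containers must not exceed a maximum number of keys
        return n_keys > mck
    if heuristic == "MCD-MCK":      # combination: burst if either condition is violated
        return container_depth < mcd or n_keys > mck
    raise ValueError(f"unknown heuristic: {heuristic}")
```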

Finally, we also included experiments with suffix arrays. We stress that suffix arrays have not been adopted in recent work on error-tolerant prefix search in the context of query autocompletion applications. However, we decided to include them in the experiments for comparison purposes, since they are a data structure applied to approximate string matching problems. As we found no time-efficient way to adapt suffix arrays to BEVA, we adopted another algorithm when using them to perform error-tolerant prefix search: an n-gram approach, as described in Navarro et al. (2000, 2005).

In this approach, the suffix array indexes all positions of all suggestions in the dataset. Given a prefix query p, it is divided into \(\tau + \alpha\) consecutive and non-overlapping sub-strings (the n-grams), \(\tau\) being the number of errors allowed and \(\alpha\) being a parameter to be calibrated according to the application. An exact search for the occurrences of each n-gram of p is performed to find suggestions in the dataset that potentially match p: since \(\tau\) errors can affect at most \(\tau\) of the n-grams, suggestions that match at least \(\alpha\) n-grams of p are considered potential matches. Let \(pos_{(b,q)}\) be the position where a sub-string b starts within a string q, assuming that q contains b. Matches with an n-gram g that occurs in p at position \(pos_{(g,p)}\) are filtered according to the matching position of g in each suggestion \(q_i\): a suggestion \(q_i\) is accepted as a potential match only if g occurs in \(q_i\) at a position \(pos_{(g,q_i)}\) such that \(pos_{(g,p)} -\tau \le pos_{(g,q_i)} \le pos_{(g,p)} +\tau\). Based on preliminary experiments, we chose \(\alpha = 1\), which minimizes the number of n-grams and maximizes their size. The algorithm finishes with a sequential prefix match between p and each potential match to confirm or reject it, as sketched below.
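The following sketch illustrates the filter. For simplicity it indexes fixed-length n-grams in a dictionary instead of an actual suffix array (a suffix array supports the same exact-match lookups without fixing the gram length); surviving candidates would still be verified with a sequential error-tolerant prefix match such as the one shown earlier.

```python
from collections import defaultdict

def candidates(p: str, suggestions: list, tau: int, alpha: int = 1) -> list:
    k = tau + alpha
    n = max(1, len(p) // k)                     # n-gram size; trailing symbols of p are ignored
    grams = [(p[i * n:(i + 1) * n], i * n) for i in range(k)]
    index = defaultdict(list)                   # gram -> [(suggestion id, position)]
    for sid, q in enumerate(suggestions):
        for j in range(len(q) - n + 1):
            index[q[j:j + n]].append((sid, j))
    hits = defaultdict(set)                     # suggestion id -> indices of matched grams
    for gi, (g, pos_p) in enumerate(grams):
        for sid, pos_q in index.get(g, ()):
            if pos_p - tau <= pos_q <= pos_p + tau:   # positional filter
                hits[sid].add(gi)
    # tau errors can destroy at most tau of the k grams, so alpha must survive
    return [sid for sid, gs in hits.items() if len(gs) >= alpha]
```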

6.2.1 Evaluation of the trie building optimizations

We performed experiments to compare the impact of the optimizations proposed for the static scenario, measuring the time performance achieved with and without the proposed trie building optimizations. In this section we report experiments with the full trie and only the JusBrasil dataset, since the conclusions were similar when comparing versions of CPT and burst tries with and without the proposed optimizations.

Table 9 shows the results of the experiments. We report the processing time and the fetching time separately. The processing time is the time taken to find the set of active nodes. The fetching time is the time taken to get the query suggestions from the set of active nodes. In the remaining experiments of this article we will present the time for processing queries without separating the fetching times.

The fetching times reported in these experiments consider that the algorithm fetches only up to 10,000 results. We separated the fetching time to better illustrate the advantage of the range optimization, which considerably reduces fetching times, especially for queries allowing more errors. For instance, when \(\tau =3\), the fetching time for range+DFS was only 0.015 milliseconds, while the fetching time of the dynamic version was 0.31 milliseconds, more than 20 times slower.

Notice that the trie building strategy considerably affects the performance of the prefix match. In the case of the range optimization, the gain is restricted to a reduction in fetching times. The gain from adding the BFS optimization is a natural consequence of a better use of the memory hierarchy.

Both range+DFS and range+BFS were developed for the static scenario. BFS organizes the nodes of the same depth contiguously, in the same order in which BEVA traverses them. The gain from this optimization increases with the edit distance threshold \(\tau\), since the number of nodes to be traversed at each depth level also increases with \(\tau\), favoring the BFS trie building. Achieving better performance for higher values of \(\tau\) is important because these are the most expensive queries for autocompletion.
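A minimal sketch of the BFS layout idea, assuming a pointer-based trie is serialized into a flat array: nodes of each depth are written contiguously, in the left-to-right order in which BEVA expands them.

```python
from collections import deque

def bfs_layout(root) -> list:
    """Assign flat-array positions to trie nodes in breadth-first order."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        node.pos = len(order)          # position in the serialized array
        order.append(node)
        queue.extend(node.children)    # an entire depth level ends up contiguous
    return order                       # serialize the nodes in this order
```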

We emphasize that the algorithm for processing queries is exactly the same when using all the experimented versions.

Table 9 Average prefix query processing and fetching times (ms) per query in the JusBrasil dataset when using BEVA, varying the full trie index version
Table 10 Average fetching times (ms) per item retrieved in the JusBrasil dataset when using BEVA, varying the full trie index version

The fetching times presented in Table 9 are averages per query. The reader may also be interested in the fetching performance per item fetched, which is presented in Table 10. The relative performance of the methods is almost the same as when reporting the average fetching time per query, since the number of fetched items does not change when switching from one method to another.

To better understand the difference in performance between the BFS and DFS index building strategies, we investigated the hypothesis that BFS makes better use of the cache memory system, and the results confirmed it. To illustrate this issue, Table 11 presents the number of cache misses when processing queries in the JusBrasil dataset for \(\tau =3\). Data were obtained with the Cachegrind program,Footnote 9 which simulates a machine with independent first-level instruction and data caches and a unified second-level cache. As the instruction cache misses were virtually the same for the range+BFS and range+DFS strategies, we present only the data cache results (D1 and DL). The simulated D1 cache has 32 KB, 64-byte blocks and 8-way associativity; the simulated DL cache has 15,360 KB, 64-byte blocks and 30-way associativity.

Comparing the results using tries, we observe a considerable reduction in the average number of cache misses: 24.5% at D1 and 28.5% at DL. While the gain of range+BFS over range+DFS depends on the machine's hardware cache strategy and configuration, the experiments illustrate the potential benefits of the BFS strategy for building tries and burst tries.

Table 11 Average cache misses per prefix query in the JusBrasil dataset for \(\tau {=}3\)

We performed similar experiments to assess the impact of the optimizations on the CPT and burst trie performance. As the conclusions were similar to those achieved with the full trie, with the range+BFS version being the fastest, we do not report these comparisons. Given the performance gains yielded by the range+BFS optimizations, we adopt this index building strategy in the remaining experiments for full tries, burst tries and CPTs.

6.2.2 Burst trie parameters selection

We start by discussing the parameter selection for the burst trie variations MCD, MCK and MCD-MCK. We used a separate set of 200 queries to study the effects of parameter variation on the performance of the trie building strategies and to analyze the relationship between time and memory usage. The parameter selection is presented just for the JusBrasil dataset. We performed the same procedure for the other two collections, and the conclusions about the relative performance of the methods are essentially the same; we show results only for JusBrasil to avoid redundant information, and also because it is the only real-world query autocompletion dataset adopted in this study.

We studied MCD values varying from 6 to 10: for values smaller than 6 the processing time was too high, and for values higher than 10 the memory usage was too high, so we do not plot them. We also studied the MCK parameter, varying it from 10 to 200. For MCD-MCK we experimented with the same ranges of values used for MCD and MCK.

Results are presented in Fig. 9. We report the variation in time for processing queries and memory usage as we vary the parameters.

Fig. 9
figure 9

Trade-off between time performance (ms) and memory usage (MiB) when processing queries with BEVA in JusBrasil dataset with data structures MCD with values varying from 6 to 10, MCK with values varying from 10 to 200 and MCD-MCK with MCD values 6, 8 and 10 and MCK varying also from 10 to 200

Based on the results, we selected the parameters for each heuristic and dataset to be adopted in the remaining experiments. In all cases, the parameters were chosen to provide a good trade-off between memory requirement and time performance, taking the points closest to the origin on both axes. Other criteria or methodologies could be used to select the parameters; for instance, a test in a production system could lead designers to tune for maximum performance. Furthermore, the results indicate that this selection would not be a challenge in practical applications. The parameters selected for the JusBrasil dataset are presented in Table 12 and marked in bold in Fig. 9.
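One plausible formalization of “closest to the origin on both axes” is to normalize each axis by its maximum and take the setting with the smallest Euclidean distance to the origin; the tuple layout below is our assumption.

```python
def pick_tradeoff(points: list):
    """points: (time_ms, memory_mib, params) tuples measured on the tuning queries."""
    t_max = max(t for t, _, _ in points)
    m_max = max(m for _, m, _ in points)
    # squared normalized distance to the origin of the time x memory plane
    best = min(points, key=lambda p: (p[0] / t_max) ** 2 + (p[1] / m_max) ** 2)
    return best[2]
```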

Table 12 Selected parameters for BEVA using MCD, MCK and MCD-MCK heuristics in JusBrasil dataset

Table 13 presents the time performance and memory consumption of each burst trie variation, using the selected parameters, when running prefix queries on JusBrasil. We report both the dynamic scenario, with the burst trie indexes built without optimizations, and the static scenario, where both optimization strategies are adopted. MCD-MCK achieved the best combination of time performance and memory usage for JusBrasil in both scenarios. The table also reinforces the differences in time performance between the methods with and without the optimizations proposed here. We also compared the heuristics on the DBLP and UMBC datasets, and the conclusions about the best option were the same; we omit these comparisons to avoid redundant information. Given the results, MCD-MCK is the option chosen for our burst trie implementations in the remaining experiments.

Table 13 Processing time (ms) and memory usage (MiB) when using BEVA combined with MCD, MCK and MCD-MCK to process prefix queries in the JusBrasil dataset

6.3 Comparing burst trie with other trie representations

In this section we compare the studied data structures, namely the burst trie with the MCD-MCK heuristic (using the best parameters found in the previous section), the full trie, the compact prefix trie (CPT) and the suffix array, when processing queries on the JusBrasil dataset. Table 14 shows the query processing time achieved by each data structure, varying the number of errors from 1 to 3. Each prefix query is submitted exactly as typed by JusBrasil users, and the values reported represent the cumulative time over each character of the prefix query to obtain the final results. Times are reported with \(99\%\) confidence intervals. For these experiments we present the time and memory required when using each data structure combined with the BEVA algorithm, since it minimizes the number of active states when processing queries.

Table 14 Processing time (ms) and memory usage (MiB) when using BEVA combined with MCD-MCK (set to 8 and 120, respectively), full trie, CPT and suffix array in the JusBrasil dataset

MCD-MCK greatly reduced the memory requirement when compared to the full trie and CPT, being the method that required the least memory. MCD-MCK used only \(26.0\%\) of the memory required by the full trie, while being \(16.27\%\) slower. In addition, for the most expensive setting of allowing 3 errors, the query processing times of MCD-MCK are about half of those of the CPT method. Overall, compared to CPT, MCD-MCK was not only faster, reducing the time by about \(42.29\%\), but also reduced the memory requirement to only \(70.23\%\).

This positive result occurs because the BEVA algorithm traverses a minimized set of nodes that, once visited, do not need to be maintained, reducing the overhead of creating redundant nodes when processing our virtual tree representation of containers. The most important conclusion is that MCD-MCK, a burst trie variation with the range optimization and BFS index building, largely reduces memory usage while keeping prefix processing times similar to those achieved with the full trie. Given its memory requirements, MCD-MCK becomes an extremely attractive alternative for query autocompletion, since it uses far less memory and reaches performance close to the full trie when implementing BEVA.

MCD-MCK was also faster than the compact prefix trie (CPT) in all experimented scenarios, with significant differences in both time performance and memory requirements. Another important aspect of MCD-MCK is that its confidence intervals are not much higher than those achieved with the full trie. This result shows that there is little variation among prefix query processing times, indicating that the times are expected to be quite stable across distinct queries. This stability also further validates the parameter selection procedure adopted here: as we selected the parameters using a separate set of queries, stable results contribute to the success of the procedure.

Regarding the suffix array, which uses an n-gram approach for searching, the time for processing smaller prefixes is prohibitive considering the 100 ms limit for query autocompletion. The suffix array becomes more competitive for larger prefixes, but its performance remains worse than that of the other data structures. It also required far more memory than the burst trie with the MCD-MCK heuristic and the CPT: the n-gram approach requires the suffix array to point to every suffix of every query suggestion in the dataset, demanding a large amount of memory.

6.3.1 Performance when varying prefix sizes and number of errors

We investigated how the time for processing queries with MCD-MCK and the full trie increases as we increase the query prefix sizes and the edit distance thresholds (\(\tau\)). Figure 10 presents the results, varying the prefix size from 3 to 30 and the edit distance threshold \(\tau\) from 2 to 5, comparing the performance of MCD-MCK and the full trie with the BEVA method on the JusBrasil dataset. The suffix array was not included given its large difference in time performance, which would make it difficult to visualize the comparison among the other methods. MCD-MCK's performance remains comparable to that of the full trie, and the times for processing queries do not vary much for larger prefixes. This result is important because larger prefix queries require more accesses to virtual nodes, which could negatively affect the time performance when compared to the full trie. The good performance even for larger prefixes occurs because fewer redundant nodes are added as the query processing reaches deeper nodes of our burst trie implementations; most of the redundancy is in the first levels of the trees containing the virtual nodes. This also explains the plateau seen in Fig. 10.

Fig. 10
figure 10

Time performance (ms) of BEVA using MCD-MCK (set to 8 and 120, respectively), and full trie when processing prefix queries and varying the prefix query size and the number of errors allowed in JusBrasil dataset

As expected, the time performance of the compact trie representations was slightly worse than that of the full trie. However, the differences are not high and do not change much for long prefixes, being almost stable as the prefix size increases from 9 to 30. This is an important finding, especially for larger prefixes, where MCD-MCK accesses virtual nodes more frequently. For smaller prefixes the differences are even smaller, since there is little access to virtual nodes in MCD-MCK. These findings also hold when varying the edit distance threshold. We summarize these experiments by concluding that MCD-MCK achieves time performance close to that of full tries while using far less memory on the dataset tested.

Finally, when experimenting with an increasing number of errors, we can see that the performance of MCD-MCK degrades for higher error levels; for instance, it achieves almost the same performance as CPT when processing queries with 5 errors. Fortunately, queries with higher error levels may not make sense for query autocompletion tasks, since they may bring an increasing number of matches to suggestions unrelated to the user's intention.

We also experimented with these variations of prefix query size and edit distance on the other datasets, but omit the results, since the conclusions are the same: the time performance of MCD-MCK and the full trie is close for all experimented parameters and collections.

6.3.2 How do the methods affect scalability?

A natural question is whether the use of the static burst tries affects the throughput of systems under high query workloads. We performed experiments submitting queries with Vegeta,Footnote 10 a versatile HTTP load-testing tool built to drill HTTP services with a constant request rate, to compare the throughput achieved with each of the data structures compared in this article.

Figure 11 presents the results achieved by BEVA with the experimented data structures when processing queries with \(\tau\) values 1, 2 and 3. Times reported here include the whole server response time, including communication and other related times necessary to produce the answers. The server was implemented to compute the full set of results but to return only up to 100 of them, avoiding an excessive increase in communication time. Query autocompletion services usually send just the top results to the users for each query prefix, so this restriction makes the experiment closer to real scenarios.

An important reference in this experiment is the point where the server response time reaches the limit of 100 milliseconds (Miller, 1968) considered acceptable for autocompletion services. We also included the best burst trie implemented by us for the scenario where the range and BFS optimizations are not allowed, presented as dynamic-MCD-MCK. As shown in Fig. 11, dynamic-MCD-MCK and CPT were the first methods to exceed the 100-millisecond limit for the three values of \(\tau\) experimented. MCD-MCK and the full trie, on the other hand, supported similar workloads in the three scenarios. For instance, when \(\tau =3\) the full trie and MCD-MCK were able to process more than 350 requests per second with responses below 100 milliseconds, while CPT processed fewer than 200 requests per second. As expected, the suffix array had the worst performance among the experimented data structures; its results for \(\tau =2\) and \(\tau =3\) were not plotted because they were far above the limit of 100 milliseconds even for smaller workloads.

Fig. 11
figure 11

Processing time (ms) as the number of requests per second increases for \(\tau\) varying from 1 to 3 in the JusBrasil dataset

6.3.3 Performance when increasing the size of the dataset

We also experimented with the variation of time and memory for the MCD-MCK, CPT and full trie data structures as we increase the percentage of the JusBrasil dataset indexed from 20% to 100%. The suffix array was not included given its large difference in time performance, which would make the comparison among the other methods difficult to visualize. Figure 12 shows how the methods behave. While MCD-MCK presents a performance close to the full trie, the ratio between the query processing times of MCD-MCK and of the full trie slightly increases with the amount of data indexed: when indexing only 20% of the dataset, MCD-MCK is 12.4% slower than the full trie, and this difference increases to \(16.27\%\) when indexing 100% of the dataset. The increase is more pronounced for CPT, which is 40% slower than the full trie when indexing 20% of the dataset and 101% slower when indexing the full JusBrasil dataset. This growth in the differences between the methods should be carefully studied when indexing larger query suggestion datasets. The figure also shows that MCD-MCK presents a significant reduction in memory requirements when compared to the full trie, while keeping close time performance.

Fig. 12
figure 12

Time performance (ms) and memory usage (MiB) when processing queries with BEVA and indexing increasing percentages of the JusBrasil dataset (20%, 40%, 60%, 80%, 100%) with data structures MCD-MCK (set to 8 and 120, respectively), full trie and CPT

6.4 Comparing trie indexes in word-by-word mode

The experiments so far considered a scenario where the system matches the whole prefix typed by the user against the complete string of each query suggestion, called mode 1 in the taxonomy presented by Krishnan et al. (2017), here extended to allow errors. Another possible usage of the data structures compared here is a scenario where the match is performed word by word, mode 2 of the taxonomy, here also allowing errors.

In this new scenario, we store a vocabulary containing all distinct words of the query suggestions, and associate each word with a list pointing to all query suggestions where the word occurs. The prefix typed by the user is parsed and divided into words. The first step is to match the words found in the typed prefix against the words in the vocabulary. After this matching, the system may perform list operations, such as intersection or union of suggestion lists, to find the final set of suggestions to present to the user. Here we report only the performance of the first step, since we are using the data structures to perform prefix match operations; a sketch of both steps is shown below.
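A hedged sketch of this scenario follows. `vocab_index.search` stands for the error-tolerant prefix search over the vocabulary (the part measured in Table 15, implemented with any of the compared data structures), while the inverted lists and the intersection are the follow-up step not timed in our experiments.

```python
def suggest_word_by_word(prefix: str, vocab_index, inverted: dict, tau: int) -> set:
    """Mode 2: match each typed word against the vocabulary, then intersect lists."""
    per_word = []
    for word in prefix.split():
        matches = vocab_index.search(word, tau)   # error-tolerant word prefix search
        ids = set()
        for m in matches:
            ids.update(inverted[m])               # suggestions containing word m
        per_word.append(ids)
    # keep suggestions that contain a match for every word typed by the user
    return set.intersection(*per_word) if per_word else set()
```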

Table 15 presents the results achieved by the compared data structures when performing word prefix match in the JusBrasil dataset. The memory usage reported includes only the space to store the vocabulary and the data structure adopted in each experiment, excluding the space to store the suggestion dataset and the inverted lists associated with each vocabulary entry. The reported times likewise include only the time for performing the error-tolerant prefix search on the vocabulary: each word parsed from the prefix is submitted as a prefix search, and the result is the list of vocabulary words that match the words found in the prefix queries. The time performance of MCD-MCK was quite close to that of the full trie, and the space required was again several times smaller: the full trie required 196.42 MiB, while MCD-MCK required only 54.43 MiB. The space required by CPT in this scenario was slightly smaller than that of MCD-MCK, while its time performance was slightly worse. We may say that both CPT and MCD-MCK provide a good trade-off between memory usage and time performance when implementing word prefix search for the JusBrasil dataset.

The time performance of the suffix array was also considerably worse than that of the other data structures in this scenario, but the times obtained for all error levels are quite small compared to the limit of 100 milliseconds usually adopted for query autocompletion systems.

Table 15 Processing time (ms) and memory usage (MiB) in word-by-word mode when using BEVA combined with MCD-MCK (set to 8 and 120, respectively), full trie, CPT and suffix array when indexing the JusBrasil dataset

6.5 Experiments with DBLP and UMBC datasets

Tables 17 and 18 present the results achieved when processing queries on the synthetic DBLP and UMBC datasets, varying the number of errors and the prefix sizes. Table 16 presents the parameters chosen for MCD-MCK on these two datasets; the parameter selection followed the same procedure described for the JusBrasil dataset.

Table 16 Selected parameters for BEVA using MCD-MCK in DBLP and UMBC datasets

The results for both DBLP and UMBC show that the relative performance of the methods is compatible with the conclusions drawn from the JusBrasil experiments: again, MCD-MCK offers a useful balance of time performance and required memory. We were not able to run the full trie on the UMBC dataset, since it required more memory than available on our server (more than 64 GiB), while MCD-MCK was able to process queries using only about 18 GiB (18,194 MiB). Furthermore, MCD-MCK was more than 7 times faster than CPT when processing UMBC queries, and also faster than CPT on DBLP. Finally, regarding the suffix array, these experiments show that it becomes more competitive in time performance only for \(\tau =1\) and for larger prefixes. The algorithm adopted for processing queries with the suffix array becomes more efficient as the searched prefix grows; however, this property is not convenient for query autocompletion systems, which usually start searching for suggestions after only a few characters, for instance 3 or 4, are typed by the user.

Table 17 Time performance (ms) and memory usage (MiB) when processing queries with BEVA and indexing DBLP dataset with data structures MCD-MCK (set to 10 and 200, respectively), full trie, CPT and suffix array
Table 18 Time performance (ms) and memory usage (MiB) when processing queries with BEVA and indexing the UMBC dataset with data structures MCD-MCK (set to 8 and 200, respectively), full trie, CPT and suffix array

7 Conclusion

We have proposed and studied alternative ways of using burst tries to implement error-tolerant prefix search with the trie-based algorithm BEVA. The proposed alternatives reduce memory consumption while keeping performance close to that achieved with the full trie. When processing the JusBrasil dataset, the burst trie with the MCD-MCK heuristic processed queries only 16% slower than the full trie index when implemented with BEVA, while reducing memory usage to almost one-fourth of the space required by the system using the full trie. MCD-MCK was also faster than CPTs in most of the experiments, even when using parameters that resulted in less memory usage than CPTs. A general conclusion is that the idea of using virtual nodes in tries can produce a good balance between efficiency and memory consumption, being a competitive alternative for implementing error-tolerant prefix search algorithms.

We also studied the impact of two optimizations when implementing the tries: the range optimization and the BFS optimization. The BFS optimization produces cache-friendly structures for query autocompletion, since it organizes the index in the order adopted by trie-based search algorithms such as BEVA. Compared to the usual DFS index building, we verified that BFS yielded a significant gain in time performance when processing queries in all scenarios and collections tested, making it an alternative to be considered when designing autocompletion systems.

For future work, we plan to study alternative ways of combining CPT trees with burst tries, considering the use of the CPT as the access trie in burst trie implementations. This direction brings several challenges, including alternative ways of adapting the burst heuristics to the CPT. Further, the access trie is already compact when compared to the size of the dataset, and CPT has been shown to be slower than the full trie, which may limit the potential benefits of this idea. A second line of future work is to study the impact of the proposed ideas when implementing ranking and pruning functions for query autocompletion. Combining pruning strategies with the burst trie implementations proposed here may result in time performance even closer to that of full tries, since pruning may significantly reduce the accesses to nodes in deeper levels of the trie.