Dynamic optimization of queries in pivot-based indexing
This paper evaluates the use of standard database indexes and query processing as a way to do metric indexing in the LAESA approach. By utilizing B-trees and R-trees as pivot-based indexes, we may use well-known optimization techniques from the database field within metric indexing and search. The novelty of this paper is that we use a cost-based approach to dynamically evaluate which and how many pivots to use in the evaluation of each query. By a series of measurements using our database prototype we are able to evaluate the performance of this approach. Compared to using all available pivots for filtering, the optimized approach gives half the response times for main memory data, but much more varied results for disk resident data. However, by use of the cost model we are able to dynamically determine when to bypass the indexes and simply perform a sequential scan of the base data. The conclusion of this evaluation is that it is beneficial to create many pivots, but to use only the most selective ones during evaluation of each query. R-trees give better performance than B-trees when utilizing all pivots, but when being able to dynamically select the best pivots, B-trees often provide better performance.
Keywords: Similarity search, Pivot-based indexing, Database trees, Optimized query processing
Similarity search is gaining interest both for structured and unstructured objects. It is also important in domains where canonical ordering of data is not possible, for instance multidimensional vector spaces or general metric spaces. The domains we will investigate contain either large vector spaces, or prohibitively expensive exact distance calculations, making a full scan to answer similarity queries costly. A query is typically formalized as a sample object, and the query is evaluated against a database of objects by issuing comparison of similarity.
There are many applications of metric indexing and search. These range from entertainment and multimedia to science and medicine, or applications that require efficient query-by-example, but where traditional spatial access methods cannot be used. Beyond direct search, similarity retrieval can be used internally in a wide spectrum of systems, such as nearest neighbor classification, compressed video streaming and multiobjective optimization.
Our motivation is to exploit database internals such as index structures and query processing as a means to solve similarity queries. We have chosen to work with the pivot-based LAESA approach to metric indexing and search. There are several existing works that have done similar research [14, 15, 23], and we build on these by exploiting more database internal techniques, having direct access to indexes, buffers and algebraic query processing capabilities.
Our approach has been to exploit B+-trees and R*-trees and to use parallel hash joins in between sequential range scans of these ordered indexes. Our conclusion is that R*-trees seem to be best, especially at querying time. The use of these indexes is dependent on the query, i.e., the larger the range limit of the query, the less useful the indexes become. Furthermore, the query will be most efficiently processed using statistics on how many and which indexes to use. This is well known in database systems since the System-R days.
We have exploited this idea and created statistics for each access path: scans, B+-trees and R*-trees. Before we evaluate a query we calculate the estimated cost of executing the query using different access paths. This allows us to use the best query execution strategy based on a dynamic optimization for each query. When a sequential scan is estimated to be cheapest, it will be chosen to evaluate the query.
The main contribution of this paper is the application of statistics to support query evaluation. This is a technique that could be of use to several existing metric indexing methods, but rather than trying to compare the relative merits of such methods, we have focused on one approach to demonstrate the potential for improvement. We use a pivot filtering approach similar to OMNI, because of its simplicity, and because the use of B+-trees and R*-trees highlights the strengths of our method.
The organization of this paper is as follows. We start by comparing our approach to similar works on database indexes and pivot selections. Then, a description of the architecture and the design is given. We present a set of initial results using our database prototype. The results of our optimization are then presented and, finally, some concluding remarks are given and directions for further research are outlined.
2 Related work
The starting point for this research is LAESA, which is based on pre-computed distances between the objects in the database. Instead of indexing the distance between all pairs of objects, as done in the AESA approach, only a fixed-size subset is used as sample objects, or pivots. AESA is regarded as the best method for filtering, but it relies on storing pre-computed distances between all objects. LAESA, which relies on using a reduced pivot set, lowers the time and memory complexity, while increasing the number of distance calculations needed during query processing. This is a trade-off between indexing time and memory usage on the one hand and query processing time on the other. The LAESA algorithm starts with one arbitrary pivot and with this it scans every object to eliminate and select candidates. At the same time it applies a heuristic to select the next pivot to use for elimination and selection. Similar online pivot selection approaches are used in AESA and iAESA as well. Unlike these methods, we store the distances from a pivot to all other objects in an ordered index, which allows us to skip large parts of the object scans done in LAESA. Since LAESA calculates the next pivot based on the previous ones, it is also somewhat harder to parallelize.
Spaghettis  is similar to LAESA, but this method sorts the distances columnwise, and uses binary search to find ranges of candidate objects. Furthermore, Spaghettis uses links in-between the columns such that an object in one column is linked to the same object in the next column. Unlike Spaghettis, we create these links after filtering by using standard database joins.
The work that is most similar to ours is the OMNI approach . OMNI is based on the selection of several foci (pivots), and indexing the distance from all objects to all pivots. Metric range queries can then be performed using range queries from each focus and intersecting the results, while kNN queries can be performed with a predefined or estimated range query followed by a post processing step. OMNI concludes that the number of pivots should follow the intrinsic dimensionality of the data set, while our results show that the query and its radius is very important on deciding which pivots and which number of pivots to use.
iDistance  is based on using several pivots and partitioning the data set according to distance to the nearest pivot. The unique aspect of iDistance is how the distances are stored. The distances between pivots and their data objects are stored in intervals as a part of one large B+-tree. MB+-tree  stores two different trees: a B+-tree for the objects and an additional main-memory block tree that is used to partition the B+-tree. None of these two approaches allows for using cost estimation and dynamic optimization as a technique to improve the efficiency of query processing.
Bustos et al. propose to select a set of pivots carefully by having an efficiency criterion that maximizes the probability of filtering. Based on this, an incremental algorithm is developed that allows pivots to be added dynamically to the index. SSS is another dynamic algorithm for selecting pivots off-line. An object is chosen as a new pivot at insertion time if it is far enough away from the existing pivots. Empirically they have found “far enough” to be Mα, where M is the maximum distance found and α is a number in the range from 0.35 to 0.40. These two works are combined by Bustos et al. When a new pivot is to be added to the index, it is checked whether any existing pivot has become redundant, i.e., its contribution has become low according to the efficiency criterion. All three of these works show how to select good pivots off-line from a large set of pivots. This is orthogonal to our work, in the sense that our algorithms may work with any set of pivots.
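To make the “far enough” rule concrete, the following is a minimal sketch of an SSS-style selection pass, written from the description above only. The function name `sss_pivots`, the parameter names, and the use of a precomputed maximum distance `max_dist` are assumptions made for illustration; they are not taken from the SSS paper or from our prototype.

```python
def sss_pivots(objects, dist, max_dist, alpha=0.4):
    """Pick pivots in insertion order (hypothetical sketch of the SSS rule).

    An object becomes a pivot if its distance to every pivot chosen so far
    is at least alpha * max_dist. The first object always qualifies.
    """
    threshold = alpha * max_dist
    pivots = []
    for o in objects:
        if all(dist(o, p) >= threshold for p in pivots):
            pivots.append(o)
    return pivots


# Usage on one-dimensional toy data with M = 10 and alpha = 0.4.
toy = [0, 1, 5, 6, 10]
chosen = sss_pivots(toy, lambda a, b: abs(a - b), max_dist=10, alpha=0.4)
```

Note that the number of pivots is not fixed in advance here; it follows from α and the distribution of the data.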
Ciaccia et al.  show how to estimate the cost of query evaluation using M-trees. This information is used to tune the M-tree to minimize CPU and I/O cost. However, they do not consider pivot-based methods, where you may use this to dynamically optimize the processing of each query.
Baioco et al.  argue that the selectivity estimation techniques should consider the distribution given by the intrinsic dimension, which is usually lower than the representational dimension. They develop a cost estimation technique based on the correlation fractal dimension for Slim trees .
Fredriksson  extends the Hierarchy of Clusters (HC) tree to handle varying query radii. In other words, the optimal degree of unbalance in the structure depends on the query radius, so several indexes are built, and the one that will work best for a given query is selected. This is similar to our work; however, we may use the same indexes for any query radius.
3 Pivot-based indexing and querying
The indexing method is based on LAESA, where precomputed distance calculations are the foundation. The distances from some of the data objects, the pivots, to all other objects in the database are precomputed. These distances are stored such that traditional database-type range queries can be used to filter the data objects, resulting in a small set of objects that will need exact distance calculations.
3.1 Metric spaces
3.3 Filtering and query evaluation
Filtering is the most important operation in a metric indexing system, and is where the different indexing methods we propose are important.
The filtering process must be supplied with two parameters, the query object q and a range limit r. In addition, a set of indexes supporting range queries is needed. The process starts by performing a range scan on each index file. For a data object to be within the given range of the query, its distance to the pivot must also lie within a certain range. This follows from the triangle inequality, and results in the inclusion only of objects oi that satisfy |d(p, q) − d(p, oi)| ≤ r, where p is a pivot object. As d(p, oi) is pre-calculated and stored in the index, only d(p, q) must be calculated, and all objects whose stored distance lies between d(p, q) − r and d(p, q) + r are returned as candidates for this pivot. By combining the filtering of several pivots, we may get a candidate set that is small.
For metric range queries the intersection of the candidate sets is of interest. The candidate set returned from each index file is joined with the candidate set of every other index file. Only the objects that exist in every candidate set are returned. After this filtering we need to calculate the exact distance to every object in the resulting candidate set, which is the post-processing step. This could be a costly step, depending on the complexity of the objects in question. With respect to Fig. 1 this means taking every object inside the dotted area and determining whether it is inside the circle with the broken perimeter.
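The filtering and post-processing steps can be sketched in a few lines. This is an illustrative sketch only: a sorted in-memory list stands in for each ordered index, a set intersection stands in for the hash join, and all function names (`build_pivot_index`, `range_scan`, `range_query`) are hypothetical rather than part of our prototype.

```python
import bisect
import math


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_pivot_index(objects, pivot, dist):
    """Sorted list of (d(pivot, o_i), i); a stand-in for an ordered index."""
    return sorted((dist(pivot, o), i) for i, o in enumerate(objects))


def range_scan(index, low, high):
    """Ids of all entries whose stored distance lies in [low, high]."""
    start = bisect.bisect_left(index, (low, -1))
    ids = set()
    for d, i in index[start:]:
        if d > high:
            break
        ids.add(i)
    return ids


def range_query(objects, pivots, indexes, q, r, dist):
    """Filter with every pivot, intersect the candidates, then verify."""
    candidates = None
    for p, index in zip(pivots, indexes):
        dpq = dist(p, q)  # the only distance computed per pivot
        ids = range_scan(index, dpq - r, dpq + r)
        candidates = ids if candidates is None else candidates & ids
    if candidates is None:  # no pivots: fall back to all objects
        candidates = set(range(len(objects)))
    # Post-processing: exact distance for every surviving candidate.
    return sorted(i for i in candidates if dist(q, objects[i]) <= r)


objects = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
pivots = [(0, 0), (6, 5)]
indexes = [build_pivot_index(objects, p, euclidean) for p in pivots]
result = range_query(objects, pivots, indexes, (0.5, 0), 1.0, euclidean)
```

Because the triangle inequality guarantees that no qualifying object is filtered out, the result equals that of a full scan; the pivots only reduce the number of exact distance calculations.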
4 Initial performance experiments
We have performed a set of initial measurements to evaluate the use of database indexes to support metric indexing and search.
We have built a small database kernel for the sake of research on databases and search technology. The aim is to easily build new indexes and search algorithms to do experiments.
NTNUStore is based on NEUStore . NTNUStore is a Java library made to experiment with query processing, buffer management and indexes. Currently, we have R*-trees and B+-trees as indexes. On the query processing side we have focused on efficient range scans and on parallel hash join processing.
The B+-trees are implemented for insertion and updates, but without support for node deletion. For our application this suffices because we never delete data from the database. Our B+-tree allows for duplicate keys because the keys in the indexes are distances, which may very well be equal. For some types of data, e.g., document similarity, the distances are not well distributed, which gives many equal distances. Our B+-tree assumes random insertions and splits blocks in the middle. The records in this application are of equal size, so a simple split at the middle record is used. The B+-tree data are kept as objects when residing in memory, but are converted to a serialized form when written to disk. The opposite happens when a block is read from disk into memory.
The R*-trees are chosen due to their way of doing insertions and block splits, where the optimal way is selected according to what resides in the R-tree block. The R*-tree is more CPU intensive during insertions than the B+-tree. This is mainly due to the CPU intensive algorithm for calculating minimum overlap between minimum bounding rectangles in the R-tree blocks. Like the B+-tree, the R*-tree also does a serialization to/from memory objects while being written and read to/from disk.
The records in the R*-tree are minimum bounding rectangles, MBRs. Our R*-tree uses two different criteria for choosing subtrees at insertions: Minimization of overlap of areas when operating at the leaf level, and minimization of the areas covered by each MBR when being at a non-leaf level .
4.3 Experiment setup
Initial values for experiments (table; only the row label "Number of queries" is recoverable)
We have used the NASA data found at the SISAP Metric Space library, which is a set of 40,150 20-dimensional feature vectors, generated from images downloaded from NASA and with duplicate vectors eliminated. This set of data fits in the buffer of the database. We have used two different distances on this set of data: the traditional Euclidean distance (L2) and the quadratic form distance (QFD).
For querying, we separate into the filtering and the post-processing phase. A higher number of pivots will give better filtering, resulting in less post-processing, but increases the filtering cost.
In our experiments we have used a static selection of pivots which is based on the one used in OMNI . The basic principle is to maximize the distance between the pivots. We calculate the pivots before starting to insert the objects. When trying to find a new pivot our algorithm maximizes the distance from the current set of pivots to the candidate pivot.
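One common way to realize "maximize the distance from the current set of pivots" is greedy farthest-first selection. The sketch below is an assumption on two counts: it measures the distance to a set as the minimum distance to any pivot already chosen, and it seeds the process with the object farthest from an arbitrary first object. Neither detail is stated above, and the function name `select_pivots` is hypothetical.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_pivots(objects, k, dist):
    """Greedy farthest-first selection; returns indexes into objects.

    Assumption: the next pivot maximizes the minimum distance to the
    pivots chosen so far.
    """
    n = len(objects)
    if n == 0 or k <= 0:
        return []
    # Seed: the object farthest from an arbitrary object (objects[0]).
    first = max(range(n), key=lambda i: dist(objects[0], objects[i]))
    chosen = [first]
    # min_dist[i] is the distance from object i to its nearest pivot.
    min_dist = [dist(objects[first], o) for o in objects]
    while len(chosen) < min(k, n):
        nxt = max(range(n), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], o))
    return chosen


points = [(0, 0), (1, 0), (10, 0), (5, 0)]
pivot_ids = select_pivots(points, 3, euclidean)
```

Each round costs one pass over the data, so selecting k pivots from n objects takes on the order of k·n distance calculations.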
4.4 Number of pivots
Average size of result of filtering
This shows that for evaluating a query, the query radius must be considered when deciding how many pivots and which indexes to scan. The intrinsic dimensionality of the NASA data using Euclidean distance is 5.2, according to the formula found in Chávez et al. Our results suggest that not only the dimensionality of the data is important in deciding the optimal number of pivots, but the query as well. This may be addressed by having many pivots, but using only the optimal number and the most selective indexes when issuing a query, i.e., the indexes that retrieve the fewest blocks from the database for that query. To implement this we need to maintain statistics about the selectivity of each index, for example by maintaining equi-depth histograms.
5 Dynamic optimization – access path selectivity
Based on the initial runs we have seen that the optimal number of pivots is dependent on the query radius. This gave us the idea that the number of pivots, and which pivots to use, could be decided at run time for each query. This can be done by using traditional database optimization techniques. By estimating the selectivity of different access paths, we can choose the best ones for each query. Furthermore, by applying full cost estimation of different query evaluation plans we can decide in advance how many pivots to use. Because each pivot is just a filter to remove irrelevant results, we may freely decide how many to use.
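The decision can be illustrated with a deliberately simplified cost model. Everything in this sketch is an assumption made for illustration: the cost units, the independence of pivot selectivities (so that the surviving fraction is their product), and the name `choose_plan`. It is not the cost model of our prototype.

```python
def choose_plan(n_objects, selectivities, scan_cost, dist_cost):
    """Return (plan, cost); plan lists index ids, [] means sequential scan.

    Hypothetical cost model:
      sequential scan: read every object and compute its exact distance;
      k indexes: scan the selected range of each index, then fetch and
                 verify the objects expected to survive the intersection.
    """
    best_plan = []
    best_cost = n_objects * (scan_cost + dist_cost)
    # Try the 1, 2, ..., m most selective indexes, in that order.
    ranked = sorted(range(len(selectivities)), key=lambda i: selectivities[i])
    filter_cost = 0.0
    surviving = 1.0
    for k, i in enumerate(ranked):
        filter_cost += selectivities[i] * n_objects * scan_cost
        surviving *= selectivities[i]  # independence assumption
        cost = filter_cost + surviving * n_objects * (scan_cost + dist_cost)
        if cost < best_cost:
            best_plan, best_cost = ranked[: k + 1], cost
    return best_plan, best_cost


# Expensive distance function: two of the three indexes pay off.
plan, cost = choose_plan(1000, [0.5, 0.1, 0.9], scan_cost=1.0, dist_cost=10.0)
# Cheap distance function and unselective indexes: sequential scan wins.
plan_cheap, _ = choose_plan(1000, [1.0, 1.0], scan_cost=1.0, dist_cost=0.1)
```

Even in this toy model the two effects discussed above appear: adding a weakly selective pivot increases the total cost, and with a cheap distance function the indexes may be bypassed altogether.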
To do this we need to maintain statistics about each access path in the database. We choose to use equi-depth histograms to represent statistics for the data distribution for each access path [13, 19]. These statistics are well suited for estimating the size of the result set.
For B+-trees we create equi-depth histograms simply by scanning blocks at level 1, the level above the leaves. We assume each leaf level block to contain the same number of records. When estimating the cost of a range query for a specific B+-tree, we calculate the distance from the query to the pivot and estimate how large a portion of the B+-tree is within the range by counting the number of bins that the range covers.
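A small sketch of this kind of estimate follows. It builds the bin boundaries from a sorted list of keys rather than from level-1 blocks of a B+-tree, and it interpolates linearly inside a partially covered bin; both choices, as well as the names `build_equi_depth` and `estimate_fraction`, are assumptions made for illustration.

```python
def build_equi_depth(sorted_keys, bins):
    """Boundaries of bins that each hold roughly the same number of keys.

    Stand-in for reading the separator keys at level 1 of a B+-tree.
    """
    n = len(sorted_keys)
    bounds = [sorted_keys[(j * n) // bins] for j in range(bins)]
    bounds.append(sorted_keys[-1])
    return bounds


def estimate_fraction(bounds, low, high):
    """Estimated fraction of keys in [low, high].

    Each bin contributes 1/bins, scaled by how much of it the range covers
    (assumes keys are spread uniformly inside a bin).
    """
    bins = len(bounds) - 1
    total = 0.0
    for j in range(bins):
        b_lo, b_hi = bounds[j], bounds[j + 1]
        if b_hi == b_lo:
            # Bin made up of identical keys (many equal distances).
            covered = 1.0 if low <= b_lo <= high else 0.0
        else:
            covered = max(0.0, min(high, b_hi) - max(low, b_lo)) / (b_hi - b_lo)
        total += covered / bins
    return total


keys = [i / 100 for i in range(100)]          # stored distances d(p, o_i)
bounds = build_equi_depth(keys, 10)
# Range query with d(p, q) = 0.3 and r = 0.1 scans keys in [0.2, 0.4].
fraction = estimate_fraction(bounds, 0.2, 0.4)
```

The estimate needs only the bin boundaries, so it can be computed before any leaf block is read.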
For R*-trees we do something similar to B+-trees, but in this case we find the fraction of overlap between the query’s region and the regions of the records at level 1 in the R*-tree. This gives a reasonable estimate of the size of the candidate set for each R*-tree. The experiments with the NASA data suggest that this is a sufficient method for estimating the selectivity of each R*-tree.
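The overlap computation can be sketched as follows. The sketch assumes that every level-1 entry covers the same number of records and that records are spread uniformly inside each MBR; the query region is the box with sides [d(p_j, q) − r, d(p_j, q) + r], one side per pivot p_j stored in the tree. The function names are hypothetical.

```python
def overlap_fraction(mbr, query):
    """Fraction of an MBR's volume that lies inside the query box.

    Both arguments are sequences of (low, high) pairs, one per dimension.
    """
    frac = 1.0
    for (lo, hi), (q_lo, q_hi) in zip(mbr, query):
        inter = min(hi, q_hi) - max(lo, q_lo)
        if inter < 0:
            return 0.0  # disjoint in this dimension
        side = hi - lo
        frac *= 1.0 if side == 0 else inter / side
    return frac


def estimate_candidates(level1_mbrs, records_per_entry, query):
    """Estimated candidate count: overlap-weighted sum over level-1 entries."""
    return sum(overlap_fraction(m, query) * records_per_entry
               for m in level1_mbrs)


# Two level-1 entries in a tree that stores two pivots per record.
mbrs = [((0.0, 1.0), (0.0, 1.0)), ((1.0, 2.0), (0.0, 1.0))]
query_box = ((0.5, 1.5), (0.0, 1.0))
estimate = estimate_candidates(mbrs, 100, query_box)
```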
6 The effect of dynamic optimization
We have run a few sets of measurements of the dynamic optimization. In the first two sets we have used the NASA data, 12 pivots and 50 queries. All results here show the accumulated query time for 50 queries. This includes both filtering and post-processing times, where the query is compared with the candidate result set. For the dynamic optimization methods, the query times include the time spent selecting indexes.
Different access paths used in the experiments
Sequential scan of all objects
Filtering using all 3 R*-trees
Filtering using the most selective R*-tree
Filtering using the two most selective R*-trees
Filtering using all 12 B+-trees
Filtering using the most selective B+-tree
Filtering using the two most selective B+-trees
Filtering using the three most selective B+-trees
Filtering using the 4 most selective B+-trees
Filtering using the 5 most selective B+-trees
We have chosen to also measure the cost of sequential scan, because this is often the best solution when the distance function is cheap. In our experiments the objects themselves are stored in a separate data B+-tree. Sequential scan is supported by creating a cursor on the leaf level of this B+-tree.
Query time for QFD using NASA data
We have merely used the quadratic form distance (QFD) as an example of an expensive comparison function. In our measurements we have used the identity matrix, thus letting the similarity score be equal to the traditional L2 distance.
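The relationship between the two distances is easy to check. The sketch below uses the standard definition of the quadratic form distance, d(x, y) = sqrt((x − y)ᵀ A (x − y)); the function names are ours and chosen for illustration.

```python
import math


def qfd(x, y, a):
    """Quadratic form distance sqrt((x - y)^T A (x - y))."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    total = 0.0
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * a[i][j] * dj
    return math.sqrt(total)


def l2(x, y):
    """Euclidean distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def identity(n):
    """n-by-n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
d_qfd = qfd(x, y, identity(3))
d_l2 = l2(x, y)
```

With the identity matrix both functions return the same value, but the quadratic form does work proportional to the square of the dimension per comparison, while L2 is linear in the dimension. This is what makes QFD a convenient example of an expensive distance function that leaves the result sets unchanged.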
Query time for L2 using NASA data
In this experiment the best solution is to use the most selective R*-tree (77 milliseconds). Sequential scan takes 213 milliseconds, and the best B+-tree solution is to use the two most selective B+-trees (192 milliseconds). Using all 12 pivots is not optimal here either, but when using R*-trees it is still better than sequential scan.
Which B+-tree is most selective?
Varying radii using the NASA data (r = 0.05, 0.1, 0.2, 0.4, 0.8)
Query time for QFD using COLORS data
Query time for QFD using COLORS data with 24 available pivots
Query time for QFD using NASA with small buffer
Query time for L2 using vectors data
All in all, by using statistics we are able to pick the most selective access paths for the queries issued, resulting in better response time. According to our measurements it doubles the performance for main memory data, and may give some improvements for disk resident data, but will often rely on sequential scan when there is bad selectivity in the indexes for the queries issued.
7 Conclusions and further work
The basic idea behind our research was to exploit knowledge of database structures and processing to support similarity search. We chose the LAESA method as a testbed for our approach.
We have performed experiments with various parameters and access structures. The initial conclusion is that R-trees seem to be the winner. This is mainly due to the fact that many pivots are pre-joined in each R-tree. It also has less demand on memory and disk I/O, because there are fewer blocks to scan. However, when being able to dynamically choose the most selective pivots, B+-trees sometimes provide better performance because there are more pivots to choose from.
We discovered that the optimal number of pivots to be used is very dependent on the distribution of the data and on the query itself, i.e., the range limit used. Therefore, the system needs to consider how many and which of the indexes to use when evaluating the query. This is done by maintaining statistics, in the form of equi-depth histograms, for each index and by using a cost model. By doing this we were able to choose the most selective indexes for each query dynamically. Our performance measurements show that this gives better performance than using a fixed set of pivots for most types and numbers of indexes we have tested. For disk resident data sequential scan is often the best solution.
By registering which pivots are most selective according to a query log, we could dynamically remove some pivots and try to create some new better ones. Our current plan is to extend our work by performing experiments with different types of data. By this we hope to gain further insight into the area and possibly to improve our method. We also plan to integrate the similarity search with traditional database type of queries, such that it becomes an integrated platform for the next generation search.
In our experiments, we have demonstrated improvements over direct pivot filtering, using all pivots, when applied in an OMNI-like setting. While this is perhaps the setting that most resembles the origins of our selection method, the method may well have wider applicability. In the future, it would be interesting to examine whether similar statistics-based online pivot selection would be beneficial in other indexing methods, where the pivot filtering is based on in-memory distance tables.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 3. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the 7th international conference on database theory. Lecture notes in computer science, vol 1540. Springer-Verlag, London, UK, pp 217–235
- 6. Chávez E, Marroquín JL, Baeza-Yates R (1999) Spaghettis: an array based algorithm for similarity queries in metric spaces. In: Proceedings of the string processing and information retrieval symposium & international workshop on groupware (SPIRE). IEEE Computer Society, pp 38–46
- 8. Ciaccia P, Patella M, Zezula P (1998) A cost model for similarity queries in metric spaces. In: Proc. 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS’98), pp 59–68
- 9. Figueroa K, Chávez E, Navarro G, Paredes R (2006) On the least cost for proximity searching in metric spaces. In: Àlvarez C, Serna M (eds) Proceedings of the 5th international workshop on experimental algorithms. Lecture notes in computer science, vol 4007. Springer, pp 279–290
- 10. Figueroa K, Navarro G, Chavez E (2010) SISAP: metric space library. http://sisap.org/Home.html
- 12. Hetland ML (2009) The basic principles of metric indexing. In: Coello Coello C, Dehuri S, Ghosh S (eds) Swarm intelligence for multi-objective problems in data mining. Springer-Verlag
- 13. Ioannidis Y (2003) The history of histograms (abridged). In: VLDB ’2003: proceedings of the 29th international conference on very large data bases. VLDB Endowment, pp 19–30
- 14. Ishikawa M, Chen H, Furuse K, Yu JX, Ohbo N (2000) Mb+tree: a dynamically updatable metric index for similarity searches. In: WAIM ’00: proceedings of the first international conference on web-age information management. Springer-Verlag, London, UK, pp 356–373
- 16. Manolopoulos Y, Nanopoulos A, Papadopoulos AN, Theodoridis Y (2005) R-Trees: theory and applications (advanced information and knowledge processing), 1st edn. Springer
- 24. Zhang D (2008) Neustore: a simple java package for the construction of disk-based, paginated, and buffered indices. http://www.ccs.neu.edu/home/donghui/research/neustore/