Aggregation Query, Spatial
Synonyms
Definition
Given a set O of weighted point objects and a rectangular query region r in d-dimensional space, the spatial aggregation query asks for the total weight of all objects in O that are contained in r.
This query corresponds to the SUM aggregation. The COUNT aggregation, which asks for the number of objects in the query region, is the special case in which every object has weight 1.
The problem can in fact be reduced to a special case, called the dominance-sum query. An object o1 dominates another object o2 if o1 has a larger value than o2 in every dimension. The dominance-sum query asks for the total weight of the objects dominated by a given point p. It is the special case of the spatial aggregation query in which the query region is described by two extreme points: the lower-left corner of the space and p.
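The spatial aggregation query can be reduced to the dominance-sum query in the 2D space as follows. Given a query region r (a 2D rectangle) with lower-left corner (x1, y1) and upper-right corner (x2, y2), inclusion-exclusion over the four corners of r gives sum(r) = domsum(x2, y2) - domsum(x1, y2) - domsum(x2, y1) + domsum(x1, y1), up to the treatment of points that lie exactly on the boundary of r.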
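The 2D reduction rests on inclusion-exclusion over the four corners of the query rectangle. The following is a minimal sketch of that identity, not code from the cited works: the function names are hypothetical, the dominance-sum is computed by a linear scan purely for illustration, and the query rectangle is assumed to be half-open, [x1, x2) x [y1, y2), so that it agrees with strict dominance.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, weight)


def dominance_sum(points: List[Point], px: float, py: float) -> float:
    """Total weight of the points strictly dominated by (px, py)."""
    return sum(w for x, y, w in points if x < px and y < py)


def range_sum(points: List[Point], x1: float, y1: float,
              x2: float, y2: float) -> float:
    """SUM over the half-open rectangle [x1, x2) x [y1, y2),
    expressed as four dominance-sum queries at the corners."""
    return (dominance_sum(points, x2, y2)
            - dominance_sum(points, x1, y2)
            - dominance_sum(points, x2, y1)
            + dominance_sum(points, x1, y1))


def range_sum_direct(points: List[Point], x1: float, y1: float,
                     x2: float, y2: float) -> float:
    """Reference answer: scan the points and test containment directly."""
    return sum(w for x, y, w in points if x1 <= x < x2 and y1 <= y < y2)
```

Any index that answers `dominance_sum` quickly therefore answers the rectangular query with four lookups; in d dimensions the same argument uses the 2^d corners of the box.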
Historical Background
In computational geometry, to answer the dominance-sum query, an in-memory and static data structure called the ECDF-tree (Bentley 1980) can be used. The ECDF-tree is a multi-level data structure, where each level corresponds to a different dimension. At the first level (also called the main branch), the d-dimensional ECDF-tree is a full binary search tree whose leaves store the data points, ordered by their position in the first dimension. Each internal node of this binary search tree stores a border for all the points in its left subtree. The border is itself a (d-1)-dimensional ECDF-tree; here points are ordered by their positions in the second dimension. The collection of all these border trees forms the second level of the structure. Their respective borders are (d-2)-dimensional ECDF-trees (using the third dimension), and so on. To answer a dominance-sum query for point p = (p1, …, pd), the search starts at the root of the first-level ECDF-tree. If p1 falls in the left subtree, the search continues recursively on the left subtree. Otherwise, two queries are performed, one on the right subtree and the other on the border; the respective results are then added together.
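The search procedure above can be sketched for d = 2. This is an illustrative in-memory version under stated assumptions, not Bentley's original formulation: the class and function names are hypothetical, and the (d-1)-dimensional border, which for d = 2 is a 1D structure, is represented by a sorted array with prefix sums rather than by a tree.

```python
import bisect
from itertools import accumulate


class Border1D:
    """1D dominance-sum: sorted keys plus prefix sums of their weights."""

    def __init__(self, items):
        items = sorted(items)                      # (key, weight) pairs
        self.keys = [k for k, _ in items]
        self.prefix = [0] + list(accumulate(w for _, w in items))

    def query(self, q):
        # Total weight of keys strictly smaller than q.
        return self.prefix[bisect.bisect_left(self.keys, q)]


class ECDFNode:
    """Main-branch node over points (x, y, w) that are sorted by x."""

    def __init__(self, pts):
        self.leaf = pts[0] if len(pts) == 1 else None
        if self.leaf is not None:
            return
        mid = len(pts) // 2
        left, right = pts[:mid], pts[mid:]
        self.split = right[0][0]                   # smallest x on the right
        # Border: y positions and weights of every point in the left subtree.
        self.border = Border1D((y, w) for _, y, w in left)
        self.left, self.right = ECDFNode(left), ECDFNode(right)

    def query(self, px, py):
        if self.leaf is not None:
            x, y, w = self.leaf
            return w if x < px and y < py else 0
        if px <= self.split:
            # No point on the right has x < px: only the left side matters.
            return self.left.query(px, py)
        # Every left point has x < px, so only y must be checked (border);
        # the right subtree is searched recursively.
        return self.border.query(py) + self.right.query(px, py)


def build_ecdf(points):
    """Build the static tree from a non-empty list of (x, y, w) tuples."""
    return ECDFNode(sorted(points))
```

Each query follows one root-to-leaf path in the main branch and performs one border lookup per node where it turns right, which is why the structure answers dominance-sums without touching most of the points.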
In the fields of GIS and spatial databases, one seeks disk-based and dynamically updateable index structures. One approach is to externalize and dynamize the ECDF-tree. To dynamize a static data structure, some standard techniques can be used (Chiang and Tamassia 1992), for example, global rebuilding (Overmars 1983) or the logarithmic method (Bentley and Saxe 1980). To externalize an internal memory data structure, a widely used method is to augment it with block-access capabilities (Vitter 2001). Unfortunately, this approach is either very expensive in query cost, or very expensive in index size and update cost.
Another approach to solving the spatial aggregation query is to index the data objects with a multidimensional access method like the R∗-tree (Beckmann et al. 1990). The R∗-tree (like the other variations of the R-tree) clusters nearby objects into the same disk page. An index entry is used to reference each disk page. Each index entry stores the minimum bounding rectangle (MBR) of the objects in the corresponding disk page. The index entries are then recursively clustered based on proximity as well. Such multidimensional access methods provide efficient range query performance in that subtrees whose MBRs do not intersect the query region can be pruned. The spatial aggregation query can be reduced to the range search: retrieve the objects in the query region and aggregate their weights on the fly. Unfortunately, when the query region is large, the query performance is poor, because every object in the region has to be retrieved.
An optimization proposed by Lazaridis and Mehrotra (2001) and Papadias et al. (2001) is to store, along with each index entry, the total weight of objects in the referenced subtree. The index is called the aggregate R-tree, or aR-tree in short. Such aggregate information can improve the aggregation query performance in that if the query region fully contains the MBR of some index entry, the total weight stored along with the index entry contributes to the answer, while the subtree itself does not need to be examined. However, even with this optimization, the query effort is still affected by the size of the query region.
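The pruning rule of the aggregate R-tree can be illustrated with a small sketch. It is a simplification under stated assumptions: all names are hypothetical, the tree is built by a naive static packing (sort, then group into fixed-size chunks) rather than by R∗-tree insertion, pages are plain in-memory objects, and rectangles are closed and given as (x1, y1, x2, y2).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (x1, y1, x2, y2), closed
Point = Tuple[float, float, float]         # (x, y, weight)


@dataclass
class ANode:
    mbr: Rect
    total: float                           # total weight in this subtree
    children: List["ANode"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)   # leaves only


def build(points: List[Point], cap: int = 2) -> ANode:
    """Naive bottom-up packing of a non-empty point list (illustration only)."""
    pts = sorted(points)
    nodes = []
    for i in range(0, len(pts), cap):
        chunk = pts[i:i + cap]
        xs, ys = [p[0] for p in chunk], [p[1] for p in chunk]
        nodes.append(ANode((min(xs), min(ys), max(xs), max(ys)),
                           sum(p[2] for p in chunk), points=chunk))
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), cap):
            g = nodes[i:i + cap]
            mbr = (min(n.mbr[0] for n in g), min(n.mbr[1] for n in g),
                   max(n.mbr[2] for n in g), max(n.mbr[3] for n in g))
            parents.append(ANode(mbr, sum(n.total for n in g), children=g))
        nodes = parents
    return nodes[0]


def contains(q: Rect, m: Rect) -> bool:
    return q[0] <= m[0] and q[1] <= m[1] and m[2] <= q[2] and m[3] <= q[3]


def intersects(q: Rect, m: Rect) -> bool:
    return q[0] <= m[2] and m[0] <= q[2] and q[1] <= m[3] and m[1] <= q[3]


def agg_query(node: ANode, q: Rect) -> float:
    if not intersects(q, node.mbr):
        return 0                           # prune: nothing inside q
    if contains(q, node.mbr):
        return node.total                  # use the stored aggregate
    if node.points:                        # partially covered leaf: scan it
        return sum(w for x, y, w in node.points
                   if q[0] <= x <= q[2] and q[1] <= y <= q[3])
    return sum(agg_query(c, q) for c in node.children)
```

Only entries whose MBR straddles the boundary of the query region are descended into, which is why the cost still grows with the size of the region, as the paragraph above notes.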
Scientific Fundamentals
This section presents a better index for the dominance-sum query (and in turn the spatial aggregation query) called the Box-Aggregation Tree, or BA-tree in short.
The BA-tree is an augmented k-d-B-tree (Robinson 1981). The k-d-B-tree is a disk-based index structure for multidimensional point objects. Unlike the R-tree, the k-d-B-tree indexes the whole space. Initially, when there are only a few objects, the k-d-B-tree uses a single disk page to store them. The page is responsible for the whole space in the sense that any new object, wherever it is located in space, should be inserted into this page. When the page overflows, it is split into two using a hyperplane perpendicular to a single dimension. For instance, order all objects based on dimension one and move the half of the objects with the larger dimension-one values to a new page. Each of these two disk pages is referenced by an index entry, which contains a box: the space the page is responsible for. The two index entries are stored in a newly created index page. As more splits occur, the index page contains more index entries.
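The leaf split described above can be sketched as follows. This is a hedged illustration, not Robinson's algorithm in full: the function name is hypothetical, boxes are assumed to be half-open and given as a pair (low corner, high corner), the cut is placed at the median coordinate, and the points are assumed to have distinct coordinates in the split dimension so that neither half is empty.

```python
def split_leaf(box, points, dim):
    """Split an overflowing leaf page along dimension `dim`.

    box    : (lo, hi) corner tuples of the half-open region [lo, hi)
    points : list of coordinate tuples stored in the page
    Returns two (box, points) pairs: the lower half and the upper half.
    """
    lo, hi = box
    pts = sorted(points, key=lambda p: p[dim])
    cut = pts[len(pts) // 2][dim]          # position of the splitting hyperplane
    lower = [p for p in pts if p[dim] < cut]
    upper = [p for p in pts if p[dim] >= cut]
    # The two boxes partition the original box at the cut position.
    lower_box = (lo, tuple(cut if i == dim else hi[i] for i in range(len(hi))))
    upper_box = (tuple(cut if i == dim else lo[i] for i in range(len(lo))), hi)
    return (lower_box, lower), (upper_box, upper)
```

The two returned boxes together cover exactly the original box, so the resulting index entries still account for the whole space the page was responsible for.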
Fig. 1 The BA-tree is a k-d-B-tree with augmented border information. (a) Points affecting the subtotal of F. (b) Points affecting the x-border of F. (c) Points affecting the y-border of F
As done in the ECDF-tree, each index record in the k-d-B-tree can be augmented with some border information. The goal is that a dominance-sum query can be answered by following a single subtree (in the main branch). Suppose that, in Fig. 1a, there is a query point contained in the box of record F. The points that may affect the dominance-sum query of a query point in F.box are those dominated by the upper-right point of F.box. Such points fall into four groups: (1) the points contained in F.box; (2) the points dominated by the low point of F (in the shadowed region of Fig. 1a); (3) the points below the lower edge of F.box (Fig. 1b); and (4) the points to the left of the left edge of F.box (Fig. 1c).
To compute the dominance-sum for points in the first group, a recursive traversal of subtree(F) is performed. For points in the second group, in record F a single value (called subtotal) is kept, which is the total value of all these points. For computing the dominance-sum in the third group, an x-border is kept in F which contains the x positions and values of all these points. This dominance-sum is then reduced to a 1D dominance-sum query for the border. It is then sufficient to maintain these x positions in a 1D BA-tree. Similarly, for the points in the fourth group, a y-border is kept which is a 1D BA-tree for the y positions of the group’s points.
To summarize, the 2D BA-tree is a k-d-B-tree where each index record is augmented with a single value, subtotal, and two 1D BA-trees called x-border and y-border, respectively. The computation for a dominance-sum query at point p starts at the root page R. If R is a leaf page, the values of the points stored in R that are dominated by p are summed directly. If R is an index node, the query locates the record r in R whose box contains p. A 1D dominance-sum query is performed on the x-border of r with respect to p.x, and another on the y-border of r with respect to p.y. A 2D dominance-sum query is performed recursively on page(r.child). The final query result is the sum of these three query results plus r.subtotal.
The insertion of a point p with value v starts at the root R. For each record r where r.lowpoint dominates p, v is added to r.subtotal. For each record r where p is below the x-border of r, position p.x and value v are added to the x-border. For each record r where p is to the left of the y-border of r, position p.y and value v are added to the y-border. Finally, for the record r whose box contains p, p and v are inserted into subtree(r.child). When the insertion reaches a leaf page L, a leaf record that contains point p and value v is stored in L.
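The query and insertion rules can be exercised on a deliberately simplified 2D BA-tree. The sketch below rests on several assumptions that are not part of the original design: there is a single index page over a fixed grid of leaf pages (no page splits and no deeper recursion), boxes are half-open, each border is a plain list scanned linearly instead of a 1D BA-tree, and all class and method names are hypothetical.

```python
class Border:
    """Stand-in for a 1D BA-tree: (position, value) pairs, linear scan."""

    def __init__(self):
        self.items = []

    def insert(self, pos, v):
        self.items.append((pos, v))

    def dominance_sum(self, q):
        return sum(v for pos, v in self.items if pos < q)


class Record:
    """Index record for the box [lo.x, hi.x) x [lo.y, hi.y)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.subtotal = 0          # points dominated by the low point
        self.x_border = Border()   # points below the lower edge of the box
        self.y_border = Border()   # points left of the left edge of the box
        self.child = []            # leaf page: (x, y, v) inside the box

    def contains(self, x, y):
        return (self.lo[0] <= x < self.hi[0]
                and self.lo[1] <= y < self.hi[1])


class BATree2D:
    """One index page whose records tile the plane along cut lists xs, ys."""

    def __init__(self, xs, ys):
        self.records = [Record((xs[i], ys[j]), (xs[i + 1], ys[j + 1]))
                        for i in range(len(xs) - 1)
                        for j in range(len(ys) - 1)]

    def insert(self, x, y, v):
        for r in self.records:
            (lx, ly), (hx, hy) = r.lo, r.hi
            if x < lx and y < ly:
                r.subtotal += v                    # group 2
            elif lx <= x < hx and y < ly:
                r.x_border.insert(x, v)            # group 3
            elif x < lx and ly <= y < hy:
                r.y_border.insert(y, v)            # group 4
            elif r.contains(x, y):
                r.child.append((x, y, v))          # group 1

    def dominance_sum(self, px, py):
        r = next(r for r in self.records if r.contains(px, py))
        inside = sum(v for x, y, v in r.child if x < px and y < py)
        return (r.subtotal + r.x_border.dominance_sum(px)
                + r.y_border.dominance_sum(py) + inside)
```

A tree over the whole plane can be created with, for example, `BATree2D([float("-inf"), 3, 6, float("inf")], [float("-inf"), 2, float("inf")])`. A query touches exactly one record, whereas an insertion may update several records; this is the trade-off the augmentation makes in favor of query cost.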
Since the BA-tree aims at storing only the aggregate information, not the objects themselves, there are cases where the points inserted are not actually stored in the index, thus saving storage space. For instance, if a point to be inserted falls on some border of an index record, there is no need to insert the point into the subtree at all. Instead, it is simply kept in the border that it falls on. If the point to be inserted falls on the low point of an internal node, there is not even a need to insert it in the border; rather, the subtotal value of the record is updated.
The BA-tree extends to higher dimensions in a straightforward manner: a d-dimensional BA-tree is a k-d-B-tree where each index record is augmented with one subtotal value and d borders, each of which is a (d-1)-dimensional BA-tree.
Key Applications
One key application of efficient algorithms for the spatial aggregation query is interactive GIS. Imagine a user interacting with such a system. She sees a map on the computer screen. Using the mouse, she can select a rectangular region on the map. The screen zooms in to the selected region. In addition, some statistics about the selected region, e.g., the total number of hotels, the total number of residents, and so on, can be quickly computed and displayed on the side.
Another key application is in data mining, in particular, to compute range sums over data cubes. Given a d-dimensional array A and a query range q, the range-sum query asks for the total value of all cells of A in range q. It is a crucial query for online analytical processing (OLAP). The best known in-memory solutions for data cube range-sum appear in Chung et al. (2001) and Geffner et al. (2000). When applied to this problem, the BA-tree differs from Geffner et al. (2000) in two ways. First, it is disk based, while Geffner et al. (2000) present a main-memory structure. Second, the BA-tree partitions the space based on the data distribution, while Geffner et al. (2000) partition it based on a uniform grid.
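To make the range-sum query concrete, the sketch below uses the classic static prefix-sum array for d = 2. It is not the method of Chung et al. (2001) or Geffner et al. (2000), and the function names are hypothetical. Each prefix entry is the dominance-sum at a grid position, so the range sum again follows from four corner lookups.

```python
def prefix_sums(A):
    """P[i][j] = sum of A[r][c] for all r < i and c < j."""
    rows, cols = len(A), len(A[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = A[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P


def cube_range_sum(P, r1, c1, r2, c2):
    """Sum of the cells in rows r1..r2 and columns c1..c2 (inclusive)."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
P = prefix_sums(A)
print(cube_range_sum(P, 1, 1, 2, 2))   # cells 5, 6, 8, 9
```

Queries take constant time, but updating a single cell may require rewriting a large part of P, which is the cost that dynamic structures aim to reduce.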
Future Directions
The update algorithm for the BA-tree is omitted here but can be found in Zhang et al. (2002). Also discussed in Zhang et al. (2002) are more general queries, such as spatial aggregation over objects with extent.
The BA-tree assumes that the query region is an axis-parallel box. One practical direction of extending the solution is to handle arbitrary query regions, in particular, polygonal query regions.
Cross-References
References
- Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R∗-tree: an efficient and robust access method for points and rectangles. In: SIGMOD, Atlantic City, pp 322–331
- Bentley JL (1980) Multidimensional divide-and-conquer. Commun ACM 23(4):214–229
- Bentley JL, Saxe JB (1980) Decomposable searching problems I: static-to-dynamic transformations. J Algorithms 1(4):301–358
- Chiang Y, Tamassia R (1992) Dynamic algorithms in computational geometry. Proc IEEE Spec Issue Comput Geom 80(9):1412–1434
- Chung C, Chun S, Lee J, Lee S (2001) Dynamic update cube for range-sum queries. In: VLDB, Roma, pp 521–530
- Geffner S, Agrawal D, El Abbadi A (2000) The dynamic data cube. In: EDBT, Konstanz, pp 237–253
- Lazaridis I, Mehrotra S (2001) Progressive approximate aggregate queries with a multi-resolution tree structure. In: SIGMOD, Santa Barbara, pp 401–412
- Overmars MH (1983) The design of dynamic data structures. Lecture notes in computer science, vol 156. Springer, Heidelberg
- Papadias D, Kalnis P, Zhang J, Tao Y (2001) Efficient OLAP operations in spatial data warehouses. In: Jensen CS, Schneider M, Seeger B, Tsotras VJ (eds) Advances in spatial and temporal databases, 7th international symposium, SSTD 2001, Redondo Beach, July 2001. Lecture notes in computer science, vol 2121. Springer, Heidelberg, pp 443–459
- Robinson J (1981) The K-D-B tree: a search structure for large multidimensional dynamic indexes. In: SIGMOD, Orlando, pp 10–18
- Vitter JS (2001) External memory algorithms and data structures. ACM Comput Surv 33(2):209–271
- Zhang D, Tsotras VJ, Gunopulos D (2002) Efficient aggregation over objects with extent. In: PODS, Madison, pp 121–132
