# Efficient MaxCount and threshold operators of moving objects

- 777 Downloads
- 12 Citations

## Abstract

Calculating operators of continuously moving objects presents some unique challenges, especially when the operators involve aggregation or the concept of congestion, which happens when the number of moving objects in a changing or dynamic query space exceeds some threshold value. This paper presents the following six *d*-dimensional moving object operators: (1) MaxCount (or MinCount), which finds the Maximum (or Minimum) number of moving objects simultaneously present in the dynamic query space at any time during the query time interval. (2) CountRange, which finds a count of point objects whose trajectories intersect the dynamic query space during the query time interval. (3) ThresholdRange, which finds the set of time intervals during which the dynamic query space is congested. (4) ThresholdSum, which finds the total length of all the time intervals during which the dynamic query space is congested. (5) ThresholdCount, which finds the number of disjoint time intervals during which the dynamic query space is congested. And (6) ThresholdAverage, which finds the average length of time of all the time intervals when the dynamic query space is congested. For these operators separate algorithms are given to find only estimate or only precise values. Experimental results from more than 7,500 queries indicate that the estimation algorithms produce fast, efficient results with error under 5%.

## Keywords

Moving point objects Spatiotemporal data Aggregation operator MaxCount ThresholdCount CountRange## Introduction

Safety can often be reduced to a problem of congestion. The safety of flight depends on separation of airplanes or more generally the maximum number of airplanes that a particular airspace can safely contain, and the maximum number of airplanes that air traffic controllers responsible for directing airplanes can safely track. When considering epidemics, the presence of a single animal with Bird Flue does not indicate the start of an epidemic. Instead the presence of a certain number of instances of the disease indicates a high risk of starting an epidemic, or actual epidemic conditions. Consequently, we see that congestion often links to safety and can predict high risk or even dangerous conditions.

Congestion is defined differently depending on the application. Hence it is necessary to provide moving object operators that take a threshold value as a parameter to define congestion.

In relational databases, Max, Min, Count, Sum and Average form the set of natural aggregation operators. Spatiotemporal databases containing continuously moving objects can not apply these operators in the same way. However, these operators may still function in interesting ways for moving objects. For example, one can ask how many moving point objects exist within a moving and changing (or *dynamic*) rectangular area *at a certain time*, or what is the maximum distance between two moving points *during a time interval*. Obviously, when we are interested in discrete time instances, then the moving point object database can be reduced to a relational database and the above queries can be expressed as simple Count or Max queries.

Moving object databases naturally suggest new aggregate operators that have no equivalents in relational databases. For example, one may ask what is the maximum number of moving-point objects that exist simultaneously within a dynamic rectangular area at any time during a time interval *T*? We call this the MaxCount operator. One also may ask during what time intervals in *T* does there exist more than *M* moving objects within a rectangular area? We call this the ThresholdRange operator. We show that a strong relationship exists between MaxCount and ThresholdRange, and we show that ThresholdRange forms the bases for a family of threshold operators that include: ThresholdCount, ThresholdSum, and ThresholdAverage. A related, though less complex, operator answers the question: what is the number of moving objects that exist within or intersect a dynamic rectangular area at any time instance during interval *T*. We call this the CountRange operator.

Throughout this paper we use the concept of a dynamic *query space*, which is a 3-dimensional space that may move and change size or shape over a continuous time interval *T*. In most papers, including ours, the query space is considered to be some cube, whose corner vertices are defined by two moving points. (We define query space more precisely later in Definition 26.) The moving object operators described in this paper are defined as follows:

### **Definition 1**

(MaxCount (MinCount)) Let *S* be a set of moving points. Given a dynamic query space *R* defined by two moving points *Q* _{1} and *Q* _{2} as the lower-left and upper-right corners of *R*, and a time interval *T*, the MaxCount (*Min-Count*) operator finds the time *t* _{ max ( min )} and maximum (or minimum) number of points *M* _{ max ( min )} in *S* that *R* can contain at any time instance within *T*.

### **Definition 2**

(CountRange) Let *S* be a set of moving points. Given a dynamic query space *R* defined by two moving points *Q* _{1} and *Q* _{2} as the lower-left and upper-right corners of *R* and a time interval *T*, the CountRange query returns the total number of points that intersect *R* in *T*.

### **Definition 3**

(ThresholdRange) Let *S* be a set of moving points. Given a dynamic query space *R* defined by two moving points *Q* _{1} and *Q* _{2} as the lower-left and upper-right corners of *R*, a time interval *T*, and a threshold value *M*, the ThresholdRange operator finds the set of time intervals *T* _{ M } where the count of objects in *R* is larger than *M*.

### **Definition 4**

(ThresholdCount) Let *M* and *T* _{ M } be as in ThresholdRange. Then ThresholdCount returns the number of time intervals in *T* _{ M }.

### **Definition 5**

*M*and

*T*

_{ M }be as in ThresholdRange. Then ThresholdSum returns the total length of time

*T*

_{ s }during which the count is above

*M*. That is, for each

*T*

_{ i }∈

*T*

_{ M }, ThresholdSum returns:

*T*

_{ i }| means the length of the interval

*T*

_{ i }.

### **Definition 6**

(ThresholdAverage) Let *M* and *T* _{ M } be as in ThresholdRange. Then ThresholdAverage returns the average length of the intervals in *T* _{ M }.

The following examples illustrate the use of the new moving object operators.

### Example 7

Airplanes are commonly modeled as linearly moving objects with preestablished flight plans. Suppose, at any time, at most a constant number *M* of airplanes is allowed to be in the O’Hare airspace to avoid congestion. Suppose also a new airplane requests approval of its flight plan for entering the O’Hare airspace between times *t* _{ a } and *t* _{ b }. The air traffic controllers can avoid congestion as follows. If after adding a new flight plan, the MaxCount between *t* _{ a } and *t* _{ b } is still less than *M*, then they can approve the flight. Otherwise, they need to find some alternative path, and check it again against the database.

Air traffic controllers try to direct airplanes as linearly moving objects for fuel efficiency, among other reasons. If they recognize a developing congestion too late, then they often must direct the airplane to fly in circles until the congestion has cleared. That solution wastes fuel. On the other hand, if they recognize the developing congestion early, then they can often simply tell the airplane to change its speed, which saves fuel. Therefore, it is important to identify congestions as early as possible. We may identify congestions by using a MaxCount query where a moving box around the airplane and a time interval [*t* _{ a },*t* _{ b }] define the query. If the MaxCount predicts congestion, then the airplane’s speed can be adjusted early in the flight.

### Example 8

Suppose we want to alert pilots if their current flight path takes them through at least one congested region.

A *Traffic Alert/Collision Avoidance Systems* is a system that provides similar functionality. TCASs only provide alerts for current congestion, not predictive congestion. Although TCASs were implemented in 1986, we continue to have mid-air collisions and near misses indicating that the system still needs improvement. ThresholdRange is a modification of MaxCount that returns all predicted time intervals on the flight path where the Count exceeds a given threshold. Hence using ThresholdRange we can alert a pilot of predicted congestions where more than *M* other airplanes will be within the space *B* around the airplane. Predicting and avoiding these areas can significantly reduce the chances of mid-air collisions.

### Example 9

Suppose we are especially concerned about a rush-hour period [*t* _{ a },*t* _{ b }] that is particularly stressful to air traffic controllers. Suppose controllers can direct at most *M* airplanes safely. We can determine the number of controllers needed during the rush-hour time by executing the CountRange query over the controlled airspace during the rush-hour and dividing by *M*. By ensuring that a sufficient number of controllers are present, safety is achieved and controllers are not over stressed.

The rest of this paper is organized as follows. Section 2 introduces novel data structures used to build buckets. These buckets can then be used in various indexing algorithms to fit the type of application used. Section 3 develops the MaxCount estimation algorithm using a running example. Section 4 develops algorithms for the CountRange and the threshold operators. Both CountRange and ThresholdRange are based on MaxCount. Based on ThresholdRange, algorithms are given for the other threshold operators ThresholdCount, ThresholdSum, and ThresholdAverage. Section 5 gives the experimental results of the implementation. Section 6 reviews the related work. Finally, Section 7 suggests some future work and conclusion.

## Bucket data structures

This section presents an updatable *skew-aware* bucket for indices that models the skewed point distributions in each bucket. The skew-aware technique allows the index structure to perform inserts, deletes, and updates in *fast constant time* using a HashTable to store the buckets. Many spatiotemporal applications, such as tracking clients on a wireless network, particularly need these fast updates and no other MaxCount presented prior to this can meet that requirement. Because the buckets are spatially defined, the bucketing technique also easily adapts to other spatial and spatiotemporal indices such as the R-tree [13]. Hence the technique performs well for applications where search operations or update operations occur more frequently by using an appropriate index.

Our algorithm uses a sweeping method to evaluate the threshold aggregation operators similar to previous approaches from Chen and Revesz [5], Revesz and Chen [25] and Anderson [1]. The algorithm differs in that the sweeping algorithm integrates a skew-aware density function over the spatial dimensions of the bucket to obtain the time dependent count function. The density function in the bucket increases accuracy over methods given in Anderson [1], Chen and Revesz [5] while maintaining the same number of buckets. This idea is a crucial improvement because we model the point distribution skew in a bucket, whereas previous methods adapted to skew by increasing the number of buckets or changing their shape and contents. We also present a precise algorithm for evaluating the threshold aggregation operators that requires no index and runs in *O*(*N*) + *O*(*n* log*n*) time and *O*(*n*) space where *N* is the number of points in the database and *n* is the value of a CountRange query using the same query space and time. Both the threshold aggregation algorithms and the skew-aware bucket data structure presented are implemented and analyzed in 3-dimensional space. We show that the approximation achieves good results while significantly reducing the running times.

Section 2.1 describes the problems related to creating buckets and a specific solution for creating 6-dimensional buckets for 3-dimensional linearly moving points. In all cases, we can extend our method to *d*-dimensions. Section 2.2 describes the method for inserting and deleting a point from a bucket and shows that updates take constant time. Section 2.3 applies two different data structures to contain the buckets suited for applications where either inserts and deletes or threshold operations dominate.

### Bucket data structure

### **Definition 10**

*hex representation*of

*Q*(

*t*) is the tuple (

*v*

_{ x },

*x*

_{0},

*v*

_{ y },

*y*

_{0},

*v*

_{ z },

*z*

_{0}).

If we divide the 6-dimensional space into axis-aligned hyper-rectangles, then each hyper-rectangle becomes a *bucket* *B* _{ i } that contains the moving points whose hex representations fall inside the hyper-rectangle. For each such bucket *B* _{ i }, we define the following.

### **Definition 11**

*B*

_{ i }, its dimensions can be described by inequalities of the form:

*L*indicate lower and the subscripts with

*U*indicate upper bound constants. Similarly to the hex representation of Definition 10, we denote the bucket lower bound as:

*B*

_{ i }.

### **Definition 12**

(Histogram) Given a bucket *B* _{ i } and a constant *s*, we build *histograms* *h* _{ i,1},...,*h* _{ i,6}. Each histogram *h* _{ i,j } is built using the following three steps: (1) Project the points in bucket *B* _{ i } onto the *j*th dimension. (2) Subdivide the projection into *s* equal-size intervals. (3) Record separately the number of points within each subdivision.

### Example 13

*B*

_{ i }has lower bound (5,5,5,5,5,5) and upper bound (10,10,10,10,10,10) and contains the points shown in Table 1. Also suppose the number of subdivisions is

*s*= 5.

The hex representation of a set of moving points

| | | | | |
---|---|---|---|---|---|

5.345 | 7.543 | 5.345 | 8.158 | 5.345 | 5.488 |

6.354 | 9.023 | 6.354 | 5.488 | 6.354 | 5.159 |

7.159 | 8.885 | 7.159 | 6.685 | 7.159 | 7.346 |

7.645 | 9.117 | 7.645 | 5.159 | 7.645 | 8.885 |

8.153 | 7.346 | 8.153 | 6.335 | 8.153 | 7.543 |

8.156 | 6.335 | 8.156 | 7.346 | 8.156 | 9.023 |

9.125 | 5.159 | 9.125 | 9.117 | 9.125 | 9.117 |

9.118 | 6.685 | 9.118 | 8.885 | 9.118 | 6.335 |

9.688 | 5.488 | 9.688 | 9.023 | 9.688 | 8.158 |

9.874 | 8.158 | 9.874 | 7.543 | 9.874 | 6.685 |

*h*

_{ i,1}. It is built as follows. The projection onto velocity

*v*

_{ x }is the first column of Table 1. Since in the

*v*

_{ x }dimension bucket

*B*

_{ i }ranges from 5 to 10, the five subdivisions are 5 <

*v*

_{ x }≤ 6, 6 <

*v*

_{ x }≤ 7, 7 <

*v*

_{ x }≤ 8, 8 <

*v*

_{ x }≤ 9, and 9 <

*v*

_{ x }≤ 10. The first bar of histogram

*h*

_{ i,1}rises to level 1 because the first subdivision 5 <

*v*

_{ x }≤ 6 contains one point. The second bar of histogram

*h*

_{ i,2}rises to level 1 because the second subdivision 6 <

*v*

_{ x }≤ 7 also contains one point. The other values of

*h*

_{ i,1}can be determined similarly.

Figure 1b shows *h* _{ i,2}. It is built using the projection onto position *x* _{0}, which is the second column of Table 1. Here we use the subdivision 5 < *x* _{0} ≤ 6,...,9 < *x* _{0} ≤ 10. In this case the first bar of histogram *h* _{ i,2} rises to level 2 because the subdivision 5 < *x* _{0} ≤ 6 contains two points. The other values of *h* _{ i,2} can be determined similarly.

Further, in this example *h* _{ i,1} = *h* _{ i,3} = *h* _{ i,5} because all the points listed in Table 1 have the same velocities in the *x*, *y*, and *z* directions. Also, *h* _{ i,2} = *h* _{ i,4} = *h* _{ i,6} because the columns for *x* _{2}, *x* _{4}, and *x* _{6} all have the same values but in different orders.

### **Definition 14**

*B*

_{ i }and axis

*j*, the

*axis trend function*

*f*

_{ i,j }is some polynomial function such that the following hold:

- 1.
*f*_{ i,j }≥ 0 over the*j*-dimension of*B*_{ i }. - 2.
*f*′_{ i,j }, the derivative*f*_{ i,j }, does not change sign over the*j*-dimension of*B*_{ i }.

*bucket trend function*

*f*

_{ i }for bucket

*B*

_{ i }is the product of the axis trend functions, that is:

Condition 1 ensures that the bucket trend function, built from the axis trend functions, does not contain a negative probability region. Condition 2 requires that the bucket density increase, decrease, or remain constant when considering any single axis.

### Example 15

*h*

_{ i,1}and

*h*

_{ i,2}in Example 13 can be represented as the following set of points:

*v*

_{ x }in

*h*

_{ i,1}and

*x*

_{0}in

*h*

_{ i,2}, and the second coordinate is

*y*, the number of points.

*h*

_{ i,1}is:

*h*

_{ i,2}is:

*h*

_{ i,1}the value min (

*y*(5) = 1,

*y*(10) = 4.3) = 4.3, and for

*h*

_{ i,2}the value min (

*y*(5) = 2,

*y*(10) = 2) = 2. Since in this case Condition 1 is satisfied by the least square lines, they do not need to be shifted upward. Condition 2 also is satisfied by line segments. Hence the least square lines can be taken as axis trend functions.

*h*

_{ i,1}=

*h*

_{ i,3}=

*h*

_{ i,5}and

*h*

_{ i,2}=

*h*

_{ i,4}=

*h*

_{ i,6}the other axis trend functions can be found similarly to those of

*h*

_{ i,1}and

*h*

_{ i,2}. Hence the trend function becomes:

**Note:** Normally the least square line does not have a negative value in the range of the histogram. If it is negative, it is an indication that the distribution of points in the bucket is unusual. In that case the bucket may be broken into smaller ones, or we may try to find another polynomial axis trend function.

The next lemma considers some properties of axis trend functions in general.

### **Lemma 16**

*Given a bucket B*

_{ i }

*with axis trend functions f*

_{ i,j }

*, let r*

_{1}

*and r*

_{2}

*be identically sized regions in bucket B*

_{ i }

*. If the density in B*

_{ i }

*along each axis monotonically increases from r*

_{1}

*to r*

_{2}

*, then the following holds:*

### Proof

*r*

_{1}to

*r*

_{2}translates into histograms that also increase from

*r*

_{1}in the direction of

*r*

_{2}along each axis. The translation from histograms to the axis trend functions gives the following conditions:

*x*

_{1,j }and

*x*

_{2,j }are the

*j*

^{ th }coordinates of the points in

*r*

_{1}and

*r*

_{2}respectively, and are located the same distance from the

*j*

^{ th }coordinates of the lower bounds of

*r*

_{1}and

*r*

_{2}respectively. Since this constraint holds for each

*j*and

*f*

_{ i,j }≥ 0 we have:

**Note:** Definition 14 and Lemma 16 allow other polynomial functions beside the least square line. In Example 15, we chose the least square line only as a simple example. As the experimental results in Section 5 show, the least square line is a good practical approximation. The approximation works well because the buckets have small sizes. Adjacent buckets can have very different trend functions. While the entire range of the space in which all the points lie would be poorly approximated by a single least square line, the series of least square lines associated with a sequence of small buckets in each spacial dimension serves as a good piecewise linear approximation.

In the following, we assume that the trend functions are polynomial functions derived from the product of linear functions, which are obtained by using the least squares method for each histogram as in Example 15.

### **Definition 17**

*n*be the number of points in the database,

*b*

_{ i }the number of points in bucket

*B*

_{ i }, and

*f*

_{ i }be given by Eq. 6. The

*normalized trend function*

*F*

_{ i }for bucket

*B*

_{ i }is:

*percentage of points*in bucket

*B*

_{ i }is:

### Example 18

*B*

_{ i }in Example 13 is only a part of an index that contains

*n*= 100,000 points. Then calculating

*F*

_{ i }from Eq. 13 requires integrating

*f*

_{ i }over the bucket dimensions where \(\int_{B_{i}}\equiv\int_{5}^{10}...\int_{5}^{10}\) and where

*dϕ*≡

*dv*

_{ x }

*dx*

_{0}

*dv*

_{ y }

*dy*

_{0}

*dv*

_{ z }

*dz*

_{0}rounded to the nearest integer gives:

*B*

_{ i }is

*b*

_{ i }= 10. Hence we have:

Using Definition 17, we can calculate the number of points in a bucket in *O*(1) time using the following lemma.

### **Lemma 19**

*Let B* _{ i } * be a bucket, n the total number of points in the database, and p be given by Definition *17*. Then np is the number of points in bucket B* _{ i } * and np is calculated in O*(1)* time.*

### Proof

*O*(1) time. □

Using the above definitions we can now define the bucket data structure used throughout the rest of this paper.

### **Definition 20**

(Skew Aware Buckets) A *skew aware bucket* is a hyper-rectangle with dimensions given by Definition 11 and that maintains histograms given by Definition 12, additional data for the least squares method, and the normalized trend function given by Definition 17.

Throughout the rest of this paper we refer to skew aware buckets as simply buckets.

### Inserts and deletes

We can maintain the bucket (and hence the index) while deleting or inserting a point for any bucket *B* _{ i } by recalculating the trend function *F* _{ i } for the bucket.

### **Lemma 21**

* Insertion and deletion of a moving point can be done in O*(1)* time.*

### Proof

*P*

_{ a }represented using the hex representation as (

*a*

_{0},

*a*

_{1},

*a*

_{2},

*a*

_{3},

*a*

_{4},

*a*

_{5}), let

*d*

_{ j }, for 0 ≤

*j*≤ 5 be the bucket width in the

*j*

^{ th }, and let

*s*be the number of subdivisions in each histogram. The concatenation of

*id*

_{0}, ...,

*id*

_{5}gives the

*ID*

_{ i }of bucket

*i*to insert (or delete)

*P*

_{ a }into where each

*id*

_{ l }and 0 ≤

*l*≤ 5 is defined by:

*ID*

_{ i }and retrieving bucket

*B*

_{ i }takes

*O*(1) time using a HashTable.

*hw*

_{ i,j }be the histogram-division width for the

*j*

^{ th }calculated as \(hw_{i,j} = \left\lceil \frac{d_j}{s} \right\rceil\). Then

*p*is projected onto each dimension to determine which division of the histogram to update. For the

*j*

^{ th }dimension the

*k*

^{ th }division of histogram

*h*

_{ i,j }is given as follows:

*h*

_{ i,j,k }be the histogram division to update for each histogram. Update

*h*

_{ i,j,k }and the sums \(\displaystyle{\sum}y_i\), and \(\displaystyle{\sum}x_i y_{i}\) from the normal equations in the least squares method.

*N*, \(\displaystyle{\sum}x_i\) and \(\displaystyle{\sum}x_{i}^{2}\) from the normal equations do not need updating since the number of histogram divisions

*s*is fixed within the database.

We can now recalculate each *f* _{ i,j } in constant time by solving the 2 ×3 matrix corresponding to the normal equations of the least squares method for each histogram. For each *f* _{ i,j } calculate the endpoints to determine the required shift amount (Definition 14, property 1) and calculate *f* _{ i } from Eq. 6. Now we calculate *F* _{ i } using Eq. 11. Each of these steps depends only on the dimension of the database. Hence for any fixed dimension we can rebuild the normalized trend function *F* _{ i } in *O*(1) time. □

### Index data structures

There is no need to create a bucket unless it contains at least one point. We consider two classes of data structures for organizing the buckets: HashTables and Trees.

For databases where inserts and deletes are the most common operation, the HashTable approach allows these operations to run in constant time. However, the MaxCount operation will require an enumeration of all the buckets and thus at least a running time of *O*(*B*). As long as the number of buckets is reasonable, this approach works well.

For databases where MaxCount is the most common operation, we may use an R-tree structure [3], [13] where the elements to be inserted are the buckets. This approach speeds up the MaxCount query to *O*(log|*B*| + *R*) where *R* is the number of buckets needed to calculate the query. The insert and delete costs for these R-trees are *O*(log|*B*|), because buckets do not overlap.

Since buckets do not change shape, the database is decomposable and allows each type of aggregation to be calculated from simultaneous executions on subspaces of the index space. We discuss the method and ramifications of this capability at the end of Section 3.4.

## The MaxCount operator

Section 3.1 reviews point domination in higher dimensions. Section 3.2 examines finding the percentage of points in a bucket that are in the query space as a function of time. Section 3.3 puts the two previous sections together to create an efficient approximate MaxCount algorithm for *d*-dimensions. Section 3.4 describes an inefficient exact MaxCount algorithm. The naive but exact MaxCount algorithm is useful as a comparison tool with the efficient but approximate MaxCount algorithm in Section 3.3. The comparison is presented later in Section 5.

### Review of point domination

### **Definition 22**

*Q*(

*t*) dominates

*P*(

*t*) at time

*t*if and only if the following condition holds:

*dominating lines*of

*Q*(

*t*) at time

*t*are the following:

For convenience in visualizing the hex representations, we introduce the following definition.

### **Definition 23**

(*x*-view, *y*-view and *z*-view) The projection of a 6-dimensional space of hex representations onto the first two dimensions is the *x-view*, onto the second two dimensions the *y-view*, and onto the third two dimensions the *z-view*.

It is now easy to prove the following lemma.

### **Lemma 24**

*Let * *Q*(*t*)* and * *P*(*t*)* be moving points, and let * *l* _{ x } *,l* _{ y } *, and * *l* _{ z } * be the dominance lines of * *Q*(*t*)* at time * *t* * as in Definition 22. Let * *x* *-view, * *y* *-view, and * *z* *-view be the projections of the hex representations of moving points as in Definition 23. Then * *Q*(*t*)* dominates * *P*(*t*)* at time * *t* * if and only if *(*x* _{1},*x* _{2})* lies below * *l* _{ x } * in the * *x* *-view, *(*x* _{3} *,x* _{4})* lies below * *l* _{ y } * in the * *y* *-view, and *(*x* _{5} *,x* _{6})* lies below * *l* _{ z } * in the * *z* *-view.*

### Proof

Clearly, this condition is satisfied if (*x* _{1},*x* _{2}) lies below *l* _{ x } in the *x*-view, (*x* _{3},*x* _{4}) lies below *l* _{ y } in the *y*-view, and (*x* _{5},*x* _{6}) lies below *l* _{ z } in the *z*-view. □

### Example 25

Suppose that in the hex-representation *Q*(*t*) = (4,6,5,6,9,10). Then *Q*(*t*) dominates which points *P*(*t*) = (*x* _{1},*x* _{2},*x* _{3},*x* _{4},*x* _{5},*x* _{6}) at time *t* = 1?

*t*= 1, the point

*Q*(

*t*) dominates those points

*P*(

*t*) which satisfy the above condition.

Unlike in Example 25, in many applications we are interested in two moving points *Q* _{1}(*t*) and *Q* _{2}(*t*), which define the query space as follows.

### **Definition 26**

*Q*

_{1}(

*t*) and

*Q*

_{2}(

*t*) with respecting dominating lines

*l*

_{ x1},

*l*

_{ y1},

*l*

_{ z1}, and

*l*

_{ x2},

*l*

_{ y2},

*l*

_{ z2}at time

*t*, the

*query space*within the 6-dimensional space of hex representations is a hyper-tunnel formed by the cross product of the areas between

*l*

_{ x1}and

*l*

_{ x2}in the

*x*-view, between

*l*

_{ y1}and

*l*

_{ y2}in the

*y*-view, and between

*l*

_{ z1}and

*l*

_{ z2}in the

*z*-view as shown in Fig. 2.

Two moving objects *Q* _{1}(*t*) and *Q* _{2}(*t*) define a query space at any time *t*. The query spaces are similar to each other. At any time *t* the projection of a query space onto each view is an area between two parallel lines with slopes − *t* and passing through points *Q* _{1}(*t*) and *Q* _{2}(*t*). Hence we can visualize the query space as it changes over time as a moving region sweeping through space as the slopes of the lines change with time.

Intuitively, when at time *t* a moving point *P*(*t*) is in the query space of moving points *Q* _{1}(*t*) and *Q* _{2}(*t*) as in Definition 26, then it dominates *Q* _{1}(*t*) and is dominated by *Q* _{2}(*t*) in the hex representation. Simultaneously, in the three dimensional space in which the objects move, *P*(*t*) is in the box whose corner vertices are *Q* _{1}(*t*) and *Q* _{2}(*t*). In the following sections, like many other authors, we use the hex representation as a useful dual or alternative representation that is more amenable to efficient indexing. However, the exact indexing method presented in this paper is novel.

### Approximating the number of points in a bucket

A basic query is to find the number of points that lie in the query space. To answer this basic query we need to look at each of the buckets separately. When the query space includes an entire bucket, then we need to count the every point in the bucket. That can be done using the normalized trend function of the bucket as defined in Definition 17. However, in many cases the query space includes only a part of a bucket. In that case we need to count only the points that are in the intersection of the bucket and the query space. In this section, we show how to do that using a *percentage function* that estimates the percentage of total points in the query space to be within the intersection of the bucket and the query space. This requires to extend Definition 17 as follows.

### **Definition 27**

*r*

_{1}is the region of the bucket in the query space and is above the boundary line, then the

*percentage function*is:

*r*

_{1}and

*r*

_{2}correspond to regions above

*Q*

_{1}and

*Q*

_{2}, respectively, then the

*percentage function*is:

The percentage function can be computed for each bucket as shown in the next lemma.

### **Lemma 28**

* For any time t and query space, the percentage function can be computed for each bucket that intersects the query space at time t.*

### Proof

For each case shown in Fig. 3, we describe the function that results from integration in one view. To extend the result to any number of views, we take the result from the last view and integrate it in the next view. If the region below the line were desired, \(p_{lower}=\frac{b_i}{n}-p\) gives the percentage of points below the line.

For cases (a)–(h) below, let *Q* = (*x* _{1,q },*x* _{2,q },...,*x* _{6,q }). For the *x*-view, let the lower left corner vertex be (*x* _{1,l },*x* _{2,l }) and the upper right corner vertex be (*x* _{1,u },*x* _{2,u }). In addition each line denoted *l* is given by *x* _{2} = − *t* (*x* _{1} − *x* _{ i,q }) + *x* _{ i + 1,q } and corresponds to a line shown in the corresponding case in Fig. 3.

### **Case (a)**

*l*crosses the bucket at

*x*

_{1,l }and

*x*

_{2,u }. The integral over the shaded region is:

*dx*

_{2}contains

*x*

_{1}. This dependence within each view does not affect the integration in the remaining four dimensions. The solution to Eq. 25 has the form:

### **Case (b)**

*l*crosses the bucket at

*x*

_{1,u }and

*x*

_{2,u }. The integral over the shaded region is:

### **Case (c)**

*l*crosses the bucket at

*x*

_{1,l }and

*x*

_{2,l }. The integral over the shaded region is:

### **Case (d)**

*l*crosses the bucket at

*x*

_{1,u }and

*x*

_{2,l }. The integral over the shaded region is:

### **Case (e)**

*l*crosses the bucket at

*x*

_{1,l }and

*x*

_{1,u }. The integral over the shaded region is:

*a*=

*b*= 0.

### **Case (f)**

*l*crosses the bucket at

*x*

_{1,l }and

*x*

_{1,u }. The integral over the shaded region is:

### **Case (g)**

*l*crosses the bucket at

*x*

_{1,l }and

*x*

_{1,u }. The integral over the shaded region is:

*d*=

*e*= 0.

### **Case (h)**

*l*crosses below all the corner vertices hence the integral of the function is:

*t*is of the form:

*t*≠ 0 when

*d*

_{ x },

*d*

_{ y },

*d*

_{ z },

*e*

_{ x },

*e*

_{ y },

*e*

_{ z }≠ 0. Finally, renaming variables gives the general form:

*t*≠ 0 when

*d*

_{ i }≠ 0 for 1 ≤

*i*≤ 6. Since Eq. 37 is closed under subtraction,

*p*from Eq. 24 will also have the same form. □

As in Lemma 19, we can still find the number of points in the bucket by multiplying the percentage function by *n*, the total number of points in the database.

The MaxCount operator is more complex than just the basic operator that finds the number of points in the query space at a specific time. By Definition 1, the MaxCount operator needs to return the maximum number of moving points that are in a query space *at any time instance t* _{ max} *during a time interval*, and it also needs to return the time *t* _{ max} when the maximum occurs.

Hence for a time interval, we need to consider what happens to the query space as time changes. The query space changes as the dominance lines through the query points change, keeping always a slope of − *t*. As the dominance lines change their clopes, the query space sweeps across space and through the buckets that partition the space. As the query space sweeps through a bucket *B* _{ i }, it may cross the bucket corner vertices. Each time such a cross happens, the case within the proof of Lemma 28 that applies to *B* _{ i } may change in one or more of the views. Hence handling large time intervals requires further case analysis. This motivates the following definitions.

### **Definition 29**

(Bucket Time-Interval) A *bucket time-interval* for bucket *B* _{ i } is a maximal time interval during which no vertex of bucket *B* _{ i } enters or leaves the query space.

### **Definition 30**

(Index Time-Interval) An *index time-interval* for a set of buckets \(\cal B\) of an index structure is a maximal time interval during which no vertex of *any* bucket in \(\cal B\) enters or leaves the query space.

We follow the convention of denoting bucket and index time-intervals as half-open intervals [*l*,*u*) where *l* is the lower bound and *u* is the upper bound time instance. This allows a partition of large time intervals.

### **Definition 31**

(Time-Partition Order) Let \(\cal B\) be the set of buckets in an index. Let (*t* ^{[},*t* ^{]}) be a time interval. The *Time-Partition Order* is the set of ordered time instances *t* _{1},*t* _{2},...,*t* _{ i },...,*t* _{ k } such that \(t_1=t^[\) and \(t_k=t^]\), and each [*t* _{ i },*t* _{ i + 1}) is an *index time-interval*.

Now we are ready to prove the following.

### **Lemma 32**

* For any time interval element of the Time-Partition Order, the * MaxCount * operator can be approximately computed in O*(1)* time.*

### Proof

*temporal dimension*by first taking the derivative:

*t*≠ 0.

Solving *p*′ = 0 requires finding the roots of this 12-degree polynomial, which is not possible using an exact method. Hence we use a common numerical method for an approximate solution. We look at the graph of *p*′. We check some constant *c* intervals of Eq. 38 for a change in sign. If there exists a sign change, use the bisection method to find the root. If two points lie within a small *ε* of 0, we perform a check for each of these intervals when no change of sign is found. If some roots exist, then we check them for maximal values along with the end points.

The percentage function *p* is calculated in *O*(1) time. Finding *p*′ = 0 also takes *O*(1) time. By placing a constant bound on the number of iterations in the bisection method, we bound by a constant the time required in the numerical section of the algorithm. Plugging in the solution found by the bisection method along with the endpoints also takes *O*(1) time. Hence, the running time to find the maximum within any time interval element of the time-partition order takes *O*(1) time. □

Lemma 32 shows that a constant time is enough to find an estimate for MaxCount. However, it is not clear from the lemma how close the estimate is to the actual number of points. It is an interesting open problem to prove a constraint on the approximation ratio (the estimated number divided by the actual number) or some statistical confidence value for the estimation. Instead of a mathematical proof, this paper investigates experimentally the accuracy of the estimation in Section 5.

Although Lemma 32 considers only one time interval element of the Time-Partition Order, it is easy to see that all the other elements can be found easily and handled similarly. At first we find the bucket time-intervals using the following lemma.

### **Lemma 33**

* Given B buckets, all the bucket time-intervals can be found in O*(*B*)* time.*

### Proof

The bucket time-intervals can be found by finding for each bucket the slopes of the lines through its corner vertices and the query points. Any slope − *t* means a crossing at time *t*. Since each bucket has only a constant number of corner vertices, and there are only two query points, finding the crossing times and bucket time-intervals requires only a constant number of calculations for each bucket. Since there are *B* buckets, this requires only *O*(*B*) total time. □

We present the rest of the MaxCount algorithm in Section 3.3.

### An efficient approximate MaxCount algorithm

The algorithm to compute MaxCount with each line labeled with its running time is given above. Line 01 initiates a set of bucket time-interval objects to be empty. Line 03 returns a list of ordered times when a line through *Q* _{1} or *Q* _{2} crosses a bucket corner vertex. As in the proof of Lemma 33 this can be found in *O*(1) time. Line 05 turns this list into a set of *TimeInterval* objects and adds them to the set of *TimeIntervals*. We list this “for each” loop as *O*(1) because it consists of a constant number of calculations bounded by the number of vertices in the bucket. Line 08 uses the linear time sorting algorithm BucketSort to sort the bucket time intervals. Line 09 creates the time-partition order and index bucket time intervals from the bucket time intervals in *O*(*B*). An additional pass adds the bucket time intervals to the appropriate index time-intervals in *O*(*B*). Lines 10-12 perform the MaxCount calculation discussed above.

In order to use the linear time BucketSort algorithm, we need the following definition and lemmas.

### **Definition 34**

*time intervals*

*A*and

*B*is:

The distribution of time interval objects created in Line 08 of the MaxCount algorithm may not be uniform across the query time interval *T* = [*t* ^{[},*t* ^{]}]. However, we can still prove the following.

### **Lemma 35**

*If the distribution of buckets is uniform, then the distribution of bucket time-interval objects can be uniformly distributed within the sorting buckets of the bucket sort.*

### Proof

*Q*with slopes

*s*

_{ i }= −

*t*

_{ i }and

*s*

_{ i + 1}= −

*t*

_{ i + 1}. Suppose \(\triangle t=1\) with

*t*

_{0}= 0 and

*t*

_{1}= 1, then the angle between the two lines is \(\triangle s=\frac{\pi}{4}\). The solid lines in Fig. 4 show that half of the bucket corner vertices are swept by the line sweeping through

*Q*between

*s*

_{0}= 0 and

*s*

_{1}= − 1. Consider a query time interval [0,10]. Half of the corner vertices, and thus half of the time intervals, are between time

*t*= 0 and

*t*= 1. Thus, we conclude that the time interval objects created by sweeping will not be uniformly distributed throughout the query time interval.

*Q*′ be the midpoint between

*Q*

_{1}and

*Q*

_{2}. Let

*S*= {

*t*

_{1},...

*t*

_{ k }} where \(t_1 = t^[\), \(t_k=t^]\) and

*t*

_{ i + 1}−

*t*

_{ i }=

*L*for some positive constant

*L*and 1 ≤

*i*≤

*k*− 1. Let

*D*

_{ B }be a bucket that contains the space in the 6-dimensional index. Model the normalized bucket function for

*D*

_{ B }as a constant

*F*= 1. Thus

*p*, the bucket probability, from Eq. 36 becomes the hyper-volume of the space swept by the line through

*Q*′. By Lemma 32, we can find the area for a specific time interval in

*S*in constant time. The percentage of sorting buckets,

*posb*

_{ i }, needed in any time interval

*T*

_{ i }= [

*t*

_{ i },

*t*

_{ i + 1}] ∈

*S*within the query time interval is given by:

*N*be the number of sorting buckets. The number of sorting buckets,

*nosb*

_{ i }, assigned to interval

*i*is:

*nosb*

_{ i }< 1 we can combine it with

*nosb*

_{ i + 1}. If the query time interval is very large, then we may need to include multiple time intervals from

*S*to get one sorting bucket. Thus, we create more sorting buckets (with smaller time intervals) in areas where the expected number of bucket time intervals is large. Conversely, we create fewer sorting buckets (with larger time intervals) in areas where the expected number of bucket time intervals is small. Hence we model each sorting bucket so that its time interval length directly relates to the percentage of bucket time intervals that are assigned to it. Thus, we conclude that we will uniformly distribute the time interval objects across all sorting buckets. □

### **Lemma 36**

* Insertion of any bucket time-interval object T* _{ O } * into the proper sorting bucket can be done in O*(1)* time.*

### Proof

*k*time intervals in Lemma 35. Call these

*sorting time interval objects*where each object contains: the lower bound

*l*, the upper bound

*u*, the number of sorting buckets assigned to this interval

*b*

_{ s }, the length of the time interval for the sorting bucket

*w*and an array

*B*

_{ p }containing pointers to these sorting buckets. Let

*A*be the array of sorting time interval objects, and

*L*be the length of each time interval where the time intervals are as in Lemma 35. Then, finding the correct sorting bucket for

*T*

_{ O }requires two calculations:

*T*

_{ O }can be inserted into the proper sorting bucket in

*O*(1) time. □

Now we can prove the following.

### **Theorem 37**

*The running time of the * *MaxCount* * algorithm is O*(*B*)* where B* * is the number of buckets.*

### Proof

Let *H* be the set of buckets where each bucket *B* _{ i } contains the normalized trend function *F* _{ i }. Let *Q* _{1} and *Q* _{2} be the query points and [*t* ^{[},*t* ^{]}] be the query time interval. (Lines 01-07): Calculating the time intervals takes *O*(*B*) time because the cross times for each bucket can be calculated in constant time. (Line 08): By Lemmas 35 and 36, we have an approximately even distribution of time interval objects within the sorting buckets where we can insert an object in constant time. This result fulfills the requirements of the BucketSort, [9], which allows the intervals to be sorted in *O*(*B*) time. (Lines 09-12): Calculate the MaxCount and time for each time interval in constant time using Lemma 32. These lines take *O*(*B*) time because there are *O*(*B*) time intervals. Finding the global MaxCount and time requires retaining the maximum time and count at line 11. Returning the MaxCount and time takes *O*(1) time. Thus, the running time is given by *O*(*B*) + *O*(*B*) + *O*(*B*) + *O*(1) = *O*(*B*). □

### An inefficient exact MaxCount algorithm

*N*is the number of moving points and

*n*is the number of moving points in the query space.

It is possible to slightly improve the algorithm below. First, divide the index space into *k* subspaces and maintain separate partial databases for each. Assign processes on individual systems to each database to calculate the MaxCount query and return the time intervals to a central process. Merging the time interval lists into a global time interval list saves time on the sorting part of the algorithm. The running time for each of *k* partial databases would be close to \(O(\frac{n}{k} \log \frac{n}{k})\). This result is an approximate value because we do not guarantee an even split between partial databases. Placing buckets for each partial database in a Tree structure may be reasonable and could cut down the average running time to *O*(log*N* + *n* log*n*/*k*).

## The CountRange and threshold operators

Next we describe the CountRange operator in Section 4.1 and the threshold operators ThresholdRange, ThresholdCount, ThresholdSum, and ThresholdAvg in Section 4.2.

### The CountRange operator

The CountRange algorithm is an adaptation of MaxCount in that it is the Count portion of the MaxCount query. Using the equations for the cases described in Fig. 3, we calculate the CountRange as follows.

*Q*

_{1}and

*Q*

_{2}at times

*t*

^{[}and

*t*

^{]}. Figures 5 and 6 show these four lines. Figure 5 shows the shaded area defined by:

*i*is equal to the count of the bucket, then bucket

*i*is completely contained in the query. If \(\triangle \overleftarrow{p}\) and \(\triangle \overrightarrow{p}\) for bucket

*i*are equal to 0, then bucket

*i*is not contained in the query. If neither of these is true, we approximate the count for bucket

*i*as the \(\max (\triangle \overleftarrow{p}, \triangle \overrightarrow{p})\). That is, we calculate the number of points in bucket

*i*that contribute to the CountRange as:

*Q*

_{1}and

*Q*

_{2}available and not discard them after finding \(\triangle p\) (see Eq. 24).

Hence, we have the following algorithm for CountRange:

### **Theorem 38**

* The * CountRange * query runs in O*(*B*)* time.*

### Proof

Consider two different data structures for our buckets: HashTables and R-trees. In the case of indexing using an R-tree, the worst case requires that we examine all buckets used in generating CountRange. It is possible that this list could include all *B* buckets giving a worst case of *O*(*B*). In the case of using a HashTable, we must examine all *B* buckets. By Lemma 19, and because Eqs. 37 and 50 are calculated in constant time, each bucket can be examined to determine the count that contributes to the CountRange query in constant time. Therefore, the algorithm runs in *O*(*B*) time. □

We note that CountRange is a simplification of the MaxCount operator in that we do not examine every time interval. Further we have a slightly different form of \(\triangle p\) from Eq. 24 to find the count.

### The threshold operators: range, count, sum and average

We implemented the ThresholdRange algorithm as shown below. The algorithm relates to MaxCount in the way we calculate the aggregation. We maintain a running count to find time intervals that exceed the threshold value *M*. If we set the threshold value near the MaxCount value (*M* → MaxCount), ThresholdRange finds a small interval containing the MaxCount. We demonstrate this in the experimental results, Section 5.

We can analyze the running time of ThresholdRange similarly to the running time of MaxCount and prove the following theorem.

### **Theorem 39**

* The estimated * *ThresholdRange* * query runs in O*(*B*)* time.*

### Proof

The ThresholdRange algorithm differs from the MaxCount algorithm only in lines 09-17. Lines 11-14 run in *O*(1) time. Line 10 executes lines 11-13 *O*(*B*) times. In line 16, \(\textsc{Merge}(TimeIntervals)\) is a linear walk of the time intervals that joins adjacent time intervals *T* _{ a } and *T* _{ b } when *T* _{ a } ∪ *T* _{ b } would form a continuous time interval. The calculation is trivially *O*(1) time for joining the adjacent intervals. Hence, we conclude by Theorem 37 that the ThresholdRange runs in *O*(*B*) time. □

Based on ThresholdRange, we give the following three operators:

ThresholdCount:

By adding a line between 14 and 15 in the ThresholdRange algorithm that counts the merged time intervals, we can return the count of time intervals during the query time interval where congestion occurs. This count of time intervals gives a measure of variation in congestion. That is, if we have lots of time intervals, we expect that we have a large number of pockets of congestion. Since ThresholdCount does not give information relative to the entire time interval, it may need to be examined in light of the total time above the threshold.

ThresholdSum:

By summing the times instead of using the ∪ operator in line 13 of the ThresholdRange algorithm, we can return the total congestion time during the query time interval. This total gives a measure of the severity of congestion that may be compared to the length of query time.

ThresholdAverage:

By adding a line between lines 14 and 15 in the ThresholdRange algorithm that finds average length of the merged time intervals, we can return the average length of time each congestion will last. This average gives a different measure of the severity of each congestion.

It is easy to see that and none of the changes to the ThresholdRange algorithm affects the running time. Hence the ThresholdCount, ThresholdSum, and ThresholdAverage algorithms run also in *O*(*B*) time.

## Experimental results

We collected data from over 7,500 queries that were selected from a set of randomly generated queries. The selection process weeded out most similar queries and kept a set that represents narrow queries, wide queries, near corner or edge queries, and queries outside the space contained in the database. Throughout our experiments, we did not see significant accuracy fluctuation due to any of these types of queries.

Each experimental run consists of running all of the queries at several different decreasing bucket sizes on a single data set. We made experimental runs against data sets ranging from 10,000 points to 1,500,000 points.^{1}

*a*,

*b*) returns the sum of the lengths of intervals in

*a*not covered by intervals in

*b*. We divide the result by the accurate ThresholdSum to determine the ThresholdRange error:

We performed all the data runs on a Athlon 2000 with 1 GB of RAM. During each of the queries the program does not contact the server tier and, thus, minimizes the impact of running a server on the same computer. The program pre-loads all data into data structures so that even the exact algorithms do not contact the server tier.

### Data generation

*i*

^{ th }point generated for the database is located near a randomly selected cluster at a distance between 0 and

*d*, where

*d*is proportional to

*i*. This method is similar to the Ziggurat [16] method of generating Gaussian (or normal) distributions used in the GSTD [33] and G-TERD [35] spatiotemporal data generators [18]. However, our method does not generate strictly Gaussian distributions since the distributions may stretch and compress along an axis. Our goal was to generate a cluster that represents a source location and velocity that has most elements starting near a center point and decreasing as one moves to a boundary for the cluster. This method models source regions where the objects all head about the same direction. A secondary goal was to make certain that clusters were random in size and shape. The program is also capable of approximating a Zipf distribution used in [6], [25], [31]. However, a single Zipf distribution does not test the adaptability of our algorithm well. I.e. our algorithm is capable of modeling a Zipf distribution and as such we could use a single bucket. Figure 7 shows a sample of a data set with points projected onto the three views. The clusters look even more random, because they can overlay one another. When one looks at these, they nearly resemble the lights of a city from the air.

Along with a single Zipf distribution, we also note that a randomly generated uniform-distribution is not a good distribution to use for these types of experiments. Uniform distributions do not test the ability of the algorithm to adapt. In fact from earlier experiments in [1] we have found that using such a distribution gives great (though meaningless) results. The problem resolves to a system capable (and willing to) model a uniform distribution finding a nearly perfect uniform distribution to model. Hence these results are neither realistic, nor meaningful.

### Parameter effects

The index space ranges from 0 to 100 in each dimension. The *number of points* in the different data sets ranges from 10,000 to 1,500,000. The following parameters were used in creating the index and finding the MaxCount.

### **Size of Buckets:**

The size of the buckets determines the number of possible buckets in the index. In the experiments, buckets divide the space up such that there are 5 to 20 divisions in each dimension.^{2} These divisions equate to bucket sizes ranging from 5 to 20 units wide in each dimension. Relative to our previous work [1], this algorithm puts much more space into each bucket creating bigger buckets.

### **Query Location:**

Locating the query near the lower or upper corners affects relative accuracy because the query returns very few points. Queries in this region are not interesting because they rarely involve many points and represent a query region that moves away from points in the database or barely moves at all. The small number of points returned indicates use of the exact algorithms.

### **Query Types:**

In [1], we considered queries with several different characteristics: dense, sparse, and Euclidean distance as it related to bucket size. By modeling the skew in buckets, we minimize the effect of these characteristics to the point that they did not impact the query error. Queries where the distance between the query points was small appeared to do as well as wider queries *providing they returned a reasonable number of points*. This result is a clear improvement over previous work that assumed uniform density within a bucket.

### **Cluster Points:**

Index space saturation determines the number of buckets necessary for the index. The number of cluster points does not appear to affect error as much as the space saturation. Further, we do not consider a larger number of cluster points reasonable since the index space approaches a uniform distribution as the number of cluster points increases. Gaps introduce difficult areas to model when they are not uniform. And once again we reiterate, uniform distributions are not useful. In our experiments cluster points number between 10 and 50.

### **Histogram Divisions:**

Increasing histogram divisions to *s* > 5 had no affect on the accuracy. This result is not unexpected because histograms are used to define a trend function relative to trend functions on other axes. Increasing the histogram divisions has a tendency to flatten the lines. However, normalization flattens the trend function while maintaining the relationships between trends and hence this behavior is easily explained. Thus, increasing histogram divisions only increases the running time without increasing accuracy.

### **Threshold Value:**

The threshold value determines the accuracy when set to low values compared to the number of points in the database. As expected, these extreme point values produce accurate estimations. High values also follow this trend.

### **Time Endpoints:**

When dealing with either small time end points or small buckets, the method is susceptible to rounding error. In particular, Eq. 37 contains both *t* ^{6} and \(\frac{1}{t^6}\) terms. For very small values, on the order of 1 ×10^{ − 54} for 64-bit doubles, these calculations are extremely sensitive and care must be given to guard against rounding error. Those errors showed in two ways. First, by a direct warning programmed into the solution, and second, by a series of fairly stable time values for the MaxCount followed by unstable variations when increasing the number of buckets. At some point, smaller bucket sizes increase the likelihood of errors in both time and count values. Also smaller buckets contain fewer points, which impacts the size of the constants in Eq. 37. Hence, as the bucket size becomes smaller in successive runs, the existence of instability in the time values after a series of stable values predicts that an accurate MaxCount may be found in the previous larger bucket size. *Throughout our experiments, this condition was an excellent predictor of an accurate * *MaxCount*.

^{12}buckets. Clearly this number is unrealistic for common moving object applications where we may be dealing with million(s) of objects. In practice the number of buckets needed to reach acceptable error levels was between 78,000 and 227,000 buckets. These numbers reflect the ability to reach error levels under 5% and were roughly related to the saturation of the space by the points. It should be clear that a higher saturation of the space by points would require a larger number of buckets. Figure 8 shows that we had a roughly linear increase in the number of buckets for an exponential increase in the space. This pleasant surprise indicates that for unsaturated data sets, the exponential explosion of space is manageable.

### Running time observations

A natural question is when to use the exact versus the estimated methods. In runs with a small number of points that need to be processed, the exact and estimation methods run about equally fast. However, when the result size reaches values greater than 40,000 (our experiments returned sets as large as 331,491), the estimation algorithms run up to 35 times faster than the exact algorithms. Further, we note that the error is less predictable at smaller results sizes. Hence for small databases or in queries that return small result sets, efficiency and accuracy both indicate using the exact method. However, for large data sets greater than or equal to 1 million points, the estimation method greatly out-performs the exact method.

### Operator observations

As expected, we noticed that each operator runs in about the same time as MaxCount. Only error values seemed to be different when studying different types of aggregation (e.g., when studying overlap error in ThresholdRange versus count error in MaxCount). Never-the-less, we have similarities between the results. Almost all the figures in this section look like a view of mountains from a valley. That is what we expected to see and the lower and flatter the terrain the better. Buckets increase from back to front and point set sizes increase from left to right.

### MaxCount

The exact MaxCount provided the values against which our estimation algorithm was tested for accuracy. Since the method does not rely on buckets, and has zero error, we note only that on queries with small result sizes, this method performs as well, or better than the estimation algorithm.

### CountRange

Count range also performs about the same speed as the threshold operators due to its similar implementation.

### ThresholdRange

*T*= 10. ThresholdRange error gives the percentage of the exact intervals not covered by the estimation value, and ThresholdRange excess error gives the percentage of the estimation not covering the exact. These figures show that our method acts conservatively in covering more than is needed. However, at larger point-set sizes, we still achieve under 5% error. Figure 12 shows 0% error caused by the point count staying above 10% in data sets containing more than 30,000 points. Figure 13 shows that we covered at least 10% more time in the query time interval than needed until we reach larger point sets. Still, we showed improvement with more buckets.

### ThresholdCount

This operator is the only operator that does not have relative error measurements. Instead we report the average number of intervals the estimation method differs from the exact method. As you can see, we differ by two from the correct number.

*T*= 10 where the errors are small. Figure 19 (

*T*= 1,000) looks much worse, but in reality we are still below two intervals off. We also note that the estimation may split or combine an interval incorrectly when the intervals are very close together without greatly affecting the error of other operators. Given this possibility, the results are excellent.

### ThresholdSum

*T*. As one can see in Fig. 20, at higher bucket counts we have excellent error rates at

*T*= 10. We didn’t always expect great results at this threshold level across all data sets, but ThresholdSum gives this result consistently all the way across.

### ThresholdAverage

*T*= 10. The Figure also shows that even though a few of the data sets tended to have good results at five and ten buckets, these results are not guaranteed in general. In Fig. 23, the error reaches a plateau below 5% with only small bumps in the data.

## Related work

Our research fits into the general area of moving object and spatio-temporal operators, which attracted a huge interest in recent years, shown by the recent textbooks [12], [23], [26], and [29]. However, our research is still unique in several respects. Section 6.1 discusses the background of the MaxCount operator, which was developed by our research group. Section 6.2 discusses previous work that is related to CountRange and the threshold operators.

### The uniqueness of the MaxCount operator

To our knowledge, the MaxCount operator is a unique idea of the second author of the present paper. The MaxCount operator was first described in [25] by the second author and Yi Chen, a graduate student at the University of Nebraska-Lincoln. Revesz and Chen [25] considered only 1-dimensional moving point objects, which were represented in a 2-dimensional static point in a dual space. Their algorithm generated for any *N* points in this dual space, an *O*(*N* ^{2}) size data structure that divided the dual space into *N* regions, such that in each region all the possible query points had the same MaxCount. Therefore the problem of MaxCount was reduced to the problem of point location in a subdivision of the plane. It was not a practical solution because of the quadratic size of the data structure. Moreover, the data structure was static, that is, it had to be regenerated every time a point was inserted into the database. Chen and Revesz [5] presented a second approach to the MaxCount problem. This approach also was limited to 1-dimensional moving objects, but it made a brake from an exact solution and was the first MaxCount algorithm to consider estimation. Chen and Revesz [5] assumed that the points within each bucket had a uniform distribution.

Later the first author, while a graduate student at the University of Nebraska—Lincoln, and the second author developed a new approach to the MaxCount operator. The distribution of the buckets was extended from uniform to the one defined by trend functions as in Definition 14. A preliminary result for 2-dimensional moving objects was presented in [1, 2]. That preliminary result did not consider higher dimensions, and it also had only a static index structure.

*d*-dimensional moving points, is the final and only journal article on MaxCount. The prior works on MaxCount were presented only in conferences. Table 2 summarizes the history of the MaxCount operator.

The MaxCount operator is a new moving object operator that cannot be reduced to other known operators. In particular, the Count and the Max operators have only a titular relationship to the MaxCount operator because one cannot use the Count and Max operators to implement the MaxCount operator with a continuously moving query space. Hence Table 2 does not discuss Count and Max as related work.

### The CountRange and the threshold operators

CountRange and Range operators summary

Max. dim. | Worst case time | Worst case space | Exact or est. | Reference |
---|---|---|---|---|

2 | \(O(N^{\frac{3}{4}+\epsilon}+k)\) | | Exact | Kollios et al. [15] |

2 | | | Exact | |

2 | | | Exact | Papadopoulos et al. [19] |

3 | | | Exact | Saltenis et al. [27] |

3 | | | Exact | Cai and Revesz [4] |

d | | | Exact | Porkaew et al. [22] |

d | \(O(B^{d-1} \log_{B}^{d} N)\) | \(O(\frac{N}{B}\log_{B}^{d-1} N)\) | Exact | Zhang et al. [36] |

1 | | | Est. | Revesz [24] |

2 | \(O(\log_B N + \frac{C}{B})\) | | Est. | Kollios et al. [15] |

2 | | | Est. | Choi and Chung [6] |

d | | | Est. | Tao, Sun and Papadias [31] |

d | \(O(\sqrt{N})\) | | Est. | Tao and Papadias [30] |

d | | | Est. | present paper |

The related works summarized in Table 3 use a variety of different techniques. Kollios et al. [15] uses a simplex range method. Papadopoulos et al. [19] uses an *R* ^{*}-tree model. Saltenis et al. [27] and [4] define time-parametric *R*-trees. Cai and Revesz [4] usually gives a tighter parametric bounding box than [27] gives for the same set of moving objects in each node of the time-parametric *R*-tree. Porkaew et al. [22] is another modification of the *R*-tree model, while Zhang et al. [36] uses an ECDF-B-tree. Revesz [24] also uses a modified ECDF-B-tree and estimates the points below a line in the dual space by a formula that reduces to a set of ECDF-B-tree queries. Choi and Chung [6] seems to be the first spatio-temporal selectivity estimation method with buckets. Tao and Papadias [30] give an MVRB-tree-based estimation algorithm.

As can be seen, all the above methods differ in technique from our particular approach to the CountRange and various threshold operators. Our approach is related to our MaxCount algorithm, which we defined and solved first, only later exploring the relationship between MaxCount and the other operators. Hence our CountRange and threshold algorithms are novel approaches to known problems. The new approaches yield algorithms that compare well to previous algorithms in the worst case running time and space requirements, as shown in Table 3.

There are some additional related works which are not shown in Table 3 because they use different assumptions than the above related works. These can be grouped into three groups are follows.

The first group of related works restrict the movement of moving objects in some way. Civilis et al. [7], [8] gave indexing methods that use networks, such as roads. Pfoser and Jensen [21] also considered spatiotemporal range queries in networks. Gupta et al. [11] gave a technique for answering spatiotemporal range queries on objects that move along curves in a planar graph. de Almeida and Güting [10] proposed the MON-tree to index moving objects in networks to find the spatiotemporal range and windows queries.

The second group of related works consider only predicting the positions of particular (non-linearly) moving objects at some future time. Mokhtar et al. [17] queries the past, present, and future positions of moving objects in constraint databases. Tayeb et al. [32] adapted the PMR-quadtree [28] for indexing moving objects to answer time-slice queries, which they called instantaneous queries, and infinitely repeated time-slice queries, called continuous queries. Search performance is similar to quadtrees and allows searches in *O*(log*N*) time. Recently, Pelanis et al. [20] proposed the *R* ^{ PPF }-tree that indexes past, present and predictive positions of moving points, and extends the previous work on TPR-trees [27] with a partial persistence framework. Related to these papers, Trajcevski et al. [34] present a method for generating pseudo-trajectories of moving objects about which only limited information is known.

Finally, the third group of related works do not allow the query to move or change shape over time. For example, Hadjieleftheriou et al. [14] works with this assumption to find an efficient estimation method for the areas where the density of objects is above a specific threshold during a specific time interval.

## Future work and conclusion

A practical future work may include implementing our algorithms in a grid computing environment to further decrease the running time. An interesting and deep theoretical open problem is to improve Lemma 32 by proving a constraint on the approximation ratio or some confidence value for the estimation.

We started this research with the MaxCount problem naturally surfacing while we looked at GIS applications regarding congestion, like the example problems of the introduction. Those and related problems will surely play an in increasingly greater role in GIS as mobile phones and wireless computers become even more ubiquous, and there will be a major need to monitor and service them efficiently.

We hope our work inspires further interdisciplinary research and an appreciation of various disciplines, new or old, in fashion or out of fashion, and with large government funding or without. In particular, our work relies more than usual in the indexing area on differential calculus. That is a cautionary tale in an era when computer science departments are dropping introductory calculus courses. Successful interdisciplinary research may involve more than expected.

## Footnotes

## Notes

## References

- 1.Anderson S (2006) Aggregation estimation for 2D moving points. In: Proceedings of the 13th international symposium on temporal representation and reasoning. IEEE Press, pp 137–144Google Scholar
- 2.Anderson S (2007) Software verification and spatiotemporal aggregation in constraint databases. Ph.D. thesis, University of Nebraska-Lincoln, USAGoogle Scholar
- 3.Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R
^{*}-tree: an efficient and robust access method for points and rectangles. In: Proceedings of ACM SIGMOD international conference on management of data. ACM Press, pp 322–331Google Scholar - 4.Cai M, Revesz P (2000) Parametric R-tree: an index structure for moving objects. In: Proceedings of the 10th COMAD international conference on management of data. McGraw-Hill, pp 57–64Google Scholar
- 5.Chen Y, Revesz P (2004) Max-Count aggregation estimation for moving points. In: Proceedings of the 11th international symposium on temporal representation and reasoning. IEEE Press, pp 103–108Google Scholar
- 6.Choi Y-J, Chung C-W (2002) Selectivity estimation for spatio-temporal queries to moving objects. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, pp 440–451Google Scholar
- 7.Civilis A, Jensen CS, Nenortaite J, Pakalnis S (2004) Efficient tracking of moving objects with precision guarantees. First annual international conference on mobile and ubiquitous systems: networking and services. pp 164–173Google Scholar
- 8.Civilis A, Jensen CS, Pakalnis S (2005) Techniques for efficient road-network-based tracking of moving objects. IEEE Trans Knowl Data Eng 17(5):698–712CrossRefGoogle Scholar
- 9.Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Massachusetts, USAGoogle Scholar
- 10.de Almeida VT, Güting RH (2005) Indexing the trajectories of moving objects in networks*. Geoinformatica 9:33–60, MON-treeCrossRefGoogle Scholar
- 11.Gupta S, Kopparty S, Ravishankar CV (2004) Roads, codes and spatiotemporal queries. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. pp 115–124Google Scholar
- 12.Guting RH, Schneider M (2005) Moving objects databases. Morgan Kaufmann, San Francisco, CA, USAGoogle Scholar
- 13.Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Yormark B (ed) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, pp 47–57Google Scholar
- 14.Hadjieleftheriou M, Kollios G, Gunopulos D, Tsotras VJ (2003) On-line discovery of dense areas in spatio-temporal databases. In: Advances in spatial and temporal databases, 8th international symposium. Springer, pp 306–324Google Scholar
- 15.Kollios G, Gunopulos D, Tsotras VJ (1999) On indexing mobile objects. In: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. pp 261–272Google Scholar
- 16.Marsaglia G, Tsang W (2000) The ziggurat method for generating random variables. J Stat Software 5(8):1–7Google Scholar
- 17.Mokhtar H, Su J, Ibarra O (2002) On moving object queries: (extended abstract). In: Proceedings of the 21st symposium on principles of database systems. pp 188–198Google Scholar
- 18.Nascimento M, Pfoser D, Theodoridis Y (2003) Synthetic and real spatiotemporal datasets. IEEE Data Eng Bull 26(2):26–32Google Scholar
- 19.Papadopoulos D, Kollios G, Gunopulos D, Tsotras VJ (2002) Indexing mobile objects on the plane. In: Proceedings of the international conference on database and expert systems applicationsGoogle Scholar
- 20.Pelanis M, Saltenis S, Jensen CS (2006) Indexing the past, present, and anticipated future positions of moving objects. ACM Trans Database Syst 31(1):255–298CrossRefGoogle Scholar
- 21.Pfoser D, Jensen CS (2005) Trajectory indexing using movement constraints*. GeoInformatica 9:93–115CrossRefGoogle Scholar
- 22.Porkaew K, Lazaridis I, Mehrotra S (2001) Querying mobile objects in spatio-temporal databases. In: Proceedings of symposium on spatial and temporal databases (SSTD). pp 59–78Google Scholar
- 23.Revesz P (2002) Introduction to constraint databases, Springer, New York, USA.Google Scholar
- 24.Revesz P (2005) Efficient rectangle indexing algorithms based on point dominance. In: Proc. of the 12th international symposium on temporal representation and reasoning. IEEE Press, pp 210–212Google Scholar
- 25.Revesz P, Chen Y (2003) Efficient aggregation over moving objects. In: Proceedings of the 10th international symposium on temporal representation and reasoning. IEEE Press, pp 118–127Google Scholar
- 26.Rigaux P, Scholl M, Voisard A (2001) Spatial databases: with applications to GIS. Morgan Kaufmann, San Francisco, CA, USAGoogle Scholar
- 27.Saltenis S, Jensen CS, Leutenegger ST, Lopez MA (2000) Indexing the positions of continuously moving objects. In: Proc. of the ACM SIGMOD international conference on management of data. ACM Press, pp 331-342Google Scholar
- 28.Samet H (1990) The design and analysis of spatial data structures. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USAGoogle Scholar
- 29.Samet H (2005) Foundations of multidimensional and metric data structures. Morgan Kaufmann, San Francisco, CA, USAGoogle Scholar
- 30.Tao Y, Papadias D (2005) Historical spatio-temporal aggregation. ACM Trans Inf Sys 23(1):61–102CrossRefGoogle Scholar
- 31.Tao Y, Sun J, Papadias D (2003) Selectivity estimation for predictive spatio-temporal queries. In: Proceedings of the 19th international conference on data engineering, pp 417–428Google Scholar
- 32.Tayeb J, Ulusoy Ö, Wolfson O (1998) A quadtree-based dynamic attribute indexing method. Comput J 41(3):185–200CrossRefGoogle Scholar
- 33.Theodoridis Y, Silva J, Nascimento M (1999) On the generation of spatiotemporal datasets. In: Proc. symposium on spatial databases. Springer, pp 147–164Google Scholar
- 34.Trajcevski G, Wolfson O, Hinrichs K, Chamberlain S (2004) Managing uncertainty in moving objects databases. ACM Trans Database Syst 29(3):463–507CrossRefGoogle Scholar
- 35.Tzouramanis, T, Vassilakopoulos, M, Manolopoulos, Y. (2002) On the generation of time-evolving regional data*. Geoinformatica 6(3):207–231CrossRefGoogle Scholar
- 36.Zhang D, Gunopulos D, Tsotras VJ, Seeger B (2003) Temporal and spatio-temporal aggregations over data streams using multiple time granularities. Inf Syst 28(1–2):61–84CrossRefGoogle Scholar