# UV-diagram: a voronoi diagram for uncertain spatial databases

- First Online:

- Received:
- Revised:
- Accepted:

DOI: 10.1007/s00778-012-0290-x

- Cite this article as:
- Xie, X., Cheng, R., Yiu, M.L. et al. The VLDB Journal (2013) 22: 319. doi:10.1007/s00778-012-0290-x

- 5 Citations
- 1.7k Downloads

## Abstract

The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the *Uncertain-Voronoi diagram* (or *UV-diagram*), which divides the data space into disjoint “UV-partitions”. Each UV-partition \(P\) is associated with a set \(S\) of objects, such that any point \(q\) located in \(P\) has the set \(S\) as its nearest neighbor with nonzero probabilities. The UV-diagram enables queries that return objects with nonzero chances of being the nearest neighbor (NN) of a given point \(q\). It supports “continuous nearest-neighbor search”, which refreshes the set of NN objects of \(q\), as the position of \(q\) changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of *UV-cells*. A UV-cell of an object \(o\) is the extent \(e\) for which \(o\) can be the nearest neighbor of any point \(q \in e\). We study how to speed up the derivation of UV-cells by considering its nearby objects. We also use the UV-cells to design the *UV-index*, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.

### Keywords

Voronoi diagramUncertain data Nearest-neighbor query## 1 Introduction

Is it possible to use the Voronoi diagram to perform nearest-neighbor search on objects whose values are imprecise? Data values can be uncertain for a variety of reasons. Consider a satellite image, which depicts geographical objects like airports, vehicles, and people. Using machine learning and human effort (e.g., community-based systems like Wikimapia), the location of each object on the image can be obtained. Due to the noisy transmission of satellite data, the quality of these images can be affected, and we may not be able to obtain very accurate locations. Moreover, if this location information is released to the public (e.g, for research purposes), it may need to be preprocessed for privacy purposes. In fact, recent proposals like [1, 2] have suggested to represent a user’s position as a larger region, in order to lower the likelihood that a user is identified at a particular site. Uncertainty is also inherent in biological data management. For example, microscopy images have been actively used to analyze the thickness of neuron layers in the retina, as well as the extent of the area of a cell. Due to factors like image resolution and measurement accuracy, it is hard to obtain exact values of the objects of interest [28, 29]. For this kind of data, techniques for evaluating range queries, nearest-neighbor queries, and joins have been developed. These queries return answers with probabilistic guarantees, which reflect the confidence of answers due to data uncertainty. For these applications, tools that resemble the Voronoi diagram can be potentially useful. Specifically, we would like to examine space-partitioning techniques for performing a *Probabilistic Nearest-Neighbor Query* (PNN). Given a query point \(q\), a PNN returns the IDs of objects with nonzero probabilities for being the closest to \(q\), as well as their probabilities. In the sequel, we denote the objects returned by the PNN as *answer objects*, and their probability values as *qualification probabilities*.

An uncertainty model that has been commonly used is to assume that an object \(O_i\) has an “uncertainty region” and a probability distribution function (pdf). This means that the precise position of \(O_i\) can only be located inside the (closed) region, with a pdf that describes the distribution of the object’s position within the region. The uncertainty region can have any shape, and the pdf is arbitrary (e.g., it can be a uniform distribution, Gaussian, or a histogram). Here, we assume that \(O_i\) has a two-dimensional circular uncertainty region. We will also explain how our solution can be extended to handle non-circular-shaped regions. We examine how the Voronoi diagram should be defined to support PNN execution. Specifically, we propose the *Uncertain-Voronoi* diagram (or *UV-diagram*), where the nearest-neighbor information of every point in the data space is recorded, based on the uncertain objects involved. The UV-diagram provides a basis for studying solutions that used the Voronoi diagram for point data. It could be interesting, for instance, to extend the solution of [44] to support uncertain data in broadcasting services. Figure 1b illustrates an example of the UV-diagram of seven uncertain objects, where the space is divided into disjoint regions called *UV-partitions*. Each UV-partition \(P\) is associated with a set \(S\) of one or more objects. For any point \(q\) located inside \(P\), \(S\) is the set of answer objects of \(q\) (i.e., each object in \(S\) has a nonzero probability for being the nearest neighbor of \(q\)). The highlighted regions contain points that have two or more nearest-neighbor objects. As an example, since \(q_1\) is inside the dashed region, \(O_4\) has a nonzero probability for being the nearest neighbor of \(q_1\); on the other hand, \(q_2\) is located inside the dotted region, and \(O_6\) and \(O_7\) are the answer objects for the PNN with \(q_2\) as the query point. Observe that the Voronoi diagram, which indexes on spatial points, is a special case of the UV-diagram, since a point can be viewed as an uncertainty region with a zero radius. Figure 1 compares the two diagrams.

The Voronoi diagram can also be used in other applications. For example, a *continuous nearest-neighbor query*, which constantly returns the nearest neighbor (e.g., gas station) of a moving point \(q\) (e.g., a vehicle), is a typical operation in location-based services [32, 43]. The Voronoi diagram supports this query; particularly, the Voronoi cell that contains the current location of \(q\) can be easily retrieved. We will illustrate how to use the UV-diagram to track the possible nearest neighbors of a moving point. Another use of the Voronoi diagram is to perform data analysis or observe interesting patterns of nearest-neighbor information. In [41], the Voronoi diagram is used to investigate the spreading pattern of bluetooth viruses among mobile users. We can also use UV-diagram to provide valuable information about these “nearest-neighbor patterns”. In Fig. 1b, if the dashed region is large, it indicates that \(O_4\) has high chance to be placed in different clusters (assuming that a nearest-neighbor clustering algorithm is used). Another interesting query is as follows: given a region \(R\), display all UV-partitions that intersect with \(R\), as well as the density of objects that can be the nearest neighbor in each UV-partition. Hence, a UV-diagram allows a user to visualize patterns about the nearest-neighbor information.

*Challenges of constructing UV-diagram*. Although the UV-diagram is useful, developing a UV-diagram is not simple. Notice that the UV-partitions are produced based on uncertainty regions, which may not be points. Unfortunately, efficient computational geometry methods for generating the Voronoi diagram (e.g., line-sweeping [19]) cannot be readily used for creating a UV-diagram, since these methods are primarily designed for spatial points, rather than uncertainty regions. Figure 2 depicts the space partition based on three uncertainty regions represented as circles. Each UV-partition (named \(R_i\), where \(i=1,\ldots ,7\)) is irregular in shape and contains different answer objects, listed on the side of the figure. In general, given a set of uncertain regions, an exponential number of UV-partitions can be created. For example, Fig. 2 shows that for three objects, there are seven UV-partitions, each of which contains one of \(2^3-1=7\) combinations of the three objects. Moreover, the number of edges of each UV-partition can also be exponentially large! This makes it computationally infeasible to generate and store these partitions. It is also difficult to find out which of these irregular UV-partitions contain a given query point. Indeed, our experimental results show that a brute-force approach of computing and indexing UV-partitions over 40,000 objects requires about 60 h. Therefore, a scalable method for constructing a UV-diagram is highly desirable.

*Our solutions*. Instead of computing UV-partitions, we have developed an alternative interpretation of the UV-diagram. For every object \(O_i\), we consider the extent \(a_i\) such that \(O_i\) can be the nearest neighbor of any point selected from \(a_i\). We call this extent the *UV-cell* of \(O_i\). We examine some basic properties of a UV-cell (e.g., its size and number of edges). We show how to represent a UV-cell as a set of objects, and develop novel methods to find this object set efficiently. For example, our *batch-construction* algorithm allows the UV-cells of objects that are physically close to each other to be swiftly obtained. We propose a polynomial time method for constructing an index for the UV-partitions, called the UV-index. We adopt an adaptive-grid indexing scheme, which has the advantage of adapting to different distributions of uncertain objects’ positions. Our experimental results show that on both synthetic and real dataset, this index can be constructed in a much shorter time. We also explain how to use the UV-index to support different applications (e.g., PNN and nearest-neighbor pattern queries).

Study the UV-diagram and its basic properties;

Propose efficient algorithms for obtaining a UV-cell;

Design the UV-index;

Use the UV-index to support different queries; and

Conduct experiments on real and synthetic datasets.

## 2 Related work

*Data uncertainty management*. Recently, researchers have proposed to consider uncertainty as a “first-class citizen” in a DBMS [13, 14, 18, 39]. Two models can be used to represent uncertain data: tuple- and attribute- uncertainty. For tuple uncertainty, each database tuple has a probability of being correct [18]. Here, we assume attribute uncertainty, which represents an attribute as a range of possible values and a probability distribution function (pdf) bounded in the range [39]. Common queries for attribute uncertainty include range queries [16], \(k\)-nearest-neighbors [28], skylines [25, 36], and top-\(k\) queries [20].

A few works have been proposed to evaluate PNN queries over attribute uncertainty. In [15], numerical integration techniques have been presented. *Probabilistic verifiers*, described in [13], can generate answer objects’ probability bounds without performing expensive integration operations. Another way to compute answer probabilities is based on sampling [24]. In this paper, we focus on the efficient retrieval of answer objects.

To our understanding, the only indexing method available for nearest-neighbor search over uncertain data is to use an index like the R-tree and the grid. The R-tree is a disk-based structure that uses the minimum bounding rectangles (MBRs in short) to cluster the uncertainty regions of the objects, and organizes MBRs in a hierarchical manner [6]. To evaluate PNN using the R-tree, a branch-and-prune strategy has been proposed in [15], where MBRs that may contain answer objects are traversed. However, this involves a lot of I/O cost in reading index nodes and leaf pages [13, 15]. Similar issues also occur with grids [31]. On the other hand, retrieving answer objects from the UV-diagram is essentially a point query search: given a point \(q\), find the objects associated with the UV-partition that contains \(q\). Hence, a UV-diagram can support more efficient PNN search. While it is not clear how an R-tree or grid over uncertain objects can provide pattern analysis of nearest-neighbor information (e.g., displaying the extent of a UV-partition), we will show how to use the UV-diagram to provide this information.

Other types of nearest-neighbor queries, like the “group nearest-neighbors” [26], “reverse-nearest-neighbors” [10, 27], “uncertain queries” [8], and “continuous nearest-neighbor queries” [12] have also been proposed. In these works, the R-tree was used to support object retrieval. It is interesting to study how the UV-diagram can be used to support the execution of these queries. In this paper, we study how to use the UV-diagram to support the execution of continuous nearest-neighbor queries.

*The Voronoi diagram* is an important technique for answering nearest-neighbor queries over spatial points [33]. It has been extended to support other applications (e.g., [7, 32, 43–45]). It also facilitates the analysis of spreading patterns of mobile viruses [41]. In [9], the \(k\)-th order Voronoi diagram is used to evaluate a \(k\)-NN query. In [38], an index called VoR-Tree is designed to merge Voronoi diagrams into R-tree in order to answer various nearest-neighbor queries. The Voronoi diagram has also been defined for boundaries of circular objects in [23]. However, these objects are *not* uncertain, and the method of [23] cannot be used to answer PNN queries.

Few works have studied the application of the Voronoi diagram on uncertain data. [8] Consider the “uncertain” nearest neighbor query (UNN) over spatial points. Different from PNN, the query is an uncertain region, not a query point. To evaluate a UNN, the authors propose to use a Voronoi diagram over 2D points. The portions of the Voronoi cells that overlap with the query’s uncertainty region are then used to compute answer probabilities. [22] Consider the clustering of uncertain attribute data, where a Voronoi diagram is constructed for centroid points. Notice that [8, 22] do not construct a Voronoi diagram for uncertain data. On the other hand, the UV-diagram is a Voronoi diagram tailored for attribute uncertainty.

In [21, 37], the Voronoi diagram was modified to identify an imprecise object that is surely the nearest object of a query point \(q\). However, the UV-diagram returns object(s) that *may* have chance to be the nearest neighbor of \(q\), and can be used to answer probabilistic nearest-neighbor queries. We also study a database index for the UV-diagram, which has not been examined in these two works.

This paper is an extension of [17]. Here, we improve the performance of UV-index construction by proposing *batch pruning*, which reduces the workload of generating UV-cells for a set of nearby objects. We provide a more detailed study of the basic properties of a UV-cell (e.g., its size and number of edges). We also examine how the UV-index can be used to answer PNN queries for a moving query point. We conduct new experiments to validate the effectiveness of these approaches.

## 3 The UV-diagram

We now present the basic notions of the UV-diagram.

We introduce the UV-cell, an alternative presentation of the UV-diagram, in Sect. 3.1. We then study some applications of the UV-diagram, in Sect. 3.2.

### 3.1 The UV-cell

As discussed before, the UV-diagram can be expensive to construct. We hence propose an alternative representation of the UV-diagram, by using *UV-cells*. We will later explain how the UV-cells facilitate efficient construction of the UV-diagram. Now, let \(O_1,O_2,\ldots ,O_n\) be the IDs of a set \(O\) of uncertain objects, and \(D\) be a two-dimensional space that contains these objects. For simplicity, we assume that \(D\) is a square. The UV-cell is then defined as follows:

**Definition 1**

A **UV-cell** of \(O_i\), denoted by \(U_i\), is a region in \(D\) such that \(O_i\) has a nonzero probability to be the nearest neighbor (NN) of a point \(q\), where \(q \in U_i\).

Figure 2 illustrates the UV-cells for \(O_1\), \(O_2\), and \(O_3\). The boundary of each UV-cell is labeled with the ID of the object. For example, the UV-cell of \(O_2\) is a region enclosed by solid-line segments.

The UV-cell can be used to recover the UV-partitions (i.e., disjoint regions of a UV-diagram). In fact, a UV-partition that contains \(q\) is the *intersection* of all UV-cells that contain \(q\). This is because the objects associated with these UV-cells have nonzero qualification probabilities for \(q\). For instance, in Fig. 2, the UV-cells of both \(O_1\) and \(O_3\) intersect at partitions \(R_5\) and \(R_7\). This means that when \(q\) is located at any of these partitions, both \(O_1\) and \(O_3\) are the answer objects. Since \(R_7\) is intersected by \(O_2\)’s UV-cell, \(O_2\) is also associated with \(R_7\). Therefore, a UV-diagram is the union of all objects’ UV-cells. Besides, the UV-cells of all objects can be used to output which object(s) is/are the nearest neighbor of \(q\) with nonzero probabilities.

Notations and meanings

Notation | Meaning |
---|---|

| |

\(D\) | Domain space (a square) |

\(O\) | A set of uncertain objects (\(O_1,O_2,\ldots ,O_n\)) |

\(MBC(O_{i})\) | Minimum bounding circle of object \(O\) |

\((c_i,r_i)\) | Center and radius of \(O_i\)’s uncertainty region |

\(q\) | Query point of a PNN |

\(\rho \) | Density of objects in \(D\) |

| |

\(\odot (c,r)\) | A circle centered at \(c\) with radius \(r\) |

\(dist(q,c_i)\) | Euclidean distance between \(q\) and \(c_i\) |

\(dist_{min}(q,O_i)\) | Minimum distance of \(O_i\) from \(q\) |

\(dist_{max}(q,O_i)\) | Maximum distance of \(O_i\) from \(q\) |

\(U_i\) | UV-cell of \(O_i\) |

\(P_i\) | Possible region of \(O_i\) |

\(E_i(j)\) | UV-edge of \(O_i\) w.r.t. \(O_j\) |

\(X_i(j)~~(~\overline{X_i(j)}~)\) | Outside (inside) region of \(O_i\) w.r.t. \(O_j\) |

\(F_i\) | r-Objects of \(O_i\), where \(F_i \subseteq O\) |

\(C_i\) | cr-Objects of \(O_i\), where \(C_i \subseteq O\) |

\(M\) | Maximum no. of non-leaf nodes |

\(s\) | Estimated size of a UV-cell |

\(T_\theta \) | Split threshold |

### 3.2 Applications of the UV-diagram

The UV-diagram supports a number of applications. Let us now explain how to use the UV-diagram to handle the following queries:

*1. The Probabilistic Nearest-neighbor (PNN) Query*. This query has been mentioned in Sect. 1. To evaluate a PNN for a given point \(q\), we can find out the UV-partition that contains \(q\). The set \(A\) of objects associated with this partition are those that can be the nearest neighbor of \(q\). Notice that the UV-partitions can be obtained by finding the union of all the UV-cells. For each object \(O_i \in A\), the probability that \(O_i\) is the closest to \(q\) can be efficiently evaluated by using solutions in [13, 15, 24].

*2. The Continuous PNN Search (CPNN)*, an extension of the PNN, is a query that resides in the processing server for an extensive period of time. Different from PNN, the position of a query point \(q\) changes with time [12]. The objective of the CPNN is to refresh the PNN answer, when the value of \(q\) changes. This query can be used in transportation services, where \(q\) can be a moving vehicle or person, and the data can be the geographical objects retrieved from satellite images. Assuming that \(q\) reports its position to the server periodically, the UV-diagram can conveniently support CPNN. Suppose that the server receives a new position of \(q\), say, \(q_1\). A simple solution is to issue a new PNN for \(q_1\). However, if \(q_1\) is located in the same UV-partition that contains the old position of \(q\), then it suffices to use the objects associated with that UV-partition to compute the query answer for \(q_1\). The cost of retrieving the UV-partition that contains \(q_1\) is thus saved.

*3. The UV-partition Query.* The UV-diagram can also be used to retrieve the distribution and pattern information about nearest neighbors, which can be useful for analysis purposes (e.g., [41]). One such “pattern-analysis” operation is the *UV-partition query*. Given a region \(R\), this query retrieves all UV-partitions inside \(R\) and the “density” of each partition \(R_j\) (which is equal to the number of objects associated with \(R_j\), divided by the area of \(R_j\)). This allows a user to examine the density distribution of the nearest neighbors in his/her interested area \(R\).

*4. The UV-cell Query.* This is another pattern analysis operation. Given an object \(O_{i}\), it returns the extent and the area of \(O_{i}\)’s UV-cell. The query user can then obtain the area of the region where \(O_{i}\) may be the nearest neighbor. This area can reflect the “influence” of \(O_{i}\) (in terms of the nearest neighbor information). The shape of the UV-cell can also be displayed on the user’s computer screen for further analysis.

Since the UV-diagram is expensive to construct, in Sect. 6, we revisit how the above queries can be implemented by the UV-index, which is an approximate version of the UV-diagram. We next address the UV-cell in detail.

## 4 More about UV-cells

We now investigate the UV-cell, which is important for constructing the UV-index, in more details. We first present a simple method for constructing a UV-cell in Sect. 4.1. We then examine the shape of a UV-cell in Sect. 4.2. The number of edges of a UV-cell, and its size, is studied in Sects. 4.3 and 4.4, respectively.

### 4.1 Constructing a UV-cell

*answer object*(i.e., an object with nonzero qualification probabilities). In Fig. 3, \(q_0\) is on the right of \(E_i(j)\), which is also closer to \(O_j\) than \(O_i\). Thus, \(dist_{max}(O_j,q_0) < dist_{min}(O_i,q_0)\). In other words, \(O_j\) is

*always*closer to \(q_0\) than \(O_i\), and \(O_i\) has no chance to be the nearest neighbor of \(q_0\). As another example, \(q_1\) is on the left of \(E_i(j)\). Since \(dist_{min}(O_i,q_1) \le dist_{max}(O_j,q_1)\), \(O_i\) has a nonzero qualification probability. Hence, given \(E_i(j)\), if the query point is on the right of \(E_i(j)\), \(O_i\) can be pruned.

We can now present a simple method for constructing a UV-cell. Let us define the *outside region*:

**Definition 2**

The outside region of UV-edge \(E_i(j)\), denoted by \(X_i(j)\), is the region on one side of \(E_i(j)\) such that for any point \(q \in X_i(j)\), \(O_j\) is always closer to \(q\) than \(O_i\). We call the complement of \(X_i(j)\) the inside region, denoted by \(\overline{X_i(j)}\).

**Definition 3**

According to the definition, the possible region should be an area that completely covers the UV-cell of \(O_i\). An example of an object’s possible region is the domain \(D\), since \(D\) must cover any UV-cell. Here, \(R\) is the empty set. Notice that a UV-cell is also a possible region.

In Fig. 3, the outside region of the UV-edge \(E_i(j)\) is the area on the right of the solid line. Notice that since \(q_0\) is in the outside region of \(E_i(j)\), \(O_j\) is closer to \(q_0\) than \(O_i\), and thus \(O_i\) cannot be \(q_0\)’s nearest neighbor.

Case 1: \(s\) is inside \(P_i\): We refine \(P_i\) by using \(s\) as one of the new edges of \(P_i\). Some existing edges of \(P_i\) are removed if necessary.

Case 2: \(s\) is outside \(P_i\) (except the end points of \(s\)): \(P_i\) cannot be changed by \(s\), and we do not have to do anything to handle this case.

*Case 1*is satisfied, the existing edges between \(v_4\) and \(v_5\) (i.e., (\(v_4, v_1^{\prime }\)) and (\(v_1^{\prime },v_5\))) are replaced by segment (\(v_4\),\(v_5\)). On the other hand, (\(v_5,v_6\)) is outside \(P_i\), and so Case 2 is satisfied. There is no need to change any edges of \(P_i\) between \(v_5\) and \(v_6\). The process is repeated until all the segments are visited. As shown in Fig. 4b, vertices \(v_1^{\prime },v_2^{\prime }\), and \(v_3^{\prime }\) are removed from \(P_i\), while \(v_4,\ldots ,v_9\) are added to it.

Note that the order of selecting the object for refining \(O_i\)’s possible region (Steps 4–6) does not affect the correctness of the algorithm; the UV-cell is produced by “shrinking” the possible regions by using the outside regions of other objects. Also, not all objects are useful in shaping a UV-cell. In Sect. 5, we will explain how to prune away these objects.

### 4.2 The shape of a UV-cell

We now study a mathematical representation of the UV-cell. We also derive the number of UV-edges of a given UV-cell. Here, we assume that the uncertainty region of \(O_i\) is a circle, with center \(c_i\) and radius \(r_i\), with \(r_{i}>0\). Later, we discuss the UV-cell of a “point uncertainty” (i.e., \(r_{i}=0\)), and also uncertainty regions that are not circle in shape.

\(a=\frac{r_i + r_j}{2}\), \(c=\frac{dist(c_i,c_j)}{2}\), and \(b=\sqrt{c^2-a^2}\);

\(x_{\theta }=(x - f_x)\cos \theta + (y - f_y)\sin \theta \);

\(y_{\theta }=(f_x - x)\sin \theta + (y - f_y)\cos \theta \).

Equation 7 shows that a UV-cell is composed of the intersections of one or more UV-edges, which are hyperbolas. Since a hyperbola is a conic curve, an UV-edge must be *concave* in shape. In Fig. 2, apart from the edges of the domain space, the UV-cells of the three objects have concave edges. Note that Eq. 7 has two curves, which represent the UV-edges for each pair of objects involved. For example, in Fig. 3, the solid line is the UV-edge of \(O_i\) w.r.t. \(O_j\), and the dotted line is the UV-edge of \(O_j\) w.r.t. \(O_i\).

If two objects overlap, then \(dist(c_i,c_j) < r_i+r_j\), and in Eq. 7, \(b\) is not a real number. Physically, this means \(E_i(j)\) cannot be found, and we can treat \(X_i(j)\) as an empty region.

We now revisit Algorithm 1. Step 4 is done using Eq. 7. Step 5 is performed by observing that the outside region of a UV-edge must be convex in shape. To perform Step 6 (i.e., cutting the possible region by an outside region), we compute the intersections of hyperbola equations using linear algebra techniques [3], which are detailed in Appendix 9.

Let us state an interesting observation about a possible region, which we will use later.

**Lemma 1**

The possible region of an uncertain object is a connected region without any hole inside it.

The proof of this lemma, detailed in Appendix 10, shows that a contradiction will result if a possible region contains a hole. We next discuss the shape of the UV-cell for other kinds of uncertainty regions.

*(1) Point uncertainty.*Given two objects \(O_{i}\) and \(O_{j}\), suppose that at least one of them has no uncertainty, i.e., \(r_{i}\) or \(r_{j}\) is equal to zero. There are two scenarios:

If \(c_i\ne c_j\), without loss of generality, assume that \(r_{i}=0\). Then, \(E_i(j)\) can be obtained by Eq. 7, because all variables used in that equation are real numbers, and \(a,b\) are nonzero values. Notice that \(E_i(j)\) becomes a perpendicular line segment when \(r_i=r_j=0\).

If \(c_i=c_j\), then \(E_i(j)\) does not exist. If \(r_i\ne r_j\), Eq. 1 does not hold, and the UV-cell of \(O_{i}\), or \(U_{i}\), does not exist. If \(r_{i}=r_{j}\), Eq. 1 always holds, and \(U_{i}=D\).

*(2) Non-circular uncertainty regions.*To find the UV-cells for non-circular uncertainty regions, our first attempt is to derive the UV-edges for objects with rectangular uncertainty regions. As shown in Fig. 5, the UV-edge between objects \(O_{1}\) and \(O_{2}\) is a piecewise-quadratic line segments. This is too expensive to compute and store. Instead, for each object \(O_{i}\), we convert its non-circular uncertainty region to a circle, \(MBC(O_{i})\), which minimally contains it. Then, we use Algorithm 1 to construct the UV-cells for these circles. We claim that \(MBC(O_{i})\)’s UV-cell always covers that of \(O_{i}\).

### 4.3 The number of UV-edges of a UV-cell

Let us now examine the number of UV-edges of a UV-cell. As Algorithm 1 shows, for every object \(O_{i}\), its UV-edge with respect to other objects is used to refine its possible region \(P_{i}\) (Step 6). This requires computing the intersections of all edges of \(P_i\) with a new UV-edge \(E_i(j)\), for some object \(O_j\).

As shown in Fig. 4b, \(E_i(j)\) intersects with \(U_i\)’s UV-edge \((v_1^{\prime },v_2^{\prime })\) at \(v_5\) and \(v_6\). Thus, \((v_1^{\prime },v_2^{\prime })\) is replaced by three edges: \((v_4,v_5)\), \((v_5,v_6)\), and \((v_6, v_7)\).

From this example, we can see that \(E_i(j)\), a hyperbolic curve, can have at most 2 intersections with a UV-edge of \(P_i\); and 3 new edges can be created for \(P_i\) as a result. In the worst case, the number of edges of \(P_i\) increases by 3 times whenever a new edge is considered.

In general, to obtain \(U_{i}\), we have to take into account \(n-1\) objects. Hence, the number of edges of the UV-cell has an (exponential) upper bound of \(O(3^{n})\).

Moreover, computing intersections between hyperbolas is complex. In our implementation, 60 h are needed to create a UV-diagram of 40,000 objects by using Algorithm 1. We will explain how to find an efficient representation of the UV-cell, in Sect. 5.

### 4.4 The size of a UV-cell

*hexagonal lattice model*, where each object has six neighbors whose centers are equidistant from each other, with distance \(d_{0}\).

^{1}We assume that the uncertainty region sizes of all objects are the same, with a radius of \(r\). Figure 6 illustrates seven objects configured in this manner. Given an object \(O_{1}\), we assume that the UV-cell \(U_{1}\) of \(O_{1}\) is not trimmed by the boundary of the domain space. That is, its UV-cell is solely determined by the uncertainty regions of other objects. Our goal is to find the dimension \(s\) of a square that contains \(U_{1}\). This square should be a good approximation of \(U_{1}\).

^{2}

Let \(H(d)\) be a set of six objects whose uncertainty region’s centers have the same distance \(d\) from that of \(O_1\). For example, Fig. 6, the centers of the uncertainty regions of \(H(d)=\{O_{2},\ldots ,O_{7}\}\) are \(d\) units away from that of \(O_{1}\). We claim the following:

**Lemma 2**

In the sequel, we use \(s(d)\) to denote the size of \(P_{1,d}\). The proof of Lemma 2 can be found in Appendix 11. Notice that \(P_{1,d}\) contains \(U_{1}\).

**Theorem 1**

If \(d_{0} > 2\sqrt{3} r\), then the square that minimally contains \(U_{1}\) has a size of \(s(d_{0})\) obtained from Eq. 8.

The main idea of the proof is that when \(d_{0} > 2\sqrt{3} r\), the six objects that form \(HEX_{1}\) alone contribute to the edges of \(O_{1}\)’s UV-cell. Its details can be found in Appendix 12.

*An iterative approach of finding*\(d^{*}\). We now explain how to derive the size of a square that contains \(U_{1}\), for any value of \(d_{0}\). Our goal is to find \(d^*\) among different values of \(d\), such that the square covering \(P_{1,d^*}\) is the smallest. We observe from the first-order derivative of \(s\) from Eq. 8 that \(d^{+}=2\sqrt{3}r\) is the only inflexion point, such that \(s\) monotonously decreases when \(\frac{4r}{\sqrt{3}}<d<d^+\), and monotonously increases when \(d\ge d^+\). However, this result cannot be readily used, since we may not be able to find six neighbors of \(O_{1}\) that are exactly \(d^{+}\) units apart from each other. We thus estimate \(d^*\) as follows. We first consider the objects on \(HEX_{1}\), and compute \(s(d_{0})\). We then consider \(HEX_{2}\), where each vertex is \(\sqrt{3}d_{0}\) from \(c_{1}\), and evaluate \(s(\sqrt{3}d_{0})\). We repeat this process, until the six objects found are \(d_{x}\) units apart from each other, where (1) \(d_x > \frac{4r}{\sqrt{3}}\) and (2) \(s(d_x) < s(\sqrt{3}d_x)\). Then, we set \(d^*=d_{x}\), and use Lemma 2 to find \(s\).

The above process only examines the values of \(d\) at \(d_{0}, \sqrt{3}d_{0},(\sqrt{3})^{2}d_{0},\ldots , \sqrt{|D|}\). Hence, at most \(\Big \lceil log_{\sqrt{3}}(\sqrt{|D|}/d_0)\Big \rceil \) trials are needed to find \(d^{*}\). Although this procedure does not find the square that tightly contains a UV-cell, our experiments show that the approximation is highly accurate.

## 5 Efficient UV-cell generation

Since generating a UV-cell is inefficient, our strategy is to avoid computing it directly. Instead, we represent a UV-cell as a set of *candidate reference objects* (cr-objects), which can be efficiently derived. As will be discussed in Sect. 6, cr-objects can be used to develop an approximate representation of the UV-diagram. Section 5.1 outlines the algorithm of yielding cr-objects. We explain the preparation phase of this algorithm as well as two techniques for finding these objects quickly, in Sects. 5.2 and 5.3, respectively. Section 5.4 discusses how to derive cr-objects efficiently for a group of nearby objects.

### 5.1 Reference objects and candidate reference objects

Recall from Algorithm 1 that the UV-cell of an object \(O_i\), that is, \(U_i\), is the result of repeatedly subtracting the outside region of other objects (i.e., \(X_i(j)\)) from its possible region, \(P_i\). In fact, not all outside regions are useful for refining \(P_i\). In particular, if the UV-edge of \(O_i\) corresponding to \(O_j\), that is, \(E_i(j)\), does not intersect with \(P_i\), then \(P_i\) cannot be shrinked by \(X_i\)(j). We call an object \(O_j\) a *reference object* (or *r-object*) of \(O_i\), if \(O_j\) defines an edge of \(O_i\)’s UV-cell. We also denote \(F_i \subseteq O\) to be the set of r-objects of \(O_i\). The set \(F_i\) contains objects whose outside regions are responsible for defining the UV-cell of \(O_i\). In Fig. 2, for example, the set of r-objects of \(O_3\), that is, \(F_3\), is \(\{O_1,O_2\}\).

Given that the r-objects for each object are known, our solution (to be shown in Sect. 6) can use r-objects to develop an alternative representation of the UV-diagram. This solution is much cheaper than Algorithm 1, which requires exact UV-cells to be computed. However, finding \(F_i\) itself is difficult because we do not know the UV-cell of \(O_i\). Our strategy is to find a small set \(C_i\) of objects, where \(F_i \subseteq C_i\). We call \(C_i\) the *candidate reference objects* (or *cr-objects* in short). We next show how \(C_i\) can be derived without acquiring the exact UV-cell of \(O_i\). In Sect. 6, we study an indexing solution based on cr-objects.

### 5.2 Generating a possible region

In Step 1 of Algorithm 2, we retrieve a small number of objects, called *seeds*, from a set of objects \(S\). These seeds are used to generate an “initial” possible region, using a routine similar to Steps 3–7 of Algorithm 1. This region is used by other pruning methods to produce cr-objects.

*Step (i).*We issue a \(k\)-Nearest-Neighbor Query (\(k\)-NN) on \(S\), by using the center \(c_i\) of \(O_i\)’s uncertainty region as the query point. The \(k\) objects, which are not \(O_i\) and whose regions’ minimum distances from \(c_i\) are the shortest, are obtained. Since these objects are close to \(O_{i}\), we consider them to have a good chance for defining the UV-edges of \(U_{i}\). They are thus good candidates for being seeds. Note that if \(S=O\), then the R-tree on \(O\) can be used to support the \(k\)-NN search.*Step (ii).*Out of the \(k\) objects obtained from Step (i), we select \(k_{s}\) seeds. These objects are chosen in way such that they are evenly spread, in order to generate a “good” possible region. In particular, we divide the domain \(D\) into \(k_s\) equally sized sectors, centered at \(c_i\). For each sector, the object closest to \(c_i\) is a seed.

*Model-based seed selection.*We can use the results in Sect. 4.4 to estimate the value of \(k\) derived from

*Step (i)*. We assume that all the objects in domain \(D\) follow the hexagonal lattice model. First, we find the size \(s\) of the square that bounds the UV-cell of \(O_{i}\). Particularly, we check whether the condition for Theorem 1 is satisfied. If this is true, we let \(d_{min}=d_{0}\). Otherwise, we use the iterative approach, described in Sect. 4.4, to find \(d^{*}\), and let \(d_{min}=d^{*}\). Then, we find \(s(d_{min})\) by using Lemma 2. Figure 6 shows that the maximum distance of any point on the possible region \(P_{i}\) from the center \(c_{i}\) of \(O_{i}\)’s uncertainty region is \(\frac{s}{2}\). If we draw a circle of radius \((s-r)\), centered at \(c_{i}\), then Theorem 3 (to be discussed in Sect. 5.3) tells us that only objects located in this circle can be the reference objects \(F_{i}\). We can then estimate \(k\) as the expected number of objects in \(\odot (c_{i}, s-r)\):

### 5.3 I-Pruning and C-Pruning

Once the possible region has been initialized, we perform *I-pruning* and *C-pruning* (Steps 2 and 3 of Algorithm 2), in order to remove objects that cannot constitute a UV-edge to the UV-cell. Let us now examine these two steps in more details.

*Step 2: Index Level Pruning.* To understand this step, let us consider an object \(O_i\), its possible region \(P_i\), and another object \(O_j\), which has not yet been considered for refining \(P_i\). Our goal is to establish the necessary and sufficient condition(s) for \(O_j\) to have effect on the shape of \(P_i\). We first claim the following.

**Lemma 3**

\(P_i = P_i - X_i(j)\), if and only if for every point \(p\) inside \(P_i\), \(dist_{max}(p, O_j)> dist_{min}(p, O_i)\).

*Proof*

**(If)** For every \(p \in P_{i}\), \(p\) cannot be on \(X_{i}(j)\). If this is false, then \(O_{j}\) is always closer to \(p\) than \(O_i\), i.e., \(dist_{max}(p, O_j) \le dist_{min}(p, O_i)\) (Definition 2). This violates the condition that \(dist_{max}(p, O_j)> dist_{min}(p, O_i)\). Hence, \(p \notin X_{i}(j)\), and \(P_i = P_i - X_i(j)\).

**(Only if)** Suppose there exists a point \(p^{\prime }\) inside \(P_i\), such that \(dist_{max}(p^{\prime },O_j) \le dist_{min}(p^{\prime },O_i)\). Then \(O_j\) is always closer to \(p^{\prime }\) than \(O_i\), and \(O_i\) cannot be the nearest neighbor of \(p^{\prime }\). This implies that \(p^{\prime }\) must be excluded from \(P_i\) after \(O_j\) is considered. Hence, \(P_i\) cannot be equal to \(P_i-X_i(j)\), resulting in a contradiction.

Let \(b(P_i)\) be the boundary of \(P_i\). We have:

**Theorem 2**

\(P_i = P_i - X_i(j)\) if and only if for every point \(p \in b(P_i)\), \(dist_{max}(p, O_j) > dist_{min}(p, O_i)\).

*Proof*

**(If)**Let us first show that:

Hence, Eq. 10 is true. Using Lemma 3, we see that \(P_i = P_i - X_i(j)\), and the so the “if” part is correct.

**(Only if)** From Lemma 3, we know that for every point \(p \in P_{i}, dist_{max}(p, O_j)> dist_{min}(p, O_i)\). Since \(b(P_{i}) \subseteq P_{i}\), the “only if” part is correct.

Essentially, if we want to examine whether \(O_j\) has any effect on \(P_i\), it suffices to consider the points on \(P_i\)’s boundary, instead of all points in \(P_i\). We next present the following theorem, which forms the basis of I-pruning.

**Theorem 3**

Given an object \(O_i\) with center \(c_i\) and radius \(r_i\), let \(w\) be the maximum distance of \(P_i\) from \(c_i\). Let \(C_{out}\) be a circle, with center \(c_i\) and radius \(2w-r_i\). For another object \(O_j\), if \(c_j \notin C_\mathrm{out}\), then \(P_i = P_i - X_i(j)\).

*Proof*

The I-pruning method uses Theorem 3 by issuing a circular range query, centered at \(c_i\) with radius \(2w-r_i\), on the dataset \(O\). This operation can be easily implemented by using the R-tree created for \(O\). The range query first uses the R-tree to filter all objects that do not overlap with the range. For the remaining objects, they are removed if their centers are beyond the circular range. Hence, in this phase (Step 2 of Algorithm 2), a cost of \(O(n)\) is needed.

*Step 3: Computational Level Pruning.*

We next discuss a method, based on distance comparison, to check whether object \(O_j\) can affect the possible region of object \(O_i\). We call this method *C-pruning* (Step 3 of Algorithm 2). Theorem 4, discussed below, serves as the foundation of C-pruning.

**Theorem 4**

Given an uncertain object \(O_i(c_i,r_i)\) and \(P_i\)’s convex hull \(CH(P_i)\), let \(v_1, v_2, \ldots , v_n\) be \(CH(P_i)\)’s vertex. If another object \(O_j\)’s center \(c_j\) is not in any of

\(\{\odot (v_m, dist(v_m, c_i))\}_{m=1}^n\), then \(P_i=P_i-X_i(j)\).

*Proof*

For convenience, let \(\odot (p, dist(p, c_i))\) be a \(w\)-bound (where \(w=dist(p, c_i)\)). We also define a set \(S\) of \(w\)-bounds for every point \(p\) in \(U_i\). We now show that instead of checking all the \(w\)-bounds in \(S\), it is only necessary to check those \(w\)-bounds constructed for the vertices of \(CH(P_i)\). Specifically, the \(w\)-bounds of the vertices must contain all other \(w\)-bounds of all points on the boundary of \(CH(P_i)\). To see this, let \(w_k\) be the distance of vertex \(v_k\) from \(O_i\)’s center. We extend each vertex \(v_k\) by the distance \(w_k\) to obtain a new vertex \(v_j^{\prime }\) (black dot in Fig. 8b). These new vertices are connected to form a polygon. We use \(e_1\) and \(e_2\) to represent the \(w\)-bounds \(\odot (v_1, w_1)\) and \(\odot (v_2, w_2)\), respectively.

We next show that, for any point \(v^{\prime }\) on \(CH(P_i)\)’s edge \(v_1v_2\), \(\odot (v^{\prime }, dist(v^{\prime }, c_i)) \subseteq e_1 \cup e_2\), where we let \(e^{\prime }=\odot (v^{\prime }, dist(v^{\prime }, c_i))\). We draw a line \(c_1c_1^{\prime }\), which is perpendicular with \(v_1v_2\) and \(v_1^{\prime }v_2^{\prime }\), and intersects them at points \(c_1\) and \(c_1^{\prime }\), respectively. As \(v_1v_2\) is the perpendicular bisector of \(c_ic_1^{\prime }\), we see that \(c_ic_1^{\prime }\) is the common chord of \(e_1\), \(e_2\) and \(e^{\prime }\). Since \(e_1\) or \(e_2\) is bigger than \(e^{\prime }\), \(e^{\prime }\) is contained by \(e_1\) or \(e_2\).

Hence, to check whether \(O_j\) can refine \(P_i\), we just need to check the set of \(w\)-bounds \(S^{\prime }=\{\odot (v_m, dist(v_m, c_i))\}\) (where \(S^{\prime }\subseteq S\)). If \(c_j\) is located outside all \(w\)-bounds in \(S^{\prime }\), then \(CH(P_i)=CH(P_i)-X_i(j)\). Finally, since \(P_i\) is completely covered by \(CH(P_i)\), \(P_i=P_i-X_i(j)\) must also be true. This completes the proof.

Step 3 of Algorithm 2 uses Theorem 4 to prune unqualified objects returned by I-pruning. This can be done efficiently, because only the vertices of \(CH(P_i)\) are used. Moreover, \(|CH(P_i)|\) is small, since the possible region is only derived by a small number \(k_s\) of seeds. The complexity of this phase is \(O(n)\).

We consider the objects that remain after this step as cr-objects (i.e., \(C_i\)). The complexity of Algorithm 2, for generating \(C_i\)’s of \(n\) objects, is \(O(n(n+k))\).

### 5.4 Batch processing of cr-objects

To create the UV-index, we first find out the cr-objects for each of the \(n\) database objects. A simple way to do this is to run Algorithm 2 (i.e., getcrObject(\(O_i, O)\)) for all objects \(O_i \in O\), as proposed in [17]. However, this involves running getcrObject for \(n\) times and can be quite costly. We now present a *Batch Processing* algorithm (or BP in short), where the cr-objects of a group of objects are considered together. As we will show, this new algorithm allows the effort of devising an object’s cr-objects to be shared by others, and consequently reduces a lot of cr-object generation overhead.

Observe that if an object \(O_i\) is near to \(O_j\), then their UV-cells should be similar. The cr-object set of \(O_i\), that is, \(C_i\), can then be similar to \(C_j\). The BP makes use of this principle; it employs \(C_i\) to derive \(C_j\), instead of generating \(C_i\) and \(C_j\) independently. Let \(G\) be a set of objects that are physically close to each other. The BP first computes a set of objects \(C_G\), a superset of \(C_i\), for every \(O_i \in G\). The cr-objects of objects in \(G\) are then extracted from \(C_G\). Usually, \(C_G\) is smaller than the database size \(|O|\), and thus retrieving cr-objects from \(C_G\) is faster than from \(O\).

Step 2 invokes a slightly modified version of getcrObject to obtain a cr-object set \(C_G\) of \(O_G\). Particularly, in Step (i) of initPossibleRegion, the \(k\)-NN search skips all objects in \(G\). Notice that initPossibleRegion computes the possible region of an object. In Step (i) of that procedure, we obtain the seeds – objects that are useful for generating a UV-cell. In Algorithm 3, the input of getcrObject is \(O_G\), whose uncertainty region includes the uncertainty regions of all objects in \(G\). Therefore, the uncertainty region of any object \(O_i \in G\) overlaps with that of \(O_G\). More importantly, \(O_i \in G\) is not useful for finding possible regions of \(O_G\), because \(O_i\) does not create any UV-edge for \(O_G\)’s UV-cell. We next claim the following:

**Theorem 5**

Given an object \(O_j\), if \(O_j \notin C_G\) after Step 2 of Algorithm 3, then \(O_j \notin F_i\), where \(O_i \in G\).

That is to say, any object not contained in \(C_G\) cannot be an r-object of \(O_i \in G\). In other words, \(C_G\) is a superset of r-objects for all the objects in \(G\). The proof of this theorem, which is quite complex, is detailed in Sect. 5.5. Notice that all objects in \(G\) are included in \(C_G\) after the execution of Step 2. This is because in the last step of getcrObject (Algorithm 2), objects whose centers are located in the c-pruning bound of \(O_G\) are treated as cr-objects. Since the center of an object in \(G\) is inside \(O_G\)’s c-pruning bound, it must also be a cr-object of \(O_G\). Thus, \(G \subseteq C_G\).

Steps 3–6 use \(C_G\) to generate cr-objects for each object \(O_i \in G\). From Theorem 5, we know that an object in \(C_G\) may be an r-object of \(O_i\). Thus, objects in \(C_G\) can be considered as good candidates for generating an initial possible region, \(P_i\) for \(O_i\). We thus pass \(C_G\) to initPossibleRegion and get \(P_i\) (Step 4). We then execute cPrune on \(C_G\) to retrieve \(C_i\) (Step 5). These two steps are repeated for all objects in \(G\), until we obtain their cr-objects.^{3}

*The *LP* algorithm.* We now discuss a way to use Algorithm 3 to construct cr-objects for \(O\). The *Leaf-Node Processing*, or LP, performs a preorder traversal of the R-tree that indexes \(O\). When a leaf node, say \(N\), is reached, BP is invoked on all objects stored in \(N\), in order to compute their cr-objects. The algorithm terminates when all leaf nodes have been exhausted.

The LP can generate cr-objects for \(O\) quickly. This is because when the BP is called, it always uses the objects located in a leaf node. In an R-tree, the leaf node consists of a set \(G\) of objects, which are physically close to each other. Recall that the object created in Step 1 of BP (i.e., \(O_G\)) is the MBC of the uncertainty regions of objects in \(G\). Thus, the size of \(O_G\) would not be very different from those of the objects in \(G\). Consequently, the set \(C_G\) derived from Step 2 (getcrObject) should also be similar to the r-objects of \(G\)’s objects. In our experiments, \(|C_G|\) is much smaller than \(|O|\). Hence, Step 4 can be carried out more efficiently than if initPossibleRegion is carried out on \(O\) for every object.

We have introduced an efficient construction method to derive the cr-object set \(C_i\) for \(O_i\). We have also explained how to obtain cr-objects for \(O\) quickly. One may consider to use \(C_i\) to generate the exact UV-cell of \(O_i\). However, our experiments show that \(|C_i|\) may be large, and so generating the UV-cell of \(O_i\) can still be costly. In Sect. 6, we show how to use \(C_i\) directly to construct an index for the UV-diagram. In the rest of this section, we present the proof of Theorem 5.

### 5.5 Proof of Theorem 5

Recall that \(O_G\) is formed by a set \(G\) of objects, using Step 1 of Algorithm 3. Let \(P_i(S)\) be a possible region of an object \(O_i\), constructed by using a set \(S \in O\) of objects. Essentially, \(P_i(S)\) is the intersection of the inside regions \(\overline{X_i(k)}\), where \(O_k \in S\). Let \(u_i\) be the uncertainty region of \(O_i\). We first claim the following.

**Lemma 4**

Given a set \(S\) of objects, where \(S \subseteq O\), for any objects \(O_i\) and \(O_k\), if \(u_i \subseteq u_k\), then \(P_i(S) \subseteq P_k(S)\).

**Lemma 5**

**Lemma 6**

Given two objects \(O_i\) and \(O_k\), where \(u_i \subseteq u_k\), and an object \(O_j\) where \(c_j \notin P_k(S)\), if \(P_k(S) = P_k(S) - X_k(j)\), then \(P_i(S) = P_i(S) - X_i(j)\).

As shown in Fig. 9, Lemma 6 claims that given an object \(O_j\) whose center is outside \(P_k(S)\), if the edge \(E_k(j)\) does not affect the possible region \(P_k(S)\), then \(E_i(j)\) cannot contribute to \(P_i(S)\).

*Proof*

*Proof of Theorem 5*Let \(V\) be the set of seeds used to construct possible region \(P_G(V)\) in Step 2 of Algorithm 3. If \(O_j \notin C_G\), the UV-edge \(E_G(j)\) does not cut \(P_G(V)\). In other words, \(P_G(V) = P_G(V) - X_G(j)\). The I-pruning and C-pruning methods used in Step 2 also guarantee that \(c_j\) is not inside \(P_G(V)\), that is, \(c_j \notin P_G(V)\). Moreover, \(u_i \subseteq u_G\). By substituting \(S=V\) and \(k=G\), we can deduce from Lemma 6 that \(P_i(V) = P_i(V) - X_i(j)\). Now there are two cases to consider:

*Case 1:*\(O_j\)*contributes an edge to*\(P_i(V)\). In other words, \(O_j \in V\). Since an object in \(V\) is not pruned by Step 2 of Algorithm 3, \(V \subseteq C_G\), and so \(O_j \in C_G\). However, this contradicts with the assumption that \(O_j \notin C_G\), and so this case cannot occur.*Case 2:*\(O_j\)*does not contribute an edge to*\(P_i(V)\). Since the UV-cell \(U_i\) of \(O_i\) must be inside \(P_i(V)\), \(O_j\) cannot contribute an edge to \(U_i\). Hence, \(O_j\) is not an r-object of \(O_i\), and the theorem holds.

## 6 The UV-index

We now present the *UV-index*, an approximate version of the UV-diagram. The UV-index can be efficiently computed and stored. It also facilitates efficient query evaluation. Section 6.1 gives an overview of its structure. In Sect. 6.2, we discuss how to use this index to support execution of different queries. We explain its construction process in Sect. 6.3.

### 6.1 Structure of the UV-index

^{4}Each non-leaf node, 16 bytes each, records a pointer to each of its four child nodes, where the square region spanned by each child node is one-fourth of that of its parent. The region covered by the root node is the whole domain \(D\). Each leaf node stores all the objects whose UV-cells overlap with the region defined for the node. To save space, a node’s region is not stored, since we can easily derive the dimension of the region based on the level of the node in the tree. Also, due to approximation, a UV-cell that does not overlap with the leaf node’s region may be included. However, a UV-cell that truly overlaps with the region will not be excluded. For each leaf node \(l\), we store a linked list of disk pages, which contain tuples \(<\)

*ID, MBC, pointer*\(>\), where:

*ID*is the identity of object \(O_i\) whose UV-cell may overlap with the region covered by \(l\);*MBC*is the circle that minimally bounds the uncertainty region of \(O_i\); and*pointer*stores the disk page address of the object.

We assume that all non-leaf nodes are stored in the main memory, and allocate a maximum number of \(M\) non-leaf nodes. The leaf nodes, which contain the lists of pages, are stored in the disk. Hence, \(M\) controls the amount of main memory to be used to implement the index. Next, we study how to use it to support query evaluation.

### 6.2 Using the UV-index

We now explain how to use the UV-index to support the queries that we described in Sect. 3.2.

*1. The PNN Query.* To find the probabilistic nearest neighbors of \(q\), we first locate the leaf node \(l\), whose region contains \(q\). This can be done easily by finding the grid that contains \(q\) in each index level, and traversing the index. We then retrieve the disk pages associated with \(l\), which contains the *ID* and the *MBC* values of the objects stored in these pages. Since these objects may have their UV-cells overlap with the region of \(l\), it is possible that \(q\) is contained in their UV-cells. Let \(L\) be the set of objects associated with \(l\), and \(A\) be the answer objects of \(q\). To answer a PNN, we need to retrieve \(A\) from \(L\), where \(A \subseteq L\). We use the method described in [15]: from the set of the *MBC*’s of the objects in \(L\), find \(d_{minmax}\), the minimum of the maximum distances of these objects from \(q\). Any object with the minimum distance larger than \(d_{minmax}\) is removed, since it cannot have a nonzero qualification probability. For objects that are not filtered, their probabilities are computed and returned as answers.

*2. The CPNN Query* maintains the PNN answers for a “moving” query point, whose location is periodically reported to the server. Let \(q_0\) be the latest position of \(q\) received by the server. Let \(g_0\) be a leaf node in the UV-index, whose region \(r_0\) contains \(q_0\). We assume that the objects stored in the disk pages associated with \(g_0\) are known. Now, suppose the new location of \(q\), say, \(q_1\), is received by the server. A straightforward solution is to treat \(q_1\) as a new PNN query, and use the PNN algorithm described above to compute the answers of \(q_1\). A better way is to check whether \(q_1\) is inside \(r_0\). If this is true, we simply use the object set associated with \(g_0\) to compute the answer for \(q_1\). This saves the effort of traversing the UV-index for \(q_1\).

*3. The UV-Partition Query.* We append a counter to each leaf node, and record the number of objects at that node. This process could be done after the UV-index is constructed. Then, a range query with range \(R\) is issued over the index, in order to find the leaf nodes whose regions overlap with \(R\). For every leaf node whose region \(r\) overlaps with \(R\), we compute its density, which is equal to the number of objects associated with \(r\), divided by the area of \(r\). The query then outputs \(r\) and its density value.

*4. The UV-cell Query.* Notice that if an object \(O_i\) appears in a leaf node \(g\), its UV-cell overlaps with the region of \(g\). Hence, we can return the approximate area and the extent of \(O_i\)’s UV-cell by scanning the leaf nodes associated with \(O_i\), and then summing up the total area of the regions covered by these nodes. This step can be improved by precomputing and storing the area information. For example, we can scan all the leaf-nodes once, and generate a table for each \(O_i\) with its respective areas. A similar procedure can be used to support the operation of displaying the approximate shape of the UV-cell on the user’s screen.

### 6.3 Construction of the UV-index

As discussed in Sect. 5, a UV-cell can be represented by a set of cr-objects, \(C_i\). We now examine how this facilitates the construction of the UV-index.

*Framework.* Let \(g\) be the grid node being examined, and \(h_k\) (where \(k=1,\ldots ,4\)) be the four child nodes of \(g\). We define a variable *nonleafnum*, which indicates the number of non-leaf nodes allocated to the index and has an initial value of 1. Originally, the root of the grid is a leaf node, whose region covered (root.*region*) is the domain \(D\).

- 1.
NORMAL (Steps 9–11): \(g\)’s pages still have space left, and so (\(i\), \(MBC_i\), \(ptr(O_i)\)) is inserted to \(g\)’s page, where \(ptr(O_i)\) is the pointer to \(O_i\)’s uncertainty region and pdf.

- 2.
OVERFLOW (Steps 12–15): \(g\)’s pages are full, and a new disk page has to be associated with \(g\), before the information about \(O_i\) is inserted to the new page.

- 3.
SPLIT (Steps 16–22): \(g\)’s pages are full. The page list \(g\) is removed. Then, \(g\) becomes the parent of four nodes (\(h_k\)), which have been previously generated by CheckSplit. The region of each child node \(h_k\) covers each of the four quarters of the region defined for \(g\). Also,

*nonleafnum*is incremented by a value of 1. Essentially, The information about the UV-cells previously associated with \(g\) are now represented by its child nodes, and \(g\) becomes a non-leaf node.

*Decision on Splitting.*When \(g\)’s pages are full, either \(O_i\)’s information is inserted to a new page (OVERFLOW) or split into four child nodes (SPLIT). Ideally, the region of the leaf node that covers \(q\) is completely covered by a true UV-partition. This guarantees that the set of objects returned by the UV-index is the true answer objects. The UV-index, which contains grids, is just an approximation of the UV-diagram. Apparently, the more the splitting is performed, the closer the index can resemble the actual UV-diagram, and yield better query performance.

In fact, splitting is not always useful. Suppose that \(g\).*region* is associated with 100 UV-cells. Moreover, \(g\).*region* is *completely* covered by each of these UV-cells. Then, it is not necessary to redistribute \(g\) into four child nodes. If splitting is performed in this case, then the UV-cells associated with each child node are exactly the same.

*split fraction*, \(\theta \), as follows:

*region*is small compared with that of \(g\). We now define a splitting condition of a node:

*split threshold*. A larger value of \(T_\theta \) implies a higher tendency of splitting.

*region*(Steps 11 and 12). If the splitting condition is satisfied (Step 17), then the SPLIT decision is returned, where Algorithm 4 (Steps 18 and 19) will assign the nodes \(h_k\) to be the child nodes of \(g\). Otherwise, the child nodes are deleted and an OVERFLOW decision is made (Steps 20 and 21).

*Overlap Checking.*Algorithm 6 tests whether the UV-cell of an object \(O_i\) overlaps with a grid \(g\)’s region \(r\). For every cr-object \(O_k \in C_i\), if any of their corresponding outside region (\(X_i(k)\)) totally contains \(r\), then CheckOverlap returns false (Steps 1–3). Otherwise, true is returned (Step 6). To prove the correctness, we use the following lemma:

**Lemma 7**

If region \(r\) is totally covered by \(X_i(k)\), where \(O_k \in C_i\), then \(r\) must not overlap with the UV-cell \(U_i\).

*Proof*

To check whether a region \(r\) is contained in \(X_i(j)\) (Step 2), a simple way is to generate and test with the UV-edge \(E_i(j)\). This can be avoided, by carrying out a simple *4-point test*. Observe that \(r\) is a square, and the UV-edge of \(O_i\) with respect to \(O_j\) is concave in shape. If all its four corner points are confirmed to be in \(X_i(j)\), we conclude that \(r \subseteq X_i(j)\). Figure 11b shows that the region of \(g_1\) must not overlap with \(U_i\), since all the four corners of \(g\) are located on the outside region of one of the UV-edges. Checking whether a point is in \(X_i(j)\) is easy, because we can simply check whether the point’s minimum distance from \(O_i\) is larger than its maximum distance from \(O_j\). We thus use the four-point test in Step 2.

Notice that Algorithm 6 may incorrectly judge that \(U_i\) overlaps with \(r\). Figure 11(b) shows that \(U_i\) does not overlap with the region of grid \(g_2\). However, some corners of \(g_2\).*region* are not contained in the outside regions of two of the UV-edges of \(U_i\). If this is true for *all* UV-edges of \(U_i\), then \(U_i\) would be decided to be associated with \(g_2\)! If this happens, then during query evaluation, \(O_i\) will be retrieved from \(g_2\). This increases the execution time since \(O_i\) is not in \(g_2\). However, query accuracy is not affected, since we can still detect that \(O_i\) is not a nearest neighbor of \(q\). In our experiments, this situation is rare, and does not have a significant effect on query evaluation time.

Since \(|C_i|=O(n)\), Algorithm 6 needs \(O(n)\) time to complete. Algorithm 5 uses \(O(n^2)\) time, mainly for performing splitting and overlap checking with four child nodes. For Algorithm 4, each UV-cell, in the worst case, needs to perform overlap and split tests with \(M\) non-leaf nodes. Hence, its total time complexity is \(O(Mn^2)\). The index has a maximum height of \(M/4\), if, the data distribution is very skewed, and splitting always happens in one single quadrant. However, all non-leaf nodes, 16-byte long, can all be stored in the main memory. Thus, the tree height has little effect on query performance.

## 7 Results

We now report the results. Section 7.1 describes the experiment settings. In Sect. 7.2, we discuss the results about query performance. Section 7.4 presents the results about UV-index construction.

### 7.1 Setup

We use both synthetic and real datasets in our experiments. For synthetic data, we use Theodoridis et al’s data generator^{5} to obtain 20, 40, 60, 80, and 100K objects, which are uniformly distributed in a 10K \(\times \) 10K space. Each object has a circular uncertainty region with a diameter of 40 units, and a Gaussian uncertainty pdf. For each uncertainty pdf, its mean is the center of the circle, and its variance is the square of one-sixth of the uncertainty region’s diameter. We represent an uncertainty pdf as 16 histogram bars, where a histogram bar records the probability that the object is in that area. We also use three real datasets of geographical objects in Germany, namely *utility*, *roads*, and *rrlines*, with respective sizes 17, 30, and 36K. We also test the *Long Beach* (or *LB*) dataset, which contains 53K objects.^{6} These objects are represented as circles before indexing, and has the same uncertainty pdf information as that of the synthetic data.

To compare with R-tree, we use a packed R*-tree [30] to index uncertain objects. The R-tree uses 4K-byte disk pages, and has a fanout of 100. We keep all its non-leaf nodes in the main memory.

For the UV-index, each non-leaf node has four 4-byte pointers to its children. We set \(M\), the number of non-leaf nodes in the main memory, to be 10,000. The leaf nodes of both indexes, as well as the uncertainty information about the objects, are stored in the disk.

For \(T_\theta \), the splitting threshold used in constructing the UV-index, we have performed a sensitivity test. Under a wide range of \(T_\theta \), the indexes only have a slight performance difference. For very small values of \(T_\theta \) (e.g., 0.2), however, the adaptive grid tends not to split, and degrades into long linked lists of pages. Here, we set \(T_\theta \) to be 1.We wrote the programs in C++ and tested them on a Core2 Duo 2.66 GHz PC.

### 7.2 Results on query evaluation

We first study the performance of the queries studied in Sect. 3.2. We assume that the *LP* algorithm, presented in Sect. 5.4, is used to generate the UV-index. However, as we will discuss later, the different UV-index construction methods described here has little effect on query performance.

*1. The PNN Query.*We first compare the PNN performance of the UV-index and the R-tree. We present the average results of 50 query points randomly selected in the data domain. We use the numerical integration method of [15] to implement the probability computation of answer objects.

^{7}Figure 12a shows the query running time (\(T_q\)) for different synthetic datasets, with the number of objects ranging from 20 to 100K. The running times of both queries increase, because with a larger dataset, potentially more objects qualify as query answers, and this increases the time for index retrieval and probability computation. Our method outperforms R-tree in all cases. For example, when \(|O|=60\)K, the UV-index needs about 50 % of the time needed by the R-tree.

To understand why our method performs better, let us examine the traversal time of the UV-index, which is composed of the time costs for visiting non-leaf and leaf nodes. Since its non-leaf traversal time takes little time in all experiments (up to 3.9 \(\mu \)s), we only present the I/O overhead. In Fig. 12b, we compare the I/O performance of the UV-index and the R-tree. The UV-index requires significantly less number of I/Os than the R-tree (e.g., when \(|O|=60\)K, the UV-index consumes about one-fifth of the I/Os needed by the R-tree). When the R-tree is used to process a PNN query, plenty of leaf nodes needed to be retrieved. For the UV-index, we only need to look for the leaf node that contains the query point. Since the number of disk pages for each leaf node is also small, a high I/O performance can be attained. Also notice that the number of I/Os for the R-tree increases with \(|O|\), whereas that of the UV-index is relatively stable.

Results on real datasets

Dataset | \(|O|\) (K) | \(T_q\) (UV) (ms) | \(T_q\) (R-tree) (ms) | \(T_c\) (s) | \(p_c\) (%) |
---|---|---|---|---|---|

utility | 17 | 89 | 141 | 569 | 97.45 |

roads | 30 | 82 | 135 | 1,195 | 97.80 |

rrlines | 36 | 107 | 159 | 1,340 | 98.30 |

LB | 53 | 109 | 173 | 1,579 | 98.22 |

*2. The UV-Partition and the UV-cell Queries.* We now examine the efficiency of our index for answering the UV-partition query on our synthetic dataset. For each size of a query region \(R\), 50 queries are generated, whose centers of \(R\) are uniformly distributed in the data domain. We can see from Fig. 12e that the retrieval time of UV-partitions (\(T_q\)) increases with the size of \(R\), since more UV-partitions are loaded when \(R\) becomes larger. The increase is almost linear, and the query evaluation time is less than 160 ms. We have also examined the performance of the UV-cell queries on the default synthetic dataset. On average, the time for obtaining a UV-cell from the UV-index is 58.46 ms, or equivalently, 4.62 I/Os. Thus, running a UV-cell query costs little time in our experiments.

*3. The CPNN query.* To generate a CPNN query, we use the CanuMobiSim simulator,^{8} which produces a moving-point trajectory. The movement of a query point follows a random walk model, as detailed in [34]. The location of a query point, which changes at a maximum speed of 100 units per second, is reported every second. The default “trajectory length” of a query is 60, that is, each query has 60 location reports. In our experiments, each data point is the average of 50 queries.

*UV-index-n*, is a naïve application of the UV-index: each time a query point is received, the UV-index is consulted once. The second one, called

*UV-index-e*, is the enhanced version of

*UV-index-n*, where the UV-index is only consulted if the current query point is not located in the same grid as the previous one (Sect. 6.2). Figure 13a shows the evaluation time of a query over synthetic data of different sizes. As we can see, the query performance of the UV-index is at least 25 % times better than the R-tree. The reason can be explained by Fig. 13b, which shows the number of I/Os required by these methods. We observe that the I/O cost of issuing a CPNN on the UV-index is much lower than that of the R-tree. For example, when \(|O|=60k\), the query cost of the UV-index algorithms is about 30 % of the R-tree. We also see that

*UV-index-e*performs better than

*UV-index-n*. When the current query point \(q_1\) is located in the grid \(g\) that also contains the previous query point \(q_0\),

*UV-index-e*uses the objects associated with \(g\) to answer the PNN at \(q_1\). Thus, the effort of traversing the UV-index for \(q_1\) can be saved. This saving is quite significant; at \(|O|=60k\), for instance, the number of I/Os required by

*UV-index-e*is only 66 % of that of

*UV-index-n*. In Fig. 13c, we examine the effect of the query trajectory length. Again, the

*UV-index-e*performs the best among the three access methods.

### 7.3 Storage cost analysis

### 7.4 Results on UV-Index Construction

*Basic*: a UV-cell is derived using Algorithm 1, which is then used to build the UV-index;*ICR (I- and C-pruning with Refinement)*: collect cr-objects through I- and C-pruning (Algorithm 2), compute UV-cells and obtain the r-objects, then index them with Algorithm 4.*IC (I- and C-pruning)*: the cr-objects obtained through I- and C-pruning are used directly to construct the UV-index by Algorithm 4.

*IC*and

*ICR*), we set

*k*to 300 for performing the

*k*-NN search. Then, the domain \(D\) is divided into \(k_s=30\) sectors to obtain the seeds.

*Basic*increases sharply with the dataset size; handling a \(40K\) dataset requires about 60 h. This is because constructing a UV-cell requires an exponential amount of time and numerous complex hyperbola intersections. For

*ICR*and

*IC*, the use of I- and C-pruning significantly reduces the number of objects examined. Their effects are shown in Figure 15(b), where \(p_c\), the pruning ratio, denotes the fraction of objects from \(O\) that has been filtered. At \(|O|\)=60k, I-pruning and C-pruning achieve a pruning ratio of 98.9 and 99.5 % respectively. Hence, a large portion of objects are removed before being considered for constructing the UV-cell. Next, we examine

*ICR*and

*IC*.

*IC versus ICR.* As shown in Fig. 15c, *IC* performs much better than *ICR*. For example, at \(|O|=80\)K, the construction time of *IC* is about 10 % of that of *ICR*. To understand why, we analyze their time components in Fig. 15d, e. Recall the difference between the two methods is that *ICR* needs to find out the exact r-objects (by constructing an exact UV-cell based on the objects returned by pruning), while *IC* does not. For *ICR*, Fig. 15d shows the fraction of the construction time spent on: (1) seeds selection, (2) initial possible region computation, (3) I- and C-pruning, (4) generating r-objects, and (5) indexing UV-cells. For most datasets, *ICR* spends most of the time to generate exact r-objects, which is very costly. For *IC*, r-object is not produced (Fig. 15e). Instead, the cr-objects produced by the pruning methods are immediately passed to Algorithm 4 for indexing. The number of cr-objects generated, while larger than that of r-objects, does not increase the indexing time significantly.

In Fig. 15f, the construction time of *ICR* increases sharply with the objects’ uncertainty region sizes. With larger uncertainty regions, it is more likely that these regions overlap with each other, making it harder to prune the objects, so that more time is needed to generate r-objects. On the other hand, *IC* is relatively insensitive to the change in uncertainty region sizes.

We have also measured the query times between the indexes created by *IC* and *ICR*. Figure 15g shows that the UV-index generated by the two methods is highly similar, resulting in a close query performance. The query cost of *ICR* is about 0.01 I/Os, or 0.13 ms, better than *IC*. In the sequel, we assume that *IC* is used.

*Model-based index construction.*In Sect. 5.2, we have demonstrated how to use the UV-cell model (Sect. 4.4) to facilitate seed selection for objects whose locations are uniformly distributed. We call the UV-index construction algorithm that employs this method as

*Model*, and the one that does not use it as

*Non-model*. We evaluate these two algorithms on our synthetic datasets. As we can see from Fig. 16a,

*Model*performs better than

*Non-model*in most cases. When \(|O|\)=80 k, about 20 % of the index construction time is saved. Figure 16b illustrates that

*Model*is consistently better than

*Non-model*under different uncertainty region sizes. For example, when the radius of an uncertainty region is 80, the time required by

*Model*is about half of that of \(IC\).

To understand why *Model* performs well, we compare the difference between the size \(S\) of a UV-cell estimated by our model, and its “true” size. Again, \(S\) is the length of the MBR that tightly bounds the estimated UV-cell. The true size of the UV-cell can be obtained by using Algorithm 1. Based on the vertices of this UV-cell, we obtain its minimum bounding rectangle (MBR). We use the larger length of the two dimensions of this MBR to represent the size of the UV-cell. Figure 16c shows the average size of a UV-cell under different uncertainty region sizes. The UV-cell size increases with the uncertainty region radius, since an object can be in more possible locations. This increases its chance to be a possible nearest neighbor of a query point. In this experiment, our method offers a reasonable estimation of the UV-cell’s size—the estimation error is between 4 and 12 %. This enables the selection of seeds, as well as the index construction algorithm, to be effective.

*Batch processing.*We next examine the performance of

*LP*, which derives cr-object based on groups of data objects (Sect. 5.4). We compare

*LP*with

*single*, which generates a cr-object set for each data object separately. We do not use model-based seed selection in these experiments. Figure 17a shows that

*LP*performs better than

*single*on our synthetic datasets. At \(|O|=80\)k, the time cost of

*LP*is about 60 % of that of

*single*. In

*LP*, the cr-object set generation cost is shared among a group of objects.

We also test the performance of *single* and *LP* on larger datasets. We use the same synthetic data generator to produce two datasets that contain 0.5M and 1M objects. The 1M dataset occupies 640Mbytes. The new result, illustrated in Fig. 17b, shows that the construction performance of both *single* and *LP* increases with the dataset size in a linear manner. For the 1M dataset, *LP* needs 7.7 h, which is 23 % faster than *single*.

Figure 17c shows that when *LP* is used, the seed selection time of *single* is shortened by more than 80 %. While *single* generates seeds for every object individually, in *LP*, the seeds of every object in set \(G\) are retrieved from a set of objects \(C_G\) (Step 2 of Algorithm 3). We can also see that the I- and C-pruning time required by *LP* is also less than *single*; when \(|O|=60k\), the improvement is over 60 %. In *single*, I-pruning is done for every object; in *LP*, I-pruning is only done once for every group. The performance gap is more profound when \(|O|\) is large, since the same domain is populated with more objects, resulting in more candidates retrieved after I-pruning.

We also examine the effect of the average uncertainty region size on the construction time. As discussed before, the larger this size, the more construction time will be needed. Figure 17d shows that *LP* is more stable than *single*. When the uncertainty region size is 60, *LP* needs more about 60 % time of *single*; when the size becomes 100, *LP* is 3.5 times faster than *single*. In Fig. 17e, we compare the query performance of the UV-indices generated by *single* and *LP*. We observe that the number of I/Os required by the two methods is the same. Their probability computation times, not shown here, are also very close. Hence, the query performance of two methods is almost the same.

Next, we compare the construction time of the R-tree and UV-index, using *single* and *LP*. Figure 17f shows that the construction cost of the R-tree is less than 1 % of that of the UV-index. Hence, the R-tree introduces little overhead to the UV-index construction process. However, it improves the performance of generating the UV-index. For instance, the I-pruning phase can be executed more efficiently with the use of the R-tree.

For real datasets, *LP* also outperforms *single* (Fig. 17g). In *rrline*, for example, *LP* needs one-third of the time required by *single*. *LP* also achieves a high pruning ratio, as shown in Table 2. This explains why *LP* requires less time to construct a UV-index, compared with *single*.

*Skewness.*In Fig. 18, we study the effect of data skewness, by varying the variance (\(\sigma \)) of the objects’ mean positions. We can see that when the data are more skewed (i.e., with a smaller variance), the construction time is higher, because in a dense area where uncertainty regions have high degree of overlap, an object’s UV-cell is likely small and associated with many r-objects. The

*LP*algorithm is still more efficient than

*single*. In the most skewed dataset that we tested (\(\sigma =\)1,500),

*LP*is 33.3 % faster than

*single*.

Finally, we examine how a skewed distribution of the centers of uncertainty regions can affect our results. We obtain a 60k dataset that follows the zipfian distribution, by using the same generator that produces our uniformly distributed dataset. For the zipfian distribution, the average query I/O costs for *IC* and *ICR* are 2.48 and 2.41. Thus, the query performance of ICR is 0.07 I/Os (or 2.8 %) better than IC. Since their time difference is small (around 0.4 ms), we use *IC* in the rest of the experiments.

*LP*) technique. Observe that the construction time of the zipfian distribution is worse than the uniform distribution. In a skewed dataset, a UV-cell in a very dense area can be determined by many r-objects, and this renders lower pruning efficiency in the construction phase. However, there is only a slight query I/O difference between the two distributions, and the query performance for both distributions is almost the same.

Results on zipfian distribution

\(|O|=60k\) | Uniform | Zipfian | |
---|---|---|---|

\(LP\) | \(Single\) | \(LP\) | |

\(T_c\)(hours) | 0.45 | 5.78 | 2.46 |

\(T_q(I/Os)\) | 2.00 | 2.48 | 2.45 |

In the same table, we study the difference between *single* and *LP* for zipfian distribution. Notice that *LP* requires about 42 % of time needed by *single*. This means that our batch processing method improves the construction performance for zipfian distribution significantly. The query performance of the UV-index constructed by *LP* is also slightly better (0.03 I/Os) than *single*.

## 8 Conclusions

The UV-diagram is a variant of the Voronoi diagram designed for uncertain data. To tackle the complexity of constructing and evaluating a UV-diagram, we introduce the concept of UV-cells and cr-objects. We study the theoretical size of a UV-cell. We propose an adaptive index for the UV-diagram, and develop efficient algorithms for building it. We also present a batch processing algorithm to further reduce the UV-index construction time. Our experiments show that this index efficiently supports PNNs and other UV-diagram-related queries.

We plan to study the use of the UV-diagram to support other variants of probabilistic NNQs, for example, approximate NNQs [12, 13]; monochromatic and bichromatic reverse-nearest-neighbor (RNN) queries [10, 27, 42]; and \(k\)-RNN queries [11]. Another interesting problem is to design a UV-diagram such that whenever a query point is located in a UV-cell \(U_i\), we can know that the qualification probability of \(O_i\) is larger than some threshold \(T\). By using this variant of UV-diagram, we can get all the objects with qualification probability larger than \(T\), without computing their actual probabilities. This could be beneficial to queries where a user is only interested in answers with qualification probabilities larger than \(T\). It is also interesting to examine how the UV-diagram can support multi-dimensional data and incremental updates.

The centers of uncertainty regions form the vertices of \(n\) hexagons, each of which has an area of \(\frac{\sqrt{3}d_0^2}{2}\). Since \(|D| = n \times \frac{\sqrt{3}d_0^2}{2}\), \(d_0 = \sqrt{\frac{2|D|}{\sqrt{3}n}}\).

As shown in Fig. 2, a UV-cell can be irregular in shape, and so estimating its size is not easy. Thus, we use a simple data model here. We will also explain how these results can be applied to uniformly distributed data, in Sect. 5.2.

We do not execute iPrune\((P_i,O)\) after Step 4 because the set of objects returned by iPrune is often the superset of \(C_G\) in our experiments. Thus, iPrune is not very effective here.

We adopt the quad-tree rather than the R-tree. While R-tree MBRs may overlap, quad-tree grids do not. Issuing a point query on non-overlapping UV-partitions in quad-tree is thus more convenient than R-tree.

## Acknowledgments

Reynold Cheng, Xike Xie, Liwen Sun, and Jinchuan Chen were supported by the Research Grants Council of Hong Kong (GRF Projects 711110, 711309E, 513508). We would like to thank the anonymous reviewers for their insightful comments.

### Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.