UV-diagram: a Voronoi diagram for uncertain spatial databases
DOI: 10.1007/s00778-012-0290-x
Cite this article as: Xie, X., Cheng, R., Yiu, M.L. et al. The VLDB Journal (2013) 22: 319. doi:10.1007/s00778-012-0290-x
Abstract
The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint “UV-partitions”. Each UV-partition \(P\) is associated with a set \(S\) of objects, such that any point \(q\) located in \(P\) has the set \(S\) as its nearest neighbor with non-zero probabilities. The UV-diagram enables queries that return objects with non-zero chances of being the nearest neighbor (NN) of a given point \(q\). It supports “continuous nearest-neighbor search”, which refreshes the set of NN objects of \(q\), as the position of \(q\) changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of UV-cells. A UV-cell of an object \(o\) is the extent \(e\) for which \(o\) can be the nearest neighbor of any point \(q \in e\). We study how to speed up the derivation of UV-cells by considering each object’s nearby objects. We also use the UV-cells to design the UV-index, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.
Keywords
Voronoi diagram, Uncertain data, Nearest-neighbor query
1 Introduction
Is it possible to use the Voronoi diagram to perform nearest-neighbor search on objects whose values are imprecise? Data values can be uncertain for a variety of reasons. Consider a satellite image, which depicts geographical objects like airports, vehicles, and people. Using machine learning and human effort (e.g., community-based systems like Wikimapia), the location of each object on the image can be obtained. Due to the noisy transmission of satellite data, the quality of these images can be affected, and we may not be able to obtain very accurate locations. Moreover, if this location information is released to the public (e.g., for research purposes), it may need to be preprocessed for privacy purposes. In fact, recent proposals like [1, 2] have suggested representing a user’s position as a larger region, in order to lower the likelihood that a user is identified at a particular site. Uncertainty is also inherent in biological data management. For example, microscopy images have been actively used to analyze the thickness of neuron layers in the retina, as well as the extent of the area of a cell. Due to factors like image resolution and measurement accuracy, it is hard to obtain exact values of the objects of interest [28, 29]. For this kind of data, techniques for evaluating range queries, nearest-neighbor queries, and joins have been developed. These queries return answers with probabilistic guarantees, which reflect the confidence of answers due to data uncertainty. For these applications, tools that resemble the Voronoi diagram can be potentially useful. Specifically, we would like to examine space-partitioning techniques for performing a Probabilistic Nearest-Neighbor Query (PNN). Given a query point \(q\), a PNN returns the IDs of objects with non-zero probabilities for being the closest to \(q\), as well as their probabilities.
In the sequel, we denote the objects returned by the PNN as answer objects, and their probability values as qualification probabilities.
An uncertainty model that has been commonly used is to assume that an object \(O_i\) has an “uncertainty region” and a probability distribution function (pdf). This means that the precise position of \(O_i\) can only be located inside the (closed) region, with a pdf that describes the distribution of the object’s position within the region. The uncertainty region can have any shape, and the pdf is arbitrary (e.g., it can be a uniform distribution, Gaussian, or a histogram). Here, we assume that \(O_i\) has a two-dimensional circular uncertainty region. We will also explain how our solution can be extended to handle non-circular regions. We examine how the Voronoi diagram should be defined to support PNN execution. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), where the nearest-neighbor information of every point in the data space is recorded, based on the uncertain objects involved. The UV-diagram provides a basis for studying solutions that used the Voronoi diagram for point data. It could be interesting, for instance, to extend the solution of [44] to support uncertain data in broadcasting services. Figure 1b illustrates an example of the UV-diagram of seven uncertain objects, where the space is divided into disjoint regions called UV-partitions. Each UV-partition \(P\) is associated with a set \(S\) of one or more objects. For any point \(q\) located inside \(P\), \(S\) is the set of answer objects of \(q\) (i.e., each object in \(S\) has a non-zero probability for being the nearest neighbor of \(q\)). The highlighted regions contain points that have two or more nearest-neighbor objects. As an example, since \(q_1\) is inside the dashed region, \(O_4\) has a non-zero probability for being the nearest neighbor of \(q_1\); on the other hand, \(q_2\) is located inside the dotted region, and \(O_6\) and \(O_7\) are the answer objects for the PNN with \(q_2\) as the query point.
Observe that the Voronoi diagram, which indexes on spatial points, is a special case of the UV-diagram, since a point can be viewed as an uncertainty region with a zero radius. Figure 1 compares the two diagrams.
The Voronoi diagram can also be used in other applications. For example, a continuous nearest-neighbor query, which constantly returns the nearest neighbor (e.g., gas station) of a moving point \(q\) (e.g., a vehicle), is a typical operation in location-based services [32, 43]. The Voronoi diagram supports this query; particularly, the Voronoi cell that contains the current location of \(q\) can be easily retrieved. We will illustrate how to use the UV-diagram to track the possible nearest neighbors of a moving point. Another use of the Voronoi diagram is to perform data analysis or observe interesting patterns of nearest-neighbor information. In [41], the Voronoi diagram is used to investigate the spreading pattern of Bluetooth viruses among mobile users. We can also use the UV-diagram to provide valuable information about these “nearest-neighbor patterns”. In Fig. 1b, if the dashed region is large, it indicates that \(O_4\) has a high chance of being placed in different clusters (assuming that a nearest-neighbor clustering algorithm is used). Another interesting query is as follows: given a region \(R\), display all UV-partitions that intersect with \(R\), as well as the density of objects that can be the nearest neighbor in each UV-partition. Hence, a UV-diagram allows a user to visualize patterns about the nearest-neighbor information.
Our solutions. Instead of computing UV-partitions, we have developed an alternative interpretation of the UV-diagram. For every object \(O_i\), we consider the extent \(a_i\) such that \(O_i\) can be the nearest neighbor of any point selected from \(a_i\). We call this extent the UV-cell of \(O_i\). We examine some basic properties of a UV-cell (e.g., its size and number of edges). We show how to represent a UV-cell as a set of objects, and develop novel methods to find this object set efficiently. For example, our batch-construction algorithm allows the UV-cells of objects that are physically close to each other to be swiftly obtained. We propose a polynomial-time method for constructing an index for the UV-partitions, called the UV-index. We adopt an adaptive-grid indexing scheme, which has the advantage of adapting to different distributions of uncertain objects’ positions. Our experimental results show that on both synthetic and real datasets, this index can be constructed in a much shorter time. We also explain how to use the UV-index to support different applications (e.g., PNN and nearest-neighbor pattern queries).

To summarize, our contributions are to:

Study the UV-diagram and its basic properties;

Propose efficient algorithms for obtaining a UV-cell;

Design the UV-index;

Use the UV-index to support different queries; and

Conduct experiments on real and synthetic datasets.
2 Related work
Data uncertainty management. Recently, researchers have proposed treating uncertainty as a “first-class citizen” in a DBMS [13, 14, 18, 39]. Two models can be used to represent uncertain data: tuple and attribute uncertainty. For tuple uncertainty, each database tuple has a probability of being correct [18]. Here, we assume attribute uncertainty, which represents an attribute as a range of possible values and a probability distribution function (pdf) bounded in the range [39]. Common queries for attribute uncertainty include range queries [16], \(k\)-nearest-neighbors [28], skylines [25, 36], and top-\(k\) queries [20].
A few works have been proposed to evaluate PNN queries over attribute uncertainty. In [15], numerical integration techniques have been presented. Probabilistic verifiers, described in [13], can generate answer objects’ probability bounds without performing expensive integration operations. Another way to compute answer probabilities is based on sampling [24]. In this paper, we focus on the efficient retrieval of answer objects.
To the best of our knowledge, the only indexing method available for nearest-neighbor search over uncertain data is to use an index like the R-tree or the grid. The R-tree is a disk-based structure that uses minimum bounding rectangles (MBRs for short) to cluster the uncertainty regions of the objects, and organizes MBRs in a hierarchical manner [6]. To evaluate PNN using the R-tree, a branch-and-prune strategy has been proposed in [15], where MBRs that may contain answer objects are traversed. However, this involves a lot of I/O cost in reading index nodes and leaf pages [13, 15]. Similar issues also occur with grids [31]. On the other hand, retrieving answer objects from the UV-diagram is essentially a point query search: given a point \(q\), find the objects associated with the UV-partition that contains \(q\). Hence, a UV-diagram can support more efficient PNN search. While it is not clear how an R-tree or grid over uncertain objects can provide pattern analysis of nearest-neighbor information (e.g., displaying the extent of a UV-partition), we will show how to use the UV-diagram to provide this information.
Other types of nearest-neighbor queries, like the “group nearest-neighbors” [26], “reverse-nearest-neighbors” [10, 27], “uncertain queries” [8], and “continuous nearest-neighbor queries” [12] have also been proposed. In these works, the R-tree was used to support object retrieval. It is interesting to study how the UV-diagram can be used to support the execution of these queries. In this paper, we study how to use the UV-diagram to support the execution of continuous nearest-neighbor queries.
The Voronoi diagram is an important technique for answering nearest-neighbor queries over spatial points [33]. It has been extended to support other applications (e.g., [7, 32, 43–45]). It also facilitates the analysis of spreading patterns of mobile viruses [41]. In [9], the \(k\)th-order Voronoi diagram is used to evaluate a \(k\)NN query. In [38], an index called VoR-Tree is designed to merge Voronoi diagrams into the R-tree in order to answer various nearest-neighbor queries. The Voronoi diagram has also been defined for boundaries of circular objects in [23]. However, these objects are not uncertain, and the method of [23] cannot be used to answer PNN queries.
Few works have studied the application of the Voronoi diagram on uncertain data. The authors of [8] consider the “uncertain” nearest-neighbor query (UNN) over spatial points. Unlike the PNN, the query is an uncertain region, not a query point. To evaluate a UNN, the authors propose to use a Voronoi diagram over 2D points. The portions of the Voronoi cells that overlap with the query’s uncertainty region are then used to compute answer probabilities. The authors of [22] consider the clustering of uncertain attribute data, where a Voronoi diagram is constructed for centroid points. Notice that [8, 22] do not construct a Voronoi diagram for uncertain data. On the other hand, the UV-diagram is a Voronoi diagram tailored for attribute uncertainty.
In [21, 37], the Voronoi diagram was modified to identify an imprecise object that is surely the nearest object of a query point \(q\). In contrast, the UV-diagram returns the object(s) that have a chance of being the nearest neighbor of \(q\), and can be used to answer probabilistic nearest-neighbor queries. We also study a database index for the UV-diagram, which has not been examined in these two works.
This paper is an extension of [17]. Here, we improve the performance of UV-index construction by proposing batch pruning, which reduces the workload of generating UV-cells for a set of nearby objects. We provide a more detailed study of the basic properties of a UV-cell (e.g., its size and number of edges). We also examine how the UV-index can be used to answer PNN queries for a moving query point. We conduct new experiments to validate the effectiveness of these approaches.
3 The UV-diagram
We now present the basic notions of the UV-diagram.
We introduce the UV-cell, an alternative representation of the UV-diagram, in Sect. 3.1. We then study some applications of the UV-diagram, in Sect. 3.2.
3.1 The UV-cell
As discussed before, the UV-diagram can be expensive to construct. We hence propose an alternative representation of the UV-diagram, by using UV-cells. We will later explain how the UV-cells facilitate efficient construction of the UV-diagram. Now, let \(O_1,O_2,\ldots ,O_n\) be the IDs of a set \(O\) of uncertain objects, and \(D\) be a two-dimensional space that contains these objects. For simplicity, we assume that \(D\) is a square. The UV-cell is then defined as follows:
Definition 1
A UV-cell of \(O_i\), denoted by \(U_i\), is a region in \(D\) such that \(O_i\) has a non-zero probability to be the nearest neighbor (NN) of a point \(q\), where \(q \in U_i\).
Figure 2 illustrates the UV-cells for \(O_1\), \(O_2\), and \(O_3\). The boundary of each UV-cell is labeled with the ID of the object. For example, the UV-cell of \(O_2\) is a region enclosed by solid-line segments.
The UV-cell can be used to recover the UV-partitions (i.e., disjoint regions of a UV-diagram). In fact, a UV-partition that contains \(q\) is the intersection of all UV-cells that contain \(q\). This is because the objects associated with these UV-cells have non-zero qualification probabilities for \(q\). For instance, in Fig. 2, the UV-cells of both \(O_1\) and \(O_3\) intersect at partitions \(R_5\) and \(R_7\). This means that when \(q\) is located at any of these partitions, both \(O_1\) and \(O_3\) are the answer objects. Since \(R_7\) is intersected by \(O_2\)’s UV-cell, \(O_2\) is also associated with \(R_7\). Therefore, a UV-diagram is the union of all objects’ UV-cells. Besides, the UV-cells of all objects can be used to output which object(s) is/are the nearest neighbor of \(q\) with non-zero probabilities.
Notations and meanings
| Notation | Meaning |
|---|---|
| Objects and query | |
| \(D\) | Domain space (a square) |
| \(O\) | A set of uncertain objects (\(O_1,O_2,\ldots ,O_n\)) |
| \(MBC(O_{i})\) | Minimum bounding circle of object \(O_i\) |
| \((c_i,r_i)\) | Center and radius of \(O_i\)’s uncertainty region |
| \(q\) | Query point of a PNN |
| \(\rho \) | Density of objects in \(D\) |
| UV-diagram | |
| \(\odot (c,r)\) | A circle centered at \(c\) with radius \(r\) |
| \(dist(q,c_i)\) | Euclidean distance between \(q\) and \(c_i\) |
| \(dist_{min}(q,O_i)\) | Minimum distance of \(O_i\) from \(q\) |
| \(dist_{max}(q,O_i)\) | Maximum distance of \(O_i\) from \(q\) |
| \(U_i\) | UV-cell of \(O_i\) |
| \(P_i\) | Possible region of \(O_i\) |
| \(E_i(j)\) | UV-edge of \(O_i\) w.r.t. \(O_j\) |
| \(X_i(j)~~(~\overline{X_i(j)}~)\) | Outside (inside) region of \(O_i\) w.r.t. \(O_j\) |
| \(F_i\) | r-objects of \(O_i\), where \(F_i \subseteq O\) |
| \(C_i\) | cr-objects of \(O_i\), where \(C_i \subseteq O\) |
| \(M\) | Maximum no. of non-leaf nodes |
| \(s\) | Estimated size of a UV-cell |
| \(T_\theta \) | Split threshold |
3.2 Applications of the UV-diagram
The UV-diagram supports a number of applications. Let us now explain how to use the UV-diagram to handle the following queries:
1. The Probabilistic Nearest-neighbor (PNN) Query. This query has been mentioned in Sect. 1. To evaluate a PNN for a given point \(q\), we can find out the UV-partition that contains \(q\). The set \(A\) of objects associated with this partition are those that can be the nearest neighbor of \(q\). Notice that the UV-partitions can be obtained by finding the union of all the UV-cells. For each object \(O_i \in A\), the probability that \(O_i\) is the closest to \(q\) can be efficiently evaluated by using solutions in [13, 15, 24].
2. The Continuous PNN Search (CPNN), an extension of the PNN, is a query that resides in the processing server for an extensive period of time. Different from PNN, the position of a query point \(q\) changes with time [12]. The objective of the CPNN is to refresh the PNN answer, when the value of \(q\) changes. This query can be used in transportation services, where \(q\) can be a moving vehicle or person, and the data can be the geographical objects retrieved from satellite images. Assuming that \(q\) reports its position to the server periodically, the UV-diagram can conveniently support CPNN. Suppose that the server receives a new position of \(q\), say, \(q_1\). A simple solution is to issue a new PNN for \(q_1\). However, if \(q_1\) is located in the same UV-partition that contains the old position of \(q\), then it suffices to use the objects associated with that UV-partition to compute the query answer for \(q_1\). The cost of retrieving the UV-partition that contains \(q_1\) is thus saved.
3. The UV-partition Query. The UV-diagram can also be used to retrieve the distribution and pattern information about nearest neighbors, which can be useful for analysis purposes (e.g., [41]). One such “pattern-analysis” operation is the UV-partition query. Given a region \(R\), this query retrieves all UV-partitions inside \(R\) and the “density” of each partition \(R_j\) (which is equal to the number of objects associated with \(R_j\), divided by the area of \(R_j\)). This allows a user to examine the density distribution of the nearest neighbors in the area of interest \(R\).
4. The UV-cell Query. This is another pattern-analysis operation. Given an object \(O_{i}\), it returns the extent and the area of \(O_{i}\)’s UV-cell. The query user can then obtain the area of the region where \(O_{i}\) may be the nearest neighbor. This area can reflect the “influence” of \(O_{i}\) (in terms of the nearest-neighbor information). The shape of the UV-cell can also be displayed on the user’s computer screen for further analysis.
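The PNN answer set described in item 1 can be illustrated with a brute-force check that does not use the UV-diagram at all. The sketch below assumes circular uncertainty regions stored as `(center, radius)` pairs and applies the distance condition used later in the paper: \(O_i\) is an answer object of \(q\) unless some other object \(O_j\) satisfies \(dist_{max}(q,O_j) \le dist_{min}(q,O_i)\). The function names and the dictionary layout are illustrative assumptions, not part of the paper.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dist_min(q, obj):
    """Minimum distance from q to a circular uncertainty region (center, radius)."""
    center, radius = obj
    return max(0.0, dist(q, center) - radius)

def dist_max(q, obj):
    """Maximum distance from q to a circular uncertainty region (center, radius)."""
    center, radius = obj
    return dist(q, center) + radius

def answer_objects(q, objects):
    """Brute-force PNN answer set (hypothetical helper): IDs of objects that
    no other object is guaranteed to beat as the nearest neighbor of q."""
    result = set()
    for i, obj_i in objects.items():
        beaten = any(
            dist_max(q, obj_j) <= dist_min(q, obj_i)
            for j, obj_j in objects.items() if j != i
        )
        if not beaten:
            result.add(i)
    return result

# Three objects on a line; IDs and coordinates are made up for illustration.
objects = {1: ((0.0, 0.0), 1.0), 2: ((3.0, 0.0), 1.0), 3: ((10.0, 0.0), 1.0)}
print(answer_objects((1.5, 0.0), objects))
```

This scan costs \(O(n^2)\) per query point; the point of the UV-diagram is to replace it with a lookup of the UV-partition that contains \(q\).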
Since the UV-diagram is expensive to construct, in Sect. 6, we revisit how the above queries can be implemented by the UV-index, which is an approximate version of the UV-diagram. We next address the UV-cell in detail.
4 More about UV-cells
We now investigate the UV-cell, which is important for constructing the UV-index, in more detail. We first present a simple method for constructing a UV-cell in Sect. 4.1. We then examine the shape of a UV-cell in Sect. 4.2. The number of edges of a UV-cell and its size are studied in Sects. 4.3 and 4.4, respectively.
4.1 Constructing a UV-cell
We can now present a simple method for constructing a UV-cell. Let us define the outside region:
Definition 2
The outside region of UV-edge \(E_i(j)\), denoted by \(X_i(j)\), is the region on one side of \(E_i(j)\) such that for any point \(q \in X_i(j)\), \(O_j\) is always closer to \(q\) than \(O_i\). We call the complement of \(X_i(j)\) the inside region, denoted by \(\overline{X_i(j)}\).
Definition 3
According to the definition, the possible region should be an area that completely covers the UV-cell of \(O_i\). An example of an object’s possible region is the domain \(D\), since \(D\) must cover any UV-cell. Here, \(R\) is the empty set. Notice that a UV-cell is also a possible region.
In Fig. 3, the outside region of the UV-edge \(E_i(j)\) is the area on the right of the solid line. Notice that since \(q_0\) is in the outside region of \(E_i(j)\), \(O_j\) is closer to \(q_0\) than \(O_i\), and thus \(O_i\) cannot be \(q_0\)’s nearest neighbor.

Case 1: \(s\) is inside \(P_i\): We refine \(P_i\) by using \(s\) as one of the new edges of \(P_i\). Some existing edges of \(P_i\) are removed if necessary.

Case 2: \(s\) is outside \(P_i\) (except the end points of \(s\)): \(P_i\) cannot be changed by \(s\), and we do not have to do anything to handle this case.
Note that the order of selecting the object for refining \(O_i\)’s possible region (Steps 4–6) does not affect the correctness of the algorithm; the UV-cell is produced by “shrinking” the possible regions by using the outside regions of other objects. Also, not all objects are useful in shaping a UV-cell. In Sect. 5, we will explain how to prune away these objects.
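The “shrinking” idea can be sketched on a raster grid. The code below is not the exact algorithm, which clips the possible region against hyperbolic UV-edges. Instead it approximates the possible region by a set of grid cells over a square domain and removes every cell whose center lies in some outside region \(X_i(j)\). Circular uncertainty regions, the grid resolution `res`, and all function names are assumptions made for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def in_outside_region(q, obj_i, obj_j):
    """True if O_j is always closer to q than O_i, i.e. q lies in X_i(j).
    Objects are (center, radius) pairs (an assumed representation)."""
    (ci, ri), (cj, rj) = obj_i, obj_j
    return dist(q, cj) + rj <= max(0.0, dist(q, ci) - ri)

def raster_uv_cell(i, objects, side, res):
    """Raster approximation of O_i's UV-cell on a square domain [0, side]^2.
    Starts from the whole domain as the possible region and subtracts the
    outside region of every other object, one object at a time."""
    step = side / res
    region = {(ix, iy) for ix in range(res) for iy in range(res)}
    for j, obj_j in objects.items():
        if j == i:
            continue
        region = {
            (ix, iy) for (ix, iy) in region
            if not in_outside_region(
                ((ix + 0.5) * step, (iy + 0.5) * step), objects[i], obj_j)
        }
    return region

objects = {1: ((2.0, 5.0), 1.0), 2: ((8.0, 5.0), 1.0)}
cell = raster_uv_cell(1, objects, side=10.0, res=10)
print(len(cell), "of 100 grid cells remain in the approximate UV-cell of O_1")
```

Because each step is a set difference, the result does not depend on the order in which the other objects are visited, which mirrors the remark above about Steps 4–6.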
4.2 The shape of a UV-cell
We now study a mathematical representation of the UV-cell. We also derive the number of UV-edges of a given UV-cell. Here, we assume that the uncertainty region of \(O_i\) is a circle, with center \(c_i\) and radius \(r_i\), with \(r_{i}>0\). Later, we discuss the UV-cell of a “point uncertainty” (i.e., \(r_{i}=0\)), and also uncertainty regions that are not circular in shape.

\(a=\frac{r_i + r_j}{2}\), \(c=\frac{dist(c_i,c_j)}{2}\), and \(b=\sqrt{c^2 - a^2}\);

\(x_{\theta }=(x - f_x)\cos \theta + (y - f_y)\sin \theta \);

\(y_{\theta }=(f_x - x)\sin \theta + (y - f_y)\cos \theta \).
Equation 7 shows that a UV-cell is composed of the intersections of one or more UV-edges, which are hyperbolas. Since a hyperbola is a conic curve, a UV-edge must be concave in shape. In Fig. 2, apart from the edges of the domain space, the UV-cells of the three objects have concave edges. Note that Eq. 7 has two curves, which represent the UV-edges for each pair of objects involved. For example, in Fig. 3, the solid line is the UV-edge of \(O_i\) w.r.t. \(O_j\), and the dotted line is the UV-edge of \(O_j\) w.r.t. \(O_i\).
If two objects overlap, then \(dist(c_i,c_j) < r_i+r_j\), and in Eq. 7, \(b\) is not a real number. Physically, this means \(E_i(j)\) cannot be found, and we can treat \(X_i(j)\) as an empty region.
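As a small numerical illustration of the parameters \(a\), \(b\), and \(c\) and of the overlap case, the sketch below computes them for a pair of circular objects and returns `None` when \(b\) would not be real. It also evaluates the residual \(dist(q,c_i) - dist(q,c_j) - (r_i+r_j)\), which is assumed here as a convenient point test: it is zero on \(E_i(j)\) and positive inside the outside region \(X_i(j)\), following from \(dist_{max}(q,O_j) \le dist_{min}(q,O_i)\) for circles. The function names are hypothetical.

```python
import math

def uv_edge_params(ci, ri, cj, rj):
    """Return (a, b, c) of the hyperbola carrying the UV-edge E_i(j),
    or None when the two circles overlap and b is not a real number."""
    a = (ri + rj) / 2.0
    c = math.hypot(ci[0] - cj[0], ci[1] - cj[1]) / 2.0
    if c < a:
        return None  # dist(c_i, c_j) < r_i + r_j: treat X_i(j) as empty
    b = math.sqrt(c * c - a * a)
    return a, b, c

def edge_residual(q, ci, ri, cj, rj):
    """dist(q, c_i) - dist(q, c_j) - (r_i + r_j):
    zero on E_i(j), positive inside the outside region X_i(j)."""
    d_i = math.hypot(q[0] - ci[0], q[1] - ci[1])
    d_j = math.hypot(q[0] - cj[0], q[1] - cj[1])
    return d_i - d_j - (ri + rj)

# Two unit circles whose centers are 10 units apart (made-up values).
print(uv_edge_params((0.0, 0.0), 1.0, (10.0, 0.0), 1.0))
print(edge_residual((6.0, 0.0), (0.0, 0.0), 1.0, (10.0, 0.0), 1.0))
```

In this example the point \((6, 0)\) is the vertex of the hyperbola branch that bends around \(c_j\), so its residual vanishes.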
We now revisit Algorithm 1. Step 4 is done using Eq. 7. Step 5 is performed by observing that the outside region of a UV-edge must be convex in shape. To perform Step 6 (i.e., cutting the possible region by an outside region), we compute the intersections of hyperbola equations using linear algebra techniques [3], which are detailed in Appendix 9.
Let us state an interesting observation about a possible region, which we will use later.
Lemma 1
The possible region of an uncertain object is a connected region without any hole inside it.
The proof of this lemma, detailed in Appendix 10, shows that a contradiction will result if a possible region contains a hole. We next discuss the shape of the UV-cell for other kinds of uncertainty regions.

If \(c_i\ne c_j\), without loss of generality, assume that \(r_{i}=0\). Then, \(E_i(j)\) can be obtained by Eq. 7, because all variables used in that equation are real numbers, and \(a,b\) are non-zero values. Notice that \(E_i(j)\) becomes a straight line segment (part of the perpendicular bisector of \(c_i\) and \(c_j\)) when \(r_i=r_j=0\).

If \(c_i=c_j\), then \(E_i(j)\) does not exist. If \(r_i\ne r_j\), Eq. 1 does not hold, and the UV-cell of \(O_{i}\), or \(U_{i}\), does not exist. If \(r_{i}=r_{j}\), Eq. 1 always holds, and \(U_{i}=D\).
4.3 The number of UV-edges of a UV-cell
Let us now examine the number of UV-edges of a UV-cell. As Algorithm 1 shows, for every object \(O_{i}\), its UV-edge with respect to other objects is used to refine its possible region \(P_{i}\) (Step 6). This requires computing the intersections of all edges of \(P_i\) with a new UV-edge \(E_i(j)\), for some object \(O_j\).
As shown in Fig. 4b, \(E_i(j)\) intersects with \(U_i\)’s UV-edge \((v_1^{\prime },v_2^{\prime })\) at \(v_5\) and \(v_6\). Thus, \((v_1^{\prime },v_2^{\prime })\) is replaced by three edges: \((v_4,v_5)\), \((v_5,v_6)\), and \((v_6, v_7)\).
From this example, we can see that \(E_i(j)\), a hyperbolic curve, can have at most 2 intersections with a UV-edge of \(P_i\); and 3 new edges can be created for \(P_i\) as a result. In the worst case, the number of edges of \(P_i\) triples whenever a new edge is considered.
In general, to obtain \(U_{i}\), we have to take into account \(n-1\) objects. Hence, the number of edges of the UV-cell has an (exponential) upper bound of \(O(3^{n})\).
Moreover, computing intersections between hyperbolas is complex. In our implementation, 60 hours are needed to create a UV-diagram of 40,000 objects by using Algorithm 1. We will explain how to find an efficient representation of the UV-cell, in Sect. 5.
4.4 The size of a UV-cell
Let \(H(d)\) be a set of six objects whose uncertainty regions’ centers have the same distance \(d\) from that of \(O_1\). For example, in Fig. 6, the centers of the uncertainty regions of \(H(d)=\{O_{2},\ldots ,O_{7}\}\) are \(d\) units away from that of \(O_{1}\). We claim the following:
Lemma 2
In the sequel, we use \(s(d)\) to denote the size of \(P_{1,d}\). The proof of Lemma 2 can be found in Appendix 11. Notice that \(P_{1,d}\) contains \(U_{1}\).
Theorem 1
If \(d_{0} > 2\sqrt{3} r\), then the square that minimally contains \(U_{1}\) has a size of \(s(d_{0})\) obtained from Eq. 8.
The main idea of the proof is that when \(d_{0} > 2\sqrt{3} r\), the six objects that form \(HEX_{1}\) alone contribute to the edges of \(O_{1}\)’s UV-cell. Its details can be found in Appendix 12.
An iterative approach of finding \(d^{*}\). We now explain how to derive the size of a square that contains \(U_{1}\), for any value of \(d_{0}\). Our goal is to find \(d^*\) among different values of \(d\), such that the square covering \(P_{1,d^*}\) is the smallest. We observe from the first-order derivative of \(s\) from Eq. 8 that \(d^{+}=2\sqrt{3}r\) is the only turning point, such that \(s\) monotonically decreases when \(\frac{4r}{\sqrt{3}}<d<d^+\), and monotonically increases when \(d\ge d^+\). However, this result cannot be readily used, since we may not be able to find six neighbors of \(O_{1}\) that are exactly \(d^{+}\) units apart from each other. We thus estimate \(d^*\) as follows. We first consider the objects on \(HEX_{1}\), and compute \(s(d_{0})\). We then consider \(HEX_{2}\), where each vertex is \(\sqrt{3}d_{0}\) from \(c_{1}\), and evaluate \(s(\sqrt{3}d_{0})\). We repeat this process, until the six objects found are \(d_{x}\) units apart from each other, where (1) \(d_x > \frac{4r}{\sqrt{3}}\) and (2) \(s(d_x) < s(\sqrt{3}d_x)\). Then, we set \(d^*=d_{x}\), and use Lemma 2 to find \(s\).
The above process only examines the values of \(d\) at \(d_{0}, \sqrt{3}d_{0},(\sqrt{3})^{2}d_{0},\ldots , \sqrt{D}\). Hence, at most \(\Big \lceil \log _{\sqrt{3}}(\sqrt{D}/d_0)\Big \rceil \) trials are needed to find \(d^{*}\). Although this procedure does not find the square that tightly contains a UV-cell, our experiments show that the approximation is highly accurate.
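The iteration over \(d_0, \sqrt{3}d_0, (\sqrt{3})^2 d_0, \ldots\) can be sketched as a short loop. Since Eq. 8 is not reproduced in this text, the size function `s` is passed in by the caller. The parameter `d_limit` stands for the largest value of \(d\) examined, and returning `None` when no value satisfies both stopping conditions is an assumption of this sketch. The quadratic `toy_s` below is a made-up stand-in with its minimum at \(2\sqrt{3}r\), not the paper’s formula.

```python
import math

def estimate_d_star(s, d0, r, d_limit):
    """Scan d = d0, sqrt(3)*d0, 3*d0, ... and return the first d_x with
    (1) d_x > 4r/sqrt(3) and (2) s(d_x) < s(sqrt(3)*d_x).
    `s` is a caller-supplied size function (Eq. 8 is not reproduced here)."""
    root3 = math.sqrt(3.0)
    d = d0
    while d <= d_limit:
        if d > 4.0 * r / root3 and s(d) < s(root3 * d):
            return d
        d *= root3
    return None  # assumption: no suitable d_x found within the limit

def max_trials(d0, d_limit):
    """Upper bound on the number of values examined: ceil(log_sqrt3(d_limit / d0))."""
    return math.ceil(math.log(d_limit / d0, math.sqrt(3.0)))

r = 1.0
toy_s = lambda d: (d - 2.0 * math.sqrt(3.0) * r) ** 2 + 1.0  # hypothetical s(d)
print(estimate_d_star(toy_s, d0=1.0, r=r, d_limit=50.0))
print(max_trials(1.0, 10.0))
```

With these made-up inputs the loop rejects \(d=1\) and \(d=\sqrt{3}\) because they do not exceed \(4r/\sqrt{3}\), and stops at \(d \approx 3\).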
5 Efficient UV-cell generation
Since generating a UV-cell is inefficient, our strategy is to avoid computing it directly. Instead, we represent a UV-cell as a set of candidate reference objects (cr-objects), which can be efficiently derived. As will be discussed in Sect. 6, cr-objects can be used to develop an approximate representation of the UV-diagram. Section 5.1 outlines the algorithm of yielding cr-objects. We explain the preparation phase of this algorithm as well as two techniques for finding these objects quickly, in Sects. 5.2 and 5.3, respectively. Section 5.4 discusses how to derive cr-objects efficiently for a group of nearby objects.
5.1 Reference objects and candidate reference objects
Recall from Algorithm 1 that the UV-cell of an object \(O_i\), that is, \(U_i\), is the result of repeatedly subtracting the outside region of other objects (i.e., \(X_i(j)\)) from its possible region, \(P_i\). In fact, not all outside regions are useful for refining \(P_i\). In particular, if the UV-edge of \(O_i\) corresponding to \(O_j\), that is, \(E_i(j)\), does not intersect with \(P_i\), then \(P_i\) cannot be shrunk by \(X_i(j)\). We call an object \(O_j\) a reference object (or r-object) of \(O_i\), if \(O_j\) defines an edge of \(O_i\)’s UV-cell. We also denote \(F_i \subseteq O\) to be the set of r-objects of \(O_i\). The set \(F_i\) contains objects whose outside regions are responsible for defining the UV-cell of \(O_i\). In Fig. 2, for example, the set of r-objects of \(O_3\), that is, \(F_3\), is \(\{O_1,O_2\}\).
Given that the r-objects for each object are known, our solution (to be shown in Sect. 6) can use r-objects to develop an alternative representation of the UV-diagram. This solution is much cheaper than Algorithm 1, which requires exact UV-cells to be computed. However, finding \(F_i\) itself is difficult because we do not know the UV-cell of \(O_i\). Our strategy is to find a small set \(C_i\) of objects, where \(F_i \subseteq C_i\). We call \(C_i\) the candidate reference objects (or cr-objects in short). We next show how \(C_i\) can be derived without acquiring the exact UV-cell of \(O_i\). In Sect. 6, we study an indexing solution based on cr-objects.
5.2 Generating a possible region
In Step 1 of Algorithm 2, we retrieve a small number of objects, called seeds, from a set of objects \(S\). These seeds are used to generate an “initial” possible region, using a routine similar to Steps 3–7 of Algorithm 1. This region is used by other pruning methods to produce cr-objects.

Step (i). We issue a \(k\)-Nearest-Neighbor Query (\(k\)NN) on \(S\), by using the center \(c_i\) of \(O_i\)’s uncertainty region as the query point. The \(k\) objects, which are not \(O_i\) and whose regions’ minimum distances from \(c_i\) are the shortest, are obtained. Since these objects are close to \(O_{i}\), we consider them to have a good chance for defining the UV-edges of \(U_{i}\). They are thus good candidates for being seeds. Note that if \(S=O\), then the R-tree on \(O\) can be used to support the \(k\)NN search.

Step (ii). Out of the \(k\) objects obtained from Step (i), we select \(k_{s}\) seeds. These objects are chosen in such a way that they are evenly spread, in order to generate a “good” possible region. In particular, we divide the domain \(D\) into \(k_s\) equally sized sectors, centered at \(c_i\). For each sector, the object closest to \(c_i\) is a seed.
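Step (ii) can be sketched as follows, assuming the \(k\) candidates from Step (i) are already available as `(center, radius)` pairs. Two details are assumptions of this sketch rather than statements from the text: a candidate is assigned to a sector by the direction of its center from \(c_i\), and “closest” is measured by the minimum distance of its region from \(c_i\), as in Step (i). The function name `select_seeds` is hypothetical.

```python
import math

def select_seeds(ci, candidates, k_s):
    """Pick at most one seed per angular sector around ci.
    candidates: dict id -> ((x, y), radius), e.g. the result of a kNN query.
    Sectors are k_s equal angular slices centered at ci (assumed layout)."""
    sector_width = 2.0 * math.pi / k_s
    best = {}  # sector index -> (minimum distance, object id)
    for oid, (center, radius) in candidates.items():
        dx, dy = center[0] - ci[0], center[1] - ci[1]
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        sector = min(int(angle / sector_width), k_s - 1)
        d_min = max(0.0, math.hypot(dx, dy) - radius)
        if sector not in best or d_min < best[sector][0]:
            best[sector] = (d_min, oid)
    return {oid for _, oid in best.values()}

# Made-up candidates around c_i = (0, 0), split into four sectors.
candidates = {
    1: ((2.0, 1.0), 0.5),
    2: ((5.0, 1.0), 0.5),   # same sector as object 1, but farther away
    3: ((-3.0, 1.0), 0.5),
    4: ((1.0, -4.0), 0.5),
}
print(select_seeds((0.0, 0.0), candidates, k_s=4))
```

A sector may contain no candidate, in which case fewer than \(k_s\) seeds are returned.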
5.3 I-pruning and C-pruning
Once the possible region has been initialized, we perform I-pruning and C-pruning (Steps 2 and 3 of Algorithm 2), in order to remove objects that cannot contribute a UV-edge to the UV-cell. Let us now examine these two steps in more detail.
Step 2: Index Level Pruning. To understand this step, let us consider an object \(O_i\), its possible region \(P_i\), and another object \(O_j\), which has not yet been considered for refining \(P_i\). Our goal is to establish the necessary and sufficient condition(s) for \(O_j\) to have an effect on the shape of \(P_i\). We first claim the following.
Lemma 3
\(P_i = P_i - X_i(j)\), if and only if for every point \(p\) inside \(P_i\), \(dist_{max}(p, O_j)> dist_{min}(p, O_i)\).
Proof
(If) For every \(p \in P_{i}\), \(p\) cannot be in \(X_{i}(j)\). If this is false, then \(O_{j}\) is always closer to \(p\) than \(O_i\), i.e., \(dist_{max}(p, O_j) \le dist_{min}(p, O_i)\) (Definition 2). This violates the condition that \(dist_{max}(p, O_j)> dist_{min}(p, O_i)\). Hence, \(p \notin X_{i}(j)\), and \(P_i = P_i - X_i(j)\).
(Only if) Suppose there exists a point \(p^{\prime }\) inside \(P_i\), such that \(dist_{max}(p^{\prime },O_j) \le dist_{min}(p^{\prime },O_i)\). Then \(O_j\) is always closer to \(p^{\prime }\) than \(O_i\), and \(O_i\) cannot be the nearest neighbor of \(p^{\prime }\). This implies that \(p^{\prime }\) must be excluded from \(P_i\) after \(O_j\) is considered. Hence, \(P_i\) cannot be equal to \(P_i - X_i(j)\), resulting in a contradiction.
Let \(b(P_i)\) be the boundary of \(P_i\). We have:
Theorem 2
\(P_i = P_i - X_i(j)\) if and only if for every point \(p \in b(P_i)\), \(dist_{max}(p, O_j) > dist_{min}(p, O_i)\).
Proof
Hence, Eq. 10 is true. Using Lemma 3, we see that \(P_i = P_i - X_i(j)\), and so the “if” part is correct.
(Only if) From Lemma 3, we know that for every point \(p \in P_{i}, dist_{max}(p, O_j)> dist_{min}(p, O_i)\). Since \(b(P_{i}) \subseteq P_{i}\), the “only if” part is correct.
Essentially, if we want to examine whether \(O_j\) has any effect on \(P_i\), it suffices to consider the points on \(P_i\)’s boundary, instead of all points in \(P_i\). We next present the following theorem, which forms the basis of Ipruning.
Theorem 3
Given an object \(O_i\) with center \(c_i\) and radius \(r_i\), let \(w\) be the maximum distance of \(P_i\) from \(c_i\). Let \(C_{out}\) be a circle, with center \(c_i\) and radius \(2w - r_i\). For another object \(O_j\), if \(c_j \notin C_{out}\), then \(P_i = P_i - X_i(j)\).
Proof
The Ipruning method uses Theorem 3 by issuing a circular range query, centered at \(c_i\) with radius \(2w - r_i\), on the dataset \(O\). This operation can be easily implemented by using the Rtree created for \(O\). The range query first uses the Rtree to filter all objects that do not overlap with the range. The remaining objects are removed if their centers are beyond the circular range. Hence, in this phase (Step 2 of Algorithm 2), a cost of \(O(n)\) is needed.
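The center test of Theorem 3 can be illustrated with a short sketch. It assumes that objects are given as (id, center, radius) tuples and replaces the Rtree range query with a linear scan, so it shows only the pruning condition, not the index access; the name `i_prune` is hypothetical.

```python
import math

def i_prune(c_i, r_i, w, objects):
    """Keep the ids of objects whose centers lie inside C_out.

    C_out is the circle centered at c_i with radius 2w - r_i, where w is
    the maximum distance of the possible region P_i from c_i.
    objects: list of (object_id, (x, y), radius) tuples (assumed layout).
    """
    radius = 2 * w - r_i
    kept = []
    for obj_id, (x, y), _r_j in objects:
        # An object whose center is outside C_out cannot reshape P_i.
        if math.hypot(x - c_i[0], y - c_i[1]) <= radius:
            kept.append(obj_id)
    return kept
```

Centers lying exactly on the circle are kept, which is the conservative choice: retaining an extra candidate costs time in later steps but never removes an object that could affect \(P_i\).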
Step 3: Computational Level Pruning.
We next discuss a method, based on distance comparison, to check whether object \(O_j\) can affect the possible region of object \(O_i\). We call this method Cpruning (Step 3 of Algorithm 2). Theorem 4, discussed below, serves as the foundation of Cpruning.
Theorem 4
Given an uncertain object \(O_i(c_i,r_i)\) and \(P_i\)’s convex hull \(CH(P_i)\), let \(v_1, v_2, \ldots , v_n\) be \(CH(P_i)\)’s vertices. If another object \(O_j\)’s center \(c_j\) is not in any of \(\{\odot (v_m, dist(v_m, c_i))\}_{m=1}^n\), then \(P_i=P_i - X_i(j)\).
Proof
For convenience, let \(\odot (p, dist(p, c_i))\) be a \(w\)bound (where \(w=dist(p, c_i)\)). We also define a set \(S\) of \(w\)bounds for every point \(p\) in \(U_i\). We now show that instead of checking all the \(w\)bounds in \(S\), it is only necessary to check those \(w\)bounds constructed for the vertices of \(CH(P_i)\). Specifically, the \(w\)bounds of the vertices must contain all other \(w\)bounds of all points on the boundary of \(CH(P_i)\). To see this, let \(w_k\) be the distance of vertex \(v_k\) from \(O_i\)’s center. We extend each vertex \(v_k\) by the distance \(w_k\) to obtain a new vertex \(v_k^{\prime }\) (black dot in Fig. 8b). These new vertices are connected to form a polygon. We use \(e_1\) and \(e_2\) to represent the \(w\)bounds \(\odot (v_1, w_1)\) and \(\odot (v_2, w_2)\), respectively.
We next show that, for any point \(v^{\prime }\) on \(CH(P_i)\)’s edge \(v_1v_2\), \(\odot (v^{\prime }, dist(v^{\prime }, c_i)) \subseteq e_1 \cup e_2\), where we let \(e^{\prime }=\odot (v^{\prime }, dist(v^{\prime }, c_i))\). We draw a line \(c_1c_1^{\prime }\), which is perpendicular to \(v_1v_2\) and \(v_1^{\prime }v_2^{\prime }\), and intersects them at points \(c_1\) and \(c_1^{\prime }\), respectively. As \(v_1v_2\) is the perpendicular bisector of \(c_ic_1^{\prime }\), we see that \(c_ic_1^{\prime }\) is the common chord of \(e_1\), \(e_2\) and \(e^{\prime }\). Since \(e_1\) or \(e_2\) is bigger than \(e^{\prime }\), \(e^{\prime }\) is contained in \(e_1\) or \(e_2\).
Hence, to check whether \(O_j\) can refine \(P_i\), we just need to check the set of \(w\)bounds \(S^{\prime }=\{\odot (v_m, dist(v_m, c_i))\}\) (where \(S^{\prime }\subseteq S\)). If \(c_j\) is located outside all \(w\)bounds in \(S^{\prime }\), then \(CH(P_i)=CH(P_i) - X_i(j)\). Finally, since \(P_i\) is completely covered by \(CH(P_i)\), \(P_i=P_i - X_i(j)\) must also be true. This completes the proof.
Step 3 of Algorithm 2 uses Theorem 4 to prune unqualified objects returned by Ipruning. This can be done efficiently, because only the vertices of \(CH(P_i)\) are used. Moreover, \(CH(P_i)\) is small, since the possible region is only derived by a small number \(k_s\) of seeds. The complexity of this phase is \(O(n)\).
We consider the objects that remain after this step as crobjects (i.e., \(C_i\)). The complexity of Algorithm 2, for generating \(C_i\)’s of \(n\) objects, is \(O(n(n+k))\).
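The Cpruning test of Theorem 4 reduces to a few distance comparisons per candidate. The sketch below assumes that the convex hull vertices of \(P_i\) have already been computed and that candidates are given as (id, center) pairs; the name `c_prune` is hypothetical and the code is an illustration, not the authors' implementation.

```python
import math

def c_prune(c_i, hull_vertices, candidates):
    """Keep candidates whose center lies in at least one w-bound.

    The w-bound of a hull vertex v is the circle centered at v with
    radius dist(v, c_i). A candidate whose center is outside every
    w-bound cannot refine P_i and is dropped.
    candidates: list of (object_id, (x, y)) pairs (assumed layout).
    """
    kept = []
    for obj_id, c_j in candidates:
        for v in hull_vertices:
            w = math.hypot(v[0] - c_i[0], v[1] - c_i[1])
            if math.hypot(v[0] - c_j[0], v[1] - c_j[1]) <= w:
                kept.append(obj_id)  # inside this w-bound: keep as cr-object
                break
    return kept
```

Each candidate costs at most one comparison per hull vertex, which is why a small hull (derived from only \(k_s\) seeds) keeps this phase cheap.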
5.4 Batch processing of crobjects
To create the UVindex, we first find out the crobjects for each of the \(n\) database objects. A simple way to do this is to run Algorithm 2 (i.e., getcrObject(\(O_i, O)\)) for all objects \(O_i \in O\), as proposed in [17]. However, this involves running getcrObject \(n\) times and can be quite costly. We now present a Batch Processing algorithm (or BP in short), where the crobjects of a group of objects are considered together. As we will show, this new algorithm allows the effort of deriving an object’s crobjects to be shared by others, and consequently removes much of the crobject generation overhead.
Observe that if an object \(O_i\) is near \(O_j\), then their UVcells should be similar. The crobject set of \(O_i\), that is, \(C_i\), can then be similar to \(C_j\). The BP makes use of this principle; it employs \(C_i\) to derive \(C_j\), instead of generating \(C_i\) and \(C_j\) independently. Let \(G\) be a set of objects that are physically close to each other. The BP first computes a set of objects \(C_G\), a superset of \(C_i\), for every \(O_i \in G\). The crobjects of objects in \(G\) are then extracted from \(C_G\). Usually, \(C_G\) is smaller than the database \(O\), and thus retrieving crobjects from \(C_G\) is faster than from \(O\).
Step 2 invokes a slightly modified version of getcrObject to obtain a crobject set \(C_G\) of \(O_G\). Particularly, in Step (i) of initPossibleRegion, the \(k\)NN search skips all objects in \(G\). Notice that initPossibleRegion computes the possible region of an object. In Step (i) of that procedure, we obtain the seeds – objects that are useful for generating a UVcell. In Algorithm 3, the input of getcrObject is \(O_G\), whose uncertainty region includes the uncertainty regions of all objects in \(G\). Therefore, the uncertainty region of any object \(O_i \in G\) overlaps with that of \(O_G\). More importantly, \(O_i \in G\) is not useful for finding possible regions of \(O_G\), because \(O_i\) does not create any UVedge for \(O_G\)’s UVcell. We next claim the following:
Theorem 5
Given an object \(O_j\), if \(O_j \notin C_G\) after Step 2 of Algorithm 3, then \(O_j \notin F_i\), where \(O_i \in G\).
That is to say, any object not contained in \(C_G\) cannot be an robject of \(O_i \in G\). In other words, \(C_G\) is a superset of robjects for all the objects in \(G\). The proof of this theorem, which is quite complex, is detailed in Sect. 5.5. Notice that all objects in \(G\) are included in \(C_G\) after the execution of Step 2. This is because in the last step of getcrObject (Algorithm 2), objects whose centers are located in the cpruning bound of \(O_G\) are treated as crobjects. Since the center of an object in \(G\) is inside \(O_G\)’s cpruning bound, it must also be a crobject of \(O_G\). Thus, \(G \subseteq C_G\).
Steps 3–6 use \(C_G\) to generate crobjects for each object \(O_i \in G\). From Theorem 5, we know that an object in \(C_G\) may be an robject of \(O_i\). Thus, objects in \(C_G\) can be considered as good candidates for generating an initial possible region, \(P_i\) for \(O_i\). We thus pass \(C_G\) to initPossibleRegion and get \(P_i\) (Step 4). We then execute cPrune on \(C_G\) to retrieve \(C_i\) (Step 5). These two steps are repeated for all objects in \(G\), until we obtain their crobjects.^{3}
The LP algorithm. We now discuss a way to use Algorithm 3 to construct crobjects for \(O\). The LeafNode Processing, or LP, performs a preorder traversal of the Rtree that indexes \(O\). When a leaf node, say \(N\), is reached, BP is invoked on all objects stored in \(N\), in order to compute their crobjects. The algorithm terminates when all leaf nodes have been exhausted.
The LP can generate crobjects for \(O\) quickly. This is because when the BP is called, it always uses the objects located in a leaf node. In an Rtree, the leaf node consists of a set \(G\) of objects, which are physically close to each other. Recall that the object created in Step 1 of BP (i.e., \(O_G\)) is the MBC of the uncertainty regions of objects in \(G\). Thus, the size of \(O_G\) would not be very different from those of the objects in \(G\). Consequently, the set \(C_G\) derived from Step 2 (getcrObject) should also be similar to the robjects of \(G\)’s objects. In our experiments, \(C_G\) is much smaller than \(O\). Hence, Step 4 can be carried out more efficiently than if initPossibleRegion is carried out on \(O\) for every object.
We have introduced an efficient construction method to derive the crobject set \(C_i\) for \(O_i\). We have also explained how to obtain crobjects for \(O\) quickly. One may consider using \(C_i\) to generate the exact UVcell of \(O_i\). However, our experiments show that \(C_i\) may be large, and so generating the UVcell of \(O_i\) can still be costly. In Sect. 6, we show how to use \(C_i\) directly to construct an index for the UVdiagram. In the rest of this section, we present the proof of Theorem 5.
5.5 Proof of Theorem 5
Recall that \(O_G\) is formed by a set \(G\) of objects, using Step 1 of Algorithm 3. Let \(P_i(S)\) be a possible region of an object \(O_i\), constructed by using a set \(S \subseteq O\) of objects. Essentially, \(P_i(S)\) is the intersection of the inside regions \(\overline{X_i(k)}\), where \(O_k \in S\). Let \(u_i\) be the uncertainty region of \(O_i\). We first claim the following.
Lemma 4
Given a set \(S\) of objects, where \(S \subseteq O\), for any objects \(O_i\) and \(O_k\), if \(u_i \subseteq u_k\), then \(P_i(S) \subseteq P_k(S)\).
Lemma 5
Lemma 6
Given two objects \(O_i\) and \(O_k\), where \(u_i \subseteq u_k\), and an object \(O_j\) where \(c_j \notin P_k(S)\), if \(P_k(S) = P_k(S) - X_k(j)\), then \(P_i(S) = P_i(S) - X_i(j)\).
As shown in Fig. 9, Lemma 6 claims that given an object \(O_j\) whose center is outside \(P_k(S)\), if the edge \(E_k(j)\) does not affect the possible region \(P_k(S)\), then \(E_i(j)\) cannot contribute to \(P_i(S)\).
Proof

Case 1: \(O_j\) contributes an edge to \(P_i(V)\). In other words, \(O_j \in V\). Since an object in \(V\) is not pruned by Step 2 of Algorithm 3, \(V \subseteq C_G\), and so \(O_j \in C_G\). However, this contradicts the assumption that \(O_j \notin C_G\), and so this case cannot occur.

Case 2: \(O_j\) does not contribute an edge to \(P_i(V)\). Since the UVcell \(U_i\) of \(O_i\) must be inside \(P_i(V)\), \(O_j\) cannot contribute an edge to \(U_i\). Hence, \(O_j\) is not an robject of \(O_i\), and the theorem holds.
6 The UVindex
We now present the UVindex, an approximate version of the UVdiagram. The UVindex can be efficiently computed and stored. It also facilitates efficient query evaluation. Section 6.1 gives an overview of its structure. In Sect. 6.2, we discuss how to use this index to support execution of different queries. We explain its construction process in Sect. 6.3.
6.1 Structure of the UVindex

ID is the identity of object \(O_i\) whose UVcell may overlap with the region covered by \(l\);

MBC is the circle that minimally bounds the uncertainty region of \(O_i\); and

pointer stores the disk page address of the object.
We assume that all nonleaf nodes are stored in the main memory, and allocate a maximum number of \(M\) nonleaf nodes. The leaf nodes, which contain the lists of pages, are stored in the disk. Hence, \(M\) controls the amount of main memory to be used to implement the index. Next, we study how to use it to support query evaluation.
6.2 Using the UVindex
We now explain how to use the UVindex to support the queries that we described in Sect. 3.2.
1. The PNN Query. To find the probabilistic nearest neighbors of \(q\), we first locate the leaf node \(l\), whose region contains \(q\). This can be done easily by finding the grid that contains \(q\) in each index level, and traversing the index. We then retrieve the disk pages associated with \(l\), which contain the ID and the MBC values of the objects stored in these pages. Since these objects may have their UVcells overlap with the region of \(l\), it is possible that \(q\) is contained in their UVcells. Let \(L\) be the set of objects associated with \(l\), and \(A\) be the answer objects of \(q\). To answer a PNN, we need to retrieve \(A\) from \(L\), where \(A \subseteq L\). We use the method described in [15]: from the set of the MBCs of the objects in \(L\), find \(d_{minmax}\), the minimum of the maximum distances of these objects from \(q\). Any object whose minimum distance is larger than \(d_{minmax}\) is removed, since it cannot have a nonzero qualification probability. For objects that are not filtered, their probabilities are computed and returned as answers.
2. The CPNN Query maintains the PNN answers for a “moving” query point, whose location is periodically reported to the server. Let \(q_0\) be the latest position of \(q\) received by the server. Let \(g_0\) be a leaf node in the UVindex, whose region \(r_0\) contains \(q_0\). We assume that the objects stored in the disk pages associated with \(g_0\) are known. Now, suppose the new location of \(q\), say, \(q_1\), is received by the server. A straightforward solution is to treat \(q_1\) as a new PNN query, and use the PNN algorithm described above to compute the answers of \(q_1\). A better way is to check whether \(q_1\) is inside \(r_0\). If this is true, we simply use the object set associated with \(g_0\) to compute the answer for \(q_1\). This saves the effort of traversing the UVindex for \(q_1\).
3. The UVPartition Query. We append a counter to each leaf node, and record the number of objects at that node. This process could be done after the UVindex is constructed. Then, a range query with range \(R\) is issued over the index, in order to find the leaf nodes whose regions overlap with \(R\). For every leaf node whose region \(r\) overlaps with \(R\), we compute its density, which is equal to the number of objects associated with \(r\), divided by the area of \(r\). The query then outputs \(r\) and its density value.
4. The UVcell Query. Notice that if an object \(O_i\) appears in a leaf node \(g\), its UVcell overlaps with the region of \(g\). Hence, we can return the approximate area and the extent of \(O_i\)’s UVcell by scanning the leaf nodes associated with \(O_i\), and then summing up the total area of the regions covered by these nodes. This step can be improved by precomputing and storing the area information. For example, we can scan all the leafnodes once, and generate a table for each \(O_i\) with its respective areas. A similar procedure can be used to support the operation of displaying the approximate shape of the UVcell on the user’s screen.
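The filtering step of the PNN query and the leaf reuse of the CPNN query can be sketched together. The code below is an illustration under stated assumptions: MBCs are (id, center, radius) tuples, a leaf region is an axis-aligned rectangle (xmin, ymin, xmax, ymax), and `locate_leaf` is a hypothetical callback that stands in for traversing the UVindex. Probability computation is omitted.

```python
import math

def pnn_filter(q, objects):
    """Drop objects whose minimum distance from q exceeds d_minmax.

    objects: non-empty list of (object_id, (x, y), radius) MBCs taken
    from the leaf node that contains q.
    """
    bounds = []
    for obj_id, (x, y), r in objects:
        d = math.hypot(x - q[0], y - q[1])
        bounds.append((obj_id, max(d - r, 0.0), d + r))  # (id, d_min, d_max)
    d_minmax = min(d_max for _, _, d_max in bounds)
    return [obj_id for obj_id, d_min, _ in bounds if d_min <= d_minmax]

class CPNNSession:
    """Reuse the last leaf's object set while q stays inside its region."""

    def __init__(self, locate_leaf):
        self.locate_leaf = locate_leaf  # hypothetical: q -> (region, objects)
        self.region = None
        self.objects = None
        self.traversals = 0  # number of index traversals performed so far

    def update(self, q):
        if self.region is None or not self._contains(q):
            self.region, self.objects = self.locate_leaf(q)
            self.traversals += 1
        return pnn_filter(q, self.objects)

    def _contains(self, q):
        xmin, ymin, xmax, ymax = self.region
        return xmin <= q[0] < xmax and ymin <= q[1] < ymax
```

In this sketch the saving of the CPNN approach shows up in the `traversals` counter, which grows only when a reported location leaves the cached leaf region.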
6.3 Construction of the UVindex
As discussed in Sect. 5, a UVcell can be represented by a set of crobjects, \(C_i\). We now examine how this facilitates the construction of the UVindex.
Framework. Let \(g\) be the grid node being examined, and \(h_k\) (where \(k=1,\ldots ,4\)) be the four child nodes of \(g\). We define a variable nonleafnum, which indicates the number of nonleaf nodes allocated to the index and has an initial value of 1. Originally, the root of the grid is a leaf node, whose region covered (root.region) is the domain \(D\).
1. NORMAL (Steps 9–11): \(g\)’s pages still have space left, and so (\(i\), \(MBC_i\), \(ptr(O_i)\)) is inserted to \(g\)’s page, where \(ptr(O_i)\) is the pointer to \(O_i\)’s uncertainty region and pdf.
2. OVERFLOW (Steps 12–15): \(g\)’s pages are full, and a new disk page has to be associated with \(g\), before the information about \(O_i\) is inserted to the new page.
3. SPLIT (Steps 16–22): \(g\)’s pages are full. The page list of \(g\) is removed. Then, \(g\) becomes the parent of four nodes (\(h_k\)), which have been previously generated by CheckSplit. The region of each child node \(h_k\) covers one of the four quarters of the region defined for \(g\). Also, nonleafnum is incremented by 1. Essentially, the information about the UVcells previously associated with \(g\) is now represented by its child nodes, and \(g\) becomes a nonleaf node.
In fact, splitting is not always useful. Suppose that \(g\).region is associated with 100 UVcells. Moreover, \(g\).region is completely covered by each of these UVcells. Then, it is not necessary to redistribute \(g\) into four child nodes. If splitting is performed in this case, then the UVcells associated with each child node are exactly the same.
Lemma 7
If region \(r\) is totally covered by \(X_i(k)\), where \(O_k \in C_i\), then \(r\) must not overlap with the UVcell \(U_i\).
Proof
To check whether a region \(r\) is contained in \(X_i(j)\) (Step 2), a simple way is to generate and test with the UVedge \(E_i(j)\). This can be avoided, by carrying out a simple 4point test. Observe that \(r\) is a square, and the UVedge of \(O_i\) with respect to \(O_j\) is concave in shape. If all its four corner points are confirmed to be in \(X_i(j)\), we conclude that \(r \subseteq X_i(j)\). Figure 11b shows that the region of \(g_1\) must not overlap with \(U_i\), since all four corners of \(g_1\) are located on the outside region of one of the UVedges. Checking whether a point is in \(X_i(j)\) is easy, because we can simply check whether the point’s minimum distance from \(O_i\) is larger than its maximum distance from \(O_j\). We thus use the fourpoint test in Step 2.
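The corner check described above can be written in a few lines. This is a sketch that assumes circular uncertainty regions given as (center, radius) pairs and a square given as (xmin, ymin, xmax, ymax); the function names are hypothetical.

```python
import math

def in_outside_region(p, o_i, o_j):
    """True if p is in X_i(j), i.e. dist_min(p, O_i) > dist_max(p, O_j)."""
    (c_i, r_i), (c_j, r_j) = o_i, o_j
    d_min_i = max(math.hypot(p[0] - c_i[0], p[1] - c_i[1]) - r_i, 0.0)
    d_max_j = math.hypot(p[0] - c_j[0], p[1] - c_j[1]) + r_j
    return d_min_i > d_max_j

def four_point_test(square, o_i, o_j):
    """True if all four corners of the square lie in X_i(j)."""
    xmin, ymin, xmax, ymax = square
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return all(in_outside_region(c, o_i, o_j) for c in corners)
```

When the test returns True for some \(O_j \in C_i\), the region can be discarded for \(O_i\) without constructing the UVedge \(E_i(j)\).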
Notice that Algorithm 6 may incorrectly judge that \(U_i\) overlaps with \(r\). Figure 11b shows that \(U_i\) does not overlap with the region of grid \(g_2\). However, some corners of \(g_2\).region are not contained in the outside regions of two of the UVedges of \(U_i\). If this is true for all UVedges of \(U_i\), then \(U_i\) would be decided to be associated with \(g_2\). If this happens, then during query evaluation, \(O_i\) will be retrieved from \(g_2\). This increases the execution time since \(O_i\) is not in \(g_2\). However, query accuracy is not affected, since we can still detect that \(O_i\) is not a nearest neighbor of \(q\). In our experiments, this situation is rare, and does not have a significant effect on query evaluation time.
Since \(|C_i|=O(n)\), Algorithm 6 needs \(O(n)\) time to complete. Algorithm 5 uses \(O(n^2)\) time, mainly for performing splitting and overlap checking with four child nodes. For Algorithm 4, each UVcell, in the worst case, needs to perform overlap and split tests with \(M\) nonleaf nodes. Hence, its total time complexity is \(O(Mn^2)\). The index has a maximum height of \(M/4\) if the data distribution is very skewed and splitting always happens in one single quadrant. However, all nonleaf nodes, each 16 bytes long, can be stored in the main memory. Thus, the tree height has little effect on query performance.
7 Results
We now report the results. Section 7.1 describes the experiment settings. In Sect. 7.2, we discuss the results about query performance. Section 7.4 presents the results about UVindex construction.
7.1 Setup
We use both synthetic and real datasets in our experiments. For synthetic data, we use Theodoridis et al.’s data generator^{5} to obtain 20, 40, 60, 80, and 100K objects, which are uniformly distributed in a 10K \(\times \) 10K space. Each object has a circular uncertainty region with a diameter of 40 units, and a Gaussian uncertainty pdf. For each uncertainty pdf, its mean is the center of the circle, and its variance is the square of onesixth of the uncertainty region’s diameter. We represent an uncertainty pdf as 16 histogram bars, where a histogram bar records the probability that the object is in that area. We also use three real datasets of geographical objects in Germany, namely utility, roads, and rrlines, with respective sizes 17, 30, and 36K. We also test the Long Beach (or LB) dataset, which contains 53K objects.^{6} These objects are represented as circles before indexing, and have the same uncertainty pdf information as that of the synthetic data.
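As a quick check of the pdf setting (a worked calculation from the numbers stated above, not an additional result), a diameter of 40 units gives the following standard deviation \(\sigma\) and variance, so that the region’s radius of 20 units corresponds to \(3\sigma\):

\[ \sigma = \frac{40}{6} \approx 6.67, \qquad \sigma^2 = \left(\frac{40}{6}\right)^2 \approx 44.4, \qquad 3\sigma = 20. \]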
To compare with Rtree, we use a packed R*tree [30] to index uncertain objects. The Rtree uses 4Kbyte disk pages, and has a fanout of 100. We keep all its nonleaf nodes in the main memory.
For the UVindex, each nonleaf node has four 4byte pointers to its children. We set \(M\), the number of nonleaf nodes in the main memory, to be 10,000. The leaf nodes of both indexes, as well as the uncertainty information about the objects, are stored in the disk.
For \(T_\theta \), the splitting threshold used in constructing the UVindex, we have performed a sensitivity test. Under a wide range of \(T_\theta \), the indexes only have a slight performance difference. For very small values of \(T_\theta \) (e.g., 0.2), however, the adaptive grid tends not to split, and degrades into long linked lists of pages. Here, we set \(T_\theta \) to be 1. We wrote the programs in C++ and tested them on a Core2 Duo 2.66 GHz PC.
7.2 Results on query evaluation
We first study the performance of the queries studied in Sect. 3.2. We assume that the LP algorithm, presented in Sect. 5.4, is used to generate the UVindex. However, as we will discuss later, the different UVindex construction methods described here have little effect on query performance.
To understand why our method performs better, let us examine the traversal time of the UVindex, which is composed of the time costs for visiting nonleaf and leaf nodes. Since nonleaf traversal takes little time in all experiments (up to 3.9 \(\mu \)s), we only present the I/O overhead. In Fig. 12b, we compare the I/O performance of the UVindex and the Rtree. The UVindex requires significantly fewer I/Os than the Rtree (e.g., when \(O=60\)K, the UVindex consumes about onefifth of the I/Os needed by the Rtree). When the Rtree is used to process a PNN query, many leaf nodes need to be retrieved. For the UVindex, we only need to look for the leaf node that contains the query point. Since the number of disk pages for each leaf node is also small, a high I/O performance can be attained. Also notice that the number of I/Os for the Rtree increases with \(O\), whereas that of the UVindex is relatively stable.
Results on real datasets

| Dataset | \(O\) (K) | \(T_q\) (UV) (ms) | \(T_q\) (Rtree) (ms) | \(T_c\) (s) | \(p_c\) (%) |
| --- | --- | --- | --- | --- | --- |
| utility | 17 | 89 | 141 | 569 | 97.45 |
| roads | 30 | 82 | 135 | 1,195 | 97.80 |
| rrlines | 36 | 107 | 159 | 1,340 | 98.30 |
| LB | 53 | 109 | 173 | 1,579 | 98.22 |

2. The UVPartition and the UVcell Queries. We now examine the efficiency of our index for answering the UVpartition query on our synthetic dataset. For each size of the query region \(R\), 50 queries are generated, with the centers of \(R\) uniformly distributed in the data domain. We can see from Fig. 12e that the retrieval time of UVpartitions (\(T_q\)) increases with the size of \(R\), since more UVpartitions are loaded when \(R\) becomes larger. The increase is almost linear, and the query evaluation time is less than 160 ms. We have also examined the performance of the UVcell queries on the default synthetic dataset. On average, the time for obtaining a UVcell from the UVindex is 58.46 ms, or equivalently, 4.62 I/Os. Thus, running a UVcell query costs little time in our experiments.
3. The CPNN query. To generate a CPNN query, we use the CanuMobiSim simulator,^{8} which produces a movingpoint trajectory. The movement of a query point follows a random walk model, as detailed in [34]. The location of a query point, which changes at a maximum speed of 100 units per second, is reported every second. The default “trajectory length” of a query is 60, that is, each query has 60 location reports. In our experiments, each data point is the average of 50 queries.
7.3 Storage cost analysis
7.4 Results on UVIndex Construction

Basic: a UVcell is derived using Algorithm 1, which is then used to build the UVindex;

ICR (I and Cpruning with Refinement): collect crobjects through I and Cpruning (Algorithm 2), compute UVcells and obtain the robjects, then index them with Algorithm 4.

IC (I and Cpruning): the crobjects obtained through I and Cpruning are used directly to construct the UVindex by Algorithm 4.
IC versus ICR. As shown in Fig. 15c, IC performs much better than ICR. For example, at \(O=80\)K, the construction time of IC is about 10 % of that of ICR. To understand why, we analyze their time components in Fig. 15d, e. Recall that the difference between the two methods is that ICR needs to find out the exact robjects (by constructing an exact UVcell based on the objects returned by pruning), while IC does not. For ICR, Fig. 15d shows the fraction of the construction time spent on: (1) seed selection, (2) initial possible region computation, (3) I and Cpruning, (4) generating robjects, and (5) indexing UVcells. For most datasets, ICR spends most of the time generating exact robjects, which is very costly. For IC, robjects are not produced (Fig. 15e). Instead, the crobjects produced by the pruning methods are immediately passed to Algorithm 4 for indexing. The number of crobjects generated, while larger than that of robjects, does not increase the indexing time significantly.
In Fig. 15f, the construction time of ICR increases sharply with the objects’ uncertainty region sizes. With larger uncertainty regions, it is more likely that these regions overlap with each other, making it harder to prune the objects, so that more time is needed to generate robjects. On the other hand, IC is relatively insensitive to the change in uncertainty region sizes.
We have also compared the query times of the indexes created by IC and ICR. Figure 15g shows that the UVindexes generated by the two methods are highly similar, resulting in a close query performance. The query cost of ICR is about 0.01 I/Os, or 0.13 ms, better than IC. In the sequel, we assume that IC is used.
To understand why Model performs well, we compare the difference between the size \(S\) of a UVcell estimated by our model, and its “true” size. Again, \(S\) is the length of the MBR that tightly bounds the estimated UVcell. The true size of the UVcell can be obtained by using Algorithm 1. Based on the vertices of this UVcell, we obtain its minimum bounding rectangle (MBR). We use the larger length of the two dimensions of this MBR to represent the size of the UVcell. Figure 16c shows the average size of a UVcell under different uncertainty region sizes. The UVcell size increases with the uncertainty region radius, since an object can be in more possible locations. This increases its chance to be a possible nearest neighbor of a query point. In this experiment, our method offers a reasonable estimation of the UVcell’s size—the estimation error is between 4 and 12 %. This enables the selection of seeds, as well as the index construction algorithm, to be effective.
We also test the performance of single and LP on larger datasets. We use the same synthetic data generator to produce two datasets that contain 0.5M and 1M objects. The 1M dataset occupies 640 Mbytes. The new result, illustrated in Fig. 17b, shows that the construction time of both single and LP increases linearly with the dataset size. For the 1M dataset, LP needs 7.7 h, which is 23 % faster than single.
Figure 17c shows that when LP is used, the seed selection time of single is shortened by more than 80 %. While single generates seeds for every object individually, in LP, the seeds of every object in set \(G\) are retrieved from a set of objects \(C_G\) (Step 2 of Algorithm 3). We can also see that the I and Cpruning time required by LP is less than that of single; when \(O=60k\), the improvement is over 60 %. In single, Ipruning is done for every object; in LP, Ipruning is only done once for every group. The performance gap is more pronounced when \(O\) is large, since the same domain is populated with more objects, resulting in more candidates retrieved after Ipruning.
We also examine the effect of the average uncertainty region size on the construction time. As discussed before, the larger this size, the more construction time will be needed. Figure 17d shows that LP is more stable than single. When the uncertainty region size is 60, LP needs about 60 % of the time of single; when the size becomes 100, LP is 3.5 times faster than single. In Fig. 17e, we compare the query performance of the UVindices generated by single and LP. We observe that the number of I/Os required by the two methods is the same. Their probability computation times, not shown here, are also very close. Hence, the query performance of the two methods is almost the same.
Next, we compare the construction time of the Rtree and UVindex, using single and LP. Figure 17f shows that the construction cost of the Rtree is less than 1 % of that of the UVindex. Hence, the Rtree introduces little overhead to the UVindex construction process. However, it improves the performance of generating the UVindex. For instance, the Ipruning phase can be executed more efficiently with the use of the Rtree.
For real datasets, LP also outperforms single (Fig. 17g). In rrlines, for example, LP needs one third of the time required by single. LP also achieves a high pruning ratio, as shown in Table 2. This explains why LP requires less time to construct a UVindex, compared with single.
Finally, we examine how a skewed distribution of the centers of uncertainty regions can affect our results. We obtain a 60k dataset that follows the zipfian distribution, by using the same generator that produces our uniformly distributed dataset. For the zipfian distribution, the average query I/O costs for IC and ICR are 2.48 and 2.41. Thus, the query performance of ICR is 0.07 I/Os (or 2.8 %) better than IC. Since their time difference is small (around 0.4 ms), we use IC in the rest of the experiments.
Results on zipfian distribution

| \(O=60k\) | Uniform, \(LP\) | Zipfian, \(Single\) | Zipfian, \(LP\) |
| --- | --- | --- | --- |
| \(T_c\) (hours) | 0.45 | 5.78 | 2.46 |
| \(T_q\) (I/Os) | 2.00 | 2.48 | 2.45 |

In the same table, we study the difference between single and LP for the zipfian distribution. Notice that LP requires about 42 % of the time needed by single (2.46 versus 5.78 h). This means that our batch processing method improves the construction performance for the zipfian distribution significantly. The query performance of the UVindex constructed by LP is also slightly better (0.03 I/Os) than that of single.
8 Conclusions
The UVdiagram is a variant of the Voronoi diagram designed for uncertain data. To tackle the complexity of constructing and evaluating a UVdiagram, we introduce the concept of UVcells and crobjects. We study the theoretical size of a UVcell. We propose an adaptive index for the UVdiagram, and develop efficient algorithms for building it. We also present a batch processing algorithm to further reduce the UVindex construction time. Our experiments show that this index efficiently supports PNNs and other UVdiagramrelated queries.
We plan to study the use of the UVdiagram to support other variants of probabilistic NNQs, for example, approximate NNQs [12, 13]; monochromatic and bichromatic reversenearestneighbor (RNN) queries [10, 27, 42]; and \(k\)RNN queries [11]. Another interesting problem is to design a UVdiagram such that whenever a query point is located in a UVcell \(U_i\), we can know that the qualification probability of \(O_i\) is larger than some threshold \(T\). By using this variant of UVdiagram, we can get all the objects with qualification probability larger than \(T\), without computing their actual probabilities. This could be beneficial to queries where a user is only interested in answers with qualification probabilities larger than \(T\). It is also interesting to examine how the UVdiagram can support multidimensional data and incremental updates.
The centers of uncertainty regions form the vertices of \(n\) hexagons, each of which has an area of \(\frac{\sqrt{3}d_0^2}{2}\). Since \(D = n \times \frac{\sqrt{3}d_0^2}{2}\), \(d_0 = \sqrt{\frac{2D}{\sqrt{3}n}}\).
As shown in Fig. 2, a UVcell can be irregular in shape, and so estimating its size is not easy. Thus, we use a simple data model here. We will also explain how these results can be applied to uniformly distributed data, in Sect. 5.2.
We do not execute iPrune \((P_i,O)\) after Step 4 because the set of objects returned by iPrune is often the superset of \(C_G\) in our experiments. Thus, iPrune is not very effective here.
We adopt the quadtree rather than the Rtree. While Rtree MBRs may overlap, quadtree grids do not. Issuing a point query on nonoverlapping UVpartitions in a quadtree is thus more convenient than in an Rtree.
Acknowledgments
Reynold Cheng, Xike Xie, Liwen Sun, and Jinchuan Chen were supported by the Research Grants Council of Hong Kong (GRF Projects 711110, 711309E, 513508). We would like to thank the anonymous reviewers for their insightful comments.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.