1 Introduction

This study proposes an efficient k-flexible aggregate nearest neighbor (FANN) search algorithm \((k \ge 1)\). The FANN search extends the aggregate nearest neighbor (ANN) search, which in turn extends the traditional nearest neighbor (NN) search. The NN search, which finds the object closest to a given query object q among the objects in a dataset \(\mathcal {D}\), is a fundamental problem in applications across many domains [2, 6, 16, 30]. The ANN search [15, 25, 34] extends the NN search by introducing a query set Q of \(M ~ (\ge 1)\) query objects \(q_j ~ (0 \le j < M)\) and finds the object \(p^*\) that satisfies the following Eq. (1):

$$\begin{aligned} p^* = {\mathop {\mathrm{argmin}}\limits _{p_i \in \mathcal {D}}} \left\{ \mathcal {G}\left\{ d(p_i, q_j), q_j \in Q \right\} \right\} , \end{aligned}$$
(1)

where \(\mathcal {G}\) denotes an aggregate function such as max and sum, and d() denotes the distance between two objects. A typical application of the ANN search is finding an optimal meeting place for M members.

The FANN search [21, 22, 33], which extends the ANN search by introducing a flexibility factor \(\phi ~ (0 < \phi \le 1)\), finds the object \(p^*\) that satisfies the following Eq. (2):

$$\begin{aligned} p^* = {\mathop {\mathrm{argmin}}\limits _{p_i \in \mathcal {D}}} \left\{ \mathcal {G}\left\{ d(p_i, q_j), q_j \in Q_\phi \right\} \right\} , \end{aligned}$$
(2)

where \(Q_\phi\) denotes any subset of Q of size \(\phi M\). An example of the FANN search is finding an optimal place for a meeting attended by only \(\phi M\) members, i.e., a minimum quorum of the M members. The FANN search cannot be solved simply by running an ANN search algorithm for every possible \(Q_\phi\): for M = 256 and \(\phi\) = 0.5, the ANN search would have to be performed as many as \(5.769 \times 10^{75}\) times. In this study, the target of the FANN search is the objects in a points-of-interest (POI) set \(P ~ (\subseteq \mathcal {D})\), e.g., hospitals and restaurants, instead of the whole dataset \(\mathcal {D}\).
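The figure \(5.769 \times 10^{75}\) above is simply the binomial coefficient \(\binom{256}{128}\), the number of distinct subsets \(Q_\phi\). A quick sketch to verify it (variable names are ours, not from the paper):

```python
from math import comb

# Number of distinct query subsets Q_phi of size phi*M drawn from M query objects.
M, phi = 256, 0.5
subset_count = comb(M, int(phi * M))  # C(256, 128)

print(f"{subset_count:.3e}")  # ~5.769e+75, matching the figure in the text
```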

The ANN and FANN searches can be used in various applications [15, 21, 22, 33, 34]. Consider a meeting of M members \(q_0, \dots , q_{M-1}\). If we want the meeting to convene as early as possible, we search for the place \(p^*~(\in P)\) that satisfies Eq. (1); if we want to minimize the total cost of all members gathering at the meeting place, we search with \(\mathcal {G}\) = sum. If a meeting of only \(\phi M\) members suffices instead of all M members, we search for the \(p^*\) that satisfies Eq. (2) instead of Eq. (1). The ANN and FANN searches are also applicable to multimedia databases [21, 22]: to find objects (e.g., images) similar to all of M given objects, we can perform an ANN search. However, if some of the M given objects differ markedly from the rest, the ANN search may return unexpected results; in this case, accurate results can be obtained by performing a FANN search with an appropriately adjusted \(\phi\) that excludes such outliers. Finally, the ANN and FANN searches can be used for patent analysis [1, 10, 28, 29]: rather than searching for each patent individually, a more comprehensive and efficient analysis is possible by searching for multiple related cases simultaneously.

The existing ANN and FANN search algorithms have been studied separately for Euclidean spaces and for road networks. A road network is represented as a graph, and the distance between two objects is defined as the shortest-path distance between them [2, 33, 34]. Since computing a shortest-path distance is far more expensive than computing a Euclidean distance [12, 17, 37], ANN and FANN search algorithms in road networks should minimize the number of shortest-path distance calculations. Yao et al. [33] proposed several exact FANN search algorithms for road networks, among which the IER-kNN algorithm showed the highest performance. IER-kNN uses the R-tree [24] and prunes the nodes that are unlikely to include the final result objects, thereby avoiding shortest-path distance calculations for the objects in the pruned nodes. Nevertheless, because its pruning decisions are based on Euclidean distances, which can differ significantly from the actual shortest-path distances, IER-kNN accesses many unnecessary nodes and performs many shortest-path distance calculations for the objects they contain.

This study proposes an efficient exact k-FANN search algorithm using the M-tree [13] and proves that the proposed algorithm does not cause any false drop. While the R-tree is an index structure for objects in a Euclidean space, the M-tree is constructed for a dataset in a metric space, where a distance function between objects is given instead of their actual coordinates. A road network can be mapped into a metric space [15, 33], and the M-tree is constructed using the actual shortest-path distances between objects in the road network. Therefore, our algorithm can prune index nodes more accurately than the state-of-the-art IER-kNN algorithm and can dramatically reduce the number of shortest-path distance calculations. To the best of our knowledge, our algorithm is the first exact FANN algorithm that uses the M-tree. We compare the performance of our algorithm with that of IER-kNN using various real road network datasets. The experimental results demonstrate that our algorithm consistently outperformed IER-kNN for all datasets and parameters, with a performance improvement of up to 6.92 times.

This paper is organized as follows. Section 2 briefly explains the structure of the M-tree and the existing FANN search algorithms. Section 3 describes our algorithm in detail. Section 4 compares the search performance for various real road network datasets and parameters. Finally, Sect. 5 concludes this study.

2 Related work

In this section, we review previous NN, ANN, and FANN algorithms. With the spread of ubiquitous mobile devices, the demand for efficient k-NN search in road networks has increased. Abeywickrama et al. [2] evaluated the performance of various existing k-NN algorithms, including Incremental Network Expansion (INE) [27], Incremental Euclidean Restriction (IER) [27], Route Overlay and Association Directory (ROAD) [18], and G-tree [36]. Their experiments on synthetic and real road network datasets demonstrated that the best performance was achieved by combining the previously neglected IER algorithm with the pruned highway labeling (PHL) algorithm [4, 5]. Shaw et al. [30] presented an approximate k-NN algorithm using Road Network Embedding (RNE), which maps objects on a road network into a p-dimensional Euclidean space. Their algorithm stores the mapped objects in the M-tree and showed search performance superior to the existing INE algorithm [27].

Gao et al. [14] dealt with the reverse k-NN (RkNN) problem in road networks. They presented an algorithm based on a heuristic filter-and-refinement framework that simultaneously considers spatial and textual information and demonstrated its efficiency using synthetic and real datasets. Zhao et al. [35] dealt with the diversified top-k geo-social keyword (DkGSK) query, which considers spatial, social, and textual constraints between the query and data objects; to enhance result quality, they considered not only the relevance but also the diversity of the query results. They showed that the problem is NP-hard and proposed an exact algorithm based on several heuristics as well as an approximate algorithm, whose efficiency was demonstrated using real datasets.

Ouyang et al. [26] proposed a parameter-free tree index for k-NN queries in road networks, whose size is bounded by O(nh), where n is the number of vertices and h is the height of the tree. Using the index, they proposed a progressive search algorithm that returns the i-th NN within O(ih) time \((1 \le i \le k)\) and showed that it outperformed two existing state-of-the-art algorithms. Barrientos et al. [8] proposed a k-NN algorithm for a multi-node/multi-GPU environment and achieved a speed-up of up to 1,843 times over the existing state-of-the-art GPU-based algorithm on a 5-node/20-GPU platform.

Ioup et al. [15] proposed an ANN search algorithm for road networks using the M-tree [13]. However, this algorithm returns only an approximate result, and the error ratio of the result is unknown. Miao et al. [25] dealt with the continuous k-ANN (CAkNN) problem in dynamic road networks, where the locations of data and query objects and the edge weights change over time. They defined a partial distance matrix data structure that contains only the data objects closer than a safe distance r to each query object, where r is not greater than the aggregate distance of the k-th candidate ANN object. Through experiments using real road network datasets, they showed that their algorithm is superior to the existing algorithm, which assumes static query objects.

Abeywickrama et al. [3] introduced the Subgraph-Landmark Tree (SL-Tree) and the Compacted Object-Landmark Tree (COLT), which are based on a lower bound of the distance between two objects in a road network; the lower bound is obtained using m randomly chosen “landmarks.” They proposed a heuristic k-ANN search algorithm using these indexes and achieved an improvement of up to two to three orders of magnitude. Li et al. [23] dealt with the k-ANN problem in an environment where object locations in a road network are uncertain. They represented the probabilistic regions of uncertain objects with a Voronoi diagram and proposed the Probabilistic Threshold k-ANN (PTkANN) algorithm based on it. Lee and Park [19] dealt with the optimal network location for travel planning (ONLTP) query, which minimizes the sum of the ANN distance and \(d(p^*, v)\), where \(p^*\) is the ANN object and v is a travel destination.

Li et al. [21, 22] addressed the FANN search problem in Euclidean spaces and proposed algorithms using the R-tree and a list data structure. The R-tree-based algorithm estimates the FANN distance using the Euclidean distances between the MBR of each R-tree node and the \(\phi M\) query objects nearest to it, and decides whether to prune the node based on the estimated distance. The list-based algorithm finds the final FANN object while gradually constructing a nearest-object list for each query object \(q_i\). Their experiments showed that the R-tree-based algorithm always achieved the higher search performance.

The FANN search problem in road networks was addressed by Yao et al. [33]. They proposed a Dijkstra-based algorithm, an R-List algorithm, and the R-tree-based IER-kNN algorithm. In addition, they presented an exact algorithm that requires no index for \(\mathcal {G}\) = max. They experimentally showed that IER-kNN performed best for all parameters and road network datasets; the index-free algorithms showed much lower search performance than those using an index. However, IER-kNN accesses many unnecessary R-tree nodes and performs many needless shortest-path distance calculations for the objects they contain. Chen et al. [11] addressed a FANN search problem in road networks that also takes keyword similarity into account. They defined a new distance function based on both the aggregate of the distances to the query objects \(q_i ~ (\in Q_\phi )\) and the keyword similarity, and presented algorithms (denoted as KFANN) extending the Dijkstra-based algorithm, the R-List algorithm, and IER-kNN of Yao et al. [33].

Chung and Loh [12] proposed an efficient \(\alpha\)-probabilistic FANN algorithm named IER-LMDS based on landmark multidimensional scaling (LMDS) [31, 32]. LMDS converts objects in a road network into objects in a Euclidean space such that the relative distances between them are preserved as much as possible; the converted objects can then be stored in an R-tree for efficient FANN search. However, IER-LMDS only returns search results satisfying a given search probability \(\alpha ~(0<\alpha <1)\) and may cause false positives and/or false negatives. In contrast, IER-kNN [33] and the algorithm proposed in this study are exact k-FANN algorithms, so we do not compare against IER-LMDS.

Bachman [7] proposed the SuperM-tree for approximate subsequence or subset search. Examples of subsequences or subsets include partial gene sequences or movie scenes. Čech et al. [9] proposed a k-NN similarity join algorithm for large high-dimensional datasets. They presented distributed algorithms on the Apache Hadoop and Spark platforms based on the metric space similarity defined using reference data objects (pivots). Lee et al. [20] presented a top-k subgraph search algorithm for a given set of query keywords Q. The algorithm is an enhancement of the threshold algorithm (TA), which was previously applied to non-graph structures. The efficiency of the algorithm was verified through experiments using actual and synthetic datasets.

3 FANN-PHL: proposed k-FANN algorithm

The M-tree [13] is a balanced tree index structure similar to the R-tree [24]. While the region of an R-tree node is the minimum bounding rectangle (MBR) enclosing all entries in the node, the region of an M-tree node is a sphere defined by a central object (or parent object) and a radius. Figure 1a shows the structure of an entry in an M-tree leaf node. A leaf entry corresponds to an object in a dataset. In Fig. 1a, \(O_i\) is an object, \(oid(O_i)\) is the object ID of \(O_i\), and \(d(O_i,O_p)\) is the distance between \(O_i\) and the parent object \(O_p\). The parent object \(O_p\) is a central object representing a leaf node L; among all the objects \(O_i\) in L, \(O_p\) is chosen to satisfy the following Eq. (3):

$$\begin{aligned} O_p = {\mathop {\mathrm{argmin}}\limits _{O_i \in L}} \left\{ \max \left\{ d(O_i, O_j), O_j \in L \right\} \right\} . \end{aligned}$$
(3)
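Eq. (3) is a minimax choice: the parent is the object whose worst-case distance to the other objects in the leaf is smallest. A minimal sketch under an arbitrary metric (function and variable names are ours, not from the paper):

```python
def choose_parent(objects, dist):
    """Pick the leaf-node parent object per Eq. (3):
    the object minimizing its maximum distance to all others."""
    return min(objects, key=lambda o_i: max(dist(o_i, o_j) for o_j in objects))

# Toy 1-D example with absolute difference as the metric:
points = [0.0, 1.0, 2.0, 10.0]
parent = choose_parent(points, lambda a, b: abs(a - b))
print(parent)  # 2.0: its max distance (8.0) is smaller than any other candidate's
```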

Figure 1b shows the structure of an entry in an M-tree non-leaf node N. A non-leaf entry corresponds to a sub-node n of N. In Fig. 1b, \(O_r\) is called the routing object and set as the parent object of n. \(r(O_r)\) is the radius of the spherical region of n, \(T(O_r)\) is a pointer to the subtree rooted by n, and \(d(O_r, O_p)\) is the distance between \(O_r\) and \(O_p\) of N. The parent object \(O_p\) is chosen as the routing object \(e_p.O_r\) of the entry \(e_p\) such that it satisfies the following Eq. (4) among the entries \(e_i\) in N:

$$\begin{aligned}&e_p = {\mathop {\mathrm{argmin}}\limits _{e_i \in N}} \left\{ \max \left\{ d(e_i, e_j), e_j \in N \right\} \right\} , \end{aligned}$$
(4)
$$\begin{aligned}&d(e_i, e_j) = d(e_i.O_r, e_j.O_r) + e_i.r(O_r) + e_j.r(O_r). \end{aligned}$$
(5)
Fig. 1
figure 1

Structures of M-tree node entries

As a condition for using the M-tree, the shortest-path distance function d() between two objects must satisfy the triangle inequality [13]; that is, for any objects \(o_1, o_2\), and \(o_3\), the inequality \(d(o_1, o_2) \le d(o_1, o_3) + d(o_3, o_2)\) must hold. This can be proved simply as follows: if \(d(o_1, o_2) > d(o_1, o_3) + d(o_3, o_2)\), then the shortest-path distance between \(o_1\) and \(o_2\) would be longer than that of the path through \(o_3\), which contradicts d being the shortest-path distance. Therefore, the inequality holds.
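The argument above can be illustrated on a toy road network: compute all-pairs shortest-path distances with Dijkstra's algorithm and check the inequality for every triple of vertices. A self-contained sketch (the graph and all names are hypothetical):

```python
import heapq
from itertools import permutations

def dijkstra(adj, src):
    """Shortest-path distances from src in an undirected weighted graph."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: vertex -> [(neighbor, edge_weight), ...]
adj = {0: [(1, 2.0), (2, 9.0)], 1: [(0, 2.0), (2, 3.0)], 2: [(0, 9.0), (1, 3.0)]}
D = {u: dijkstra(adj, u) for u in adj}

# d(o1, o2) <= d(o1, o3) + d(o3, o2) must hold for every triple of vertices.
assert all(D[a][b] <= D[a][c] + D[c][b] for a, b, c in permutations(adj, 3))
```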

In this section, we explain our exact k-FANN search algorithm that uses the M-tree constructed using the actual shortest-path distance D between two objects in a road network. To obtain the distance D between two objects, we used the PHL algorithm [4, 5], which is known as the fastest algorithm to obtain D [2, 33]. Our algorithm is referred to as FANN-PHL hereafter. Table 1 summarizes the notations used in this study.

Table 1 Summary of notations

We assume that all objects (POIs) are located on the vertices of the road network, as in previous studies [15, 33, 34]. If an object is located in the middle of an edge, the edge can be split into two edges at the object’s location. If an object lies outside the edges, a new edge connecting the object to the nearest vertex can be added to the road network.

Algorithm 1 describes the FANN-PHL algorithm, which has an overall structure similar to that of the previous FANN algorithms [21, 22, 33]. The input of the algorithm consists of a road network \(\mathcal {R}\), a POI set \(P ~ (\subseteq \mathcal {D})\), a query object set Q, a flexibility factor \(\phi\), an aggregate function \(\mathcal {G}\) (= max or sum), and an M-tree T. The algorithm returns the FANN object \(p^*\), a query subset \(Q^*_\phi\), and the FANN distance \(g(p^*, Q^*_\phi )\), where \(g(p^*, Q^*_\phi ) = \mathcal {G}\{ d(p^*, q_j), q_j \in Q^*_\phi \}\). Algorithm 1 is for the case in which the number of FANN objects k is 1, and the natural extension for the general case of \(k \ge 1\) will be described later in this section.

figure a

We explain each line of Algorithm 1 in detail. In line 1, \(\hat{p}^*\) denotes the best FANN candidate object found so far during the execution of FANN-PHL, and its FANN distance \(\hat{p}^*.g_\phi\) is initialized to \(\infty\). H is a priority queue that holds M-tree non-leaf node entries. In line 2, all entries of the root node of the M-tree are inserted into H. The while loop in lines 3 \(\sim\) 18 is repeated until no entry remains in H. In line 4, the entry with the highest priority in H, i.e., the entry most likely to contain the final FANN object, is extracted. The priority of a specific entry e is estimated by its FANN distance \(e.g_\phi\), and the entry with the smallest \(e.g_\phi\) among those in H is extracted. The \(e.g_\phi\) distance is obtained using Eq. (6) below.

In line 5, e.n is the sub-node for entry e, i.e., the root node of the subtree pointed by \(e.T(O_r)\) in Fig. 1b. If the node e.n is a non-leaf node, in line 8, the possibility of including the final FANN object is estimated for each entry \(e'\) in e.n; if there exists any possibility, \(e'\) is inserted into H. To estimate the possibility, the FANN distance \(e'.g_\phi\) for each entry \(e'\) is calculated as follows:

$$\begin{aligned}&e'.g_\phi = \min \left\{ g(e', Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(6)
$$\begin{aligned}&g(e', Q_\phi ) = \mathcal {G}\{ D(e', q_i), q_i \in Q_\phi \}, \end{aligned}$$
(7)

where \(D(e', q_i)\) is the distance between the spherical region of the node \(e'.n\) and a query object \(q_i\), defined as \(D(e', q_i) = \max \{ D(e'.O_r, q_i) - e'.r(O_r), 0 \}\). Figure 2 shows \(D(e', q_i)\) for two query objects \(q_1\) and \(q_2\). The distance D for a query object located inside the spherical region, such as \(q_2\), is defined as zero.
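Note that evaluating Eq. (6) does not require enumerating all subsets \(Q_\phi\): for \(\mathcal {G}\) = max or sum, the minimizing subset always consists of the \(\phi M\) query objects nearest to \(e'\), since both aggregates are monotone in each distance. A minimal sketch of this computation (the function name is ours, not from Algorithm 1):

```python
def fann_distance(dists, phi, agg):
    """Minimum aggregate distance over all subsets Q_phi of size phi*M (Eqs. 6-7).
    For agg in {"max", "sum"} the optimal subset is the phi*M nearest
    query objects, so no subset enumeration is needed."""
    m = max(1, int(phi * len(dists)))
    nearest = sorted(dists)[:m]          # distances to the phi*M closest queries
    return max(nearest) if agg == "max" else sum(nearest)

# Distances D(e', q_i) from an entry's sphere to four query objects:
d = [1.0, 4.0, 2.0, 7.0]
print(fann_distance(d, 0.5, "max"))  # 2.0 (max of the two nearest: 1.0, 2.0)
print(fann_distance(d, 0.5, "sum"))  # 3.0 (sum of the two nearest)
```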

Fig. 2
figure 2

Distances D between an entry \(e'\) and query objects \(q_1\) and \(q_2\)

In Algorithm 1, e is an entry in a non-leaf node N. Every non-leaf node N in the M-tree has multiple sub-nodes n, and a corresponding entry e for each sub-node n is contained in N. That is, e.n is one of the sub-nodes of a non-leaf node N. \(e'\) is an entry in the sub-node e.n. That is, the nodes containing entries e and \(e'\) have the relationship of directly connected parent and child nodes.

In line 8, if the FANN distance \(e'.g_\phi\) of a specific entry \(e'\) is smaller than the FANN distance \(\hat{p}^*.g_\phi\) of the FANN candidate object \(\hat{p}^*\) found so far, \(e'\) is inserted into H together with \(e'.g_\phi\). The distance \(D(e'.O_r, q_i)\) required to obtain \(e'.g_\phi\) is the shortest-path distance between the two objects \(e'.O_r\) and \(q_i\), and its calculation is expensive, as explained above. Hence, in line 7, the entries that cannot contain the final FANN object are first pruned off at a lower cost. For each entry \(e'\), \(e'.G_\phi\) is calculated as follows:

$$\begin{aligned}&e'.G_\phi = \min \left\{ G(e', Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(8)
$$\begin{aligned}&G(e', Q_\phi ) = \mathcal {G}\{ D_G(e', q_i), q_i \in Q_\phi \}, \end{aligned}$$
(9)

where \(D_G(e', q_i)\) is the distance between the spherical region for a node \(e'.n\) and a query object \(q_i\) and is defined as \(D_G(e', q_i) = | D(e.O_r, q_i) - D(e.O_r, e'.O_r) | - e'.r(O_r)\) (see Fig. 3a). The only difference from Eqs. (6) and (7) is that D is used in Eqs. (6) and (7) whereas \(D_G\) is used in Eqs. (8) and (9). Since \(e.O_r\) is the parent object in node n, which includes \(e'\), \(D(e.O_r, e'.O_r) = D(e'.O_p, e'.O_r)\) and is already stored in \(e'\) together with \(e'.r(O_r)\) (see Fig. 1b). The distance \(D(e.O_r, q_i)\) can be used commonly for every \(e'\) once it is calculated; therefore, it can reduce the calculations of D distances.
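The point of \(D_G\) is cost: once \(D(e.O_r, q_i)\) has been computed, the bound for every child entry \(e'\) needs only the values already stored in the entry, with no new shortest-path computation. A sketch of the per-entry bound (names are hypothetical):

```python
def lower_bound_DG(d_parent_to_q, d_parent_to_child, child_radius):
    """Triangle-inequality lower bound D_G(e', q_i) of Eqs. (8)-(9):
    reuses the precomputed D(e.O_r, q_i) and the stored D(e.O_r, e'.O_r)
    and e'.r(O_r), avoiding a fresh shortest-path computation per entry."""
    return abs(d_parent_to_q - d_parent_to_child) - child_radius

# The bound never exceeds the true sphere-to-query distance D(e', q_i),
# so pruning with it (line 7 of Algorithm 1) cannot cause a false drop.
print(lower_bound_DG(10.0, 3.0, 2.0))  # 5.0, a lower bound on D(e', q_i)
```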

In line 5, if e.n is a leaf node, in line 14, the FANN distance \(p.g_\phi\) is calculated as follows for each object p in e.n:

$$\begin{aligned}&p.g_\phi = \min \left\{ g(p, Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(10)
$$\begin{aligned}&g(p, Q_\phi ) = \mathcal {G}\{ D(p, q_i), q_i \in Q_\phi \}. \end{aligned}$$
(11)

Here, it should be checked whether the object p belongs to the POI set P. If the FANN distance of p is smaller than that of the FANN candidate object \(\hat{p}^*\), p becomes the new FANN candidate object. Calculating the FANN distance of an object p is very expensive since the distance D between p and every query object \(q_i\) must be obtained. Hence, in line 13, as in line 7, the objects that cannot be FANN objects are first pruned off at a lower cost. That is, \(p.G_\phi\) is calculated for each object p as follows:

$$\begin{aligned}&p.G_\phi = \min \left\{ G(p, Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(12)
$$\begin{aligned}&G(p, Q_\phi ) = \mathcal {G}\{ D_G(p, q_i), q_i \in Q_\phi \}, \end{aligned}$$
(13)

where \(D_G(p, q_i) = | D(e.O_r, q_i) - D(e.O_r, p) |\) (see Fig. 3b). The only difference from Eqs. (10) and (11) is that D is used in Eqs. (10) and (11) whereas \(D_G\) is used in Eqs. (12) and (13). Since \(e.O_r\) is the parent object in node n, which includes p, \(D(e.O_r, p) = D(O_p, p)\) and is already stored in the leaf node entry for p (see Fig. 1a). The calculations of D distances can be reduced since \(D(e.O_r, q_i)\) is commonly used for every p once it is calculated. In line 19, the FANN candidate object \(\hat{p}^*\) is returned as the final FANN object \(p^*\). The following Lemma 1 proves that the FANN-PHL algorithm is correct.

Fig. 3
figure 3

Finding entries/objects to prune in FANN-PHL

Lemma 1

The FANN-PHL algorithm does not cause any false drop.

Proof

In line 8, since it holds that \(D(p, q_i) \ge D(e', q_i) ~ (0\le i<M)\) for any object p included in the spherical region for \(e'\), it holds that \(g(p, Q_\phi ) \ge g(e', Q_\phi )\), i.e., \(p.g_\phi \ge e'.g_\phi\) for any \(Q_\phi\) (see Fig. 3a). If the condition in line 8 is not satisfied for the FANN candidate object \(\hat{p}^*\), i.e., if \(e'.g_\phi > \hat{p}^*.g_\phi\), it holds that \(p.g_\phi > \hat{p}^*.g_\phi\) for any object p in \(e'\). Therefore, \(e'\) can be safely discarded.

In line 7, it is always true that \(D(e', q_i) + e'.r(O_r) \ge | D(e.O_r, q_i) - D(e.O_r, e'.O_r) |\), i.e., \(D(e', q_i) \ge | D(e.O_r, q_i) - D(e.O_r, e'.O_r) | - e'.r(O_r) = D_G(e', q_i) ~ (0\le i<M)\) (see Fig. 3a). Hence, it holds that \(g(e', Q_\phi ) \ge G(e', Q_\phi )\), i.e., \(e'.g_\phi \ge e'.G_\phi\) for any \(Q_\phi\). If the condition in line 7 is not satisfied, i.e., if \(e'.G_\phi > \hat{p}^*.g_\phi\), it holds that \(e'.g_\phi > \hat{p}^*.g_\phi\). Therefore, \(e'\) can be discarded safely based on the proof for line 8.

In line 13, it is always true that \(D(p, q_i) \ge | D(e.O_r, q_i) - D(e.O_r, p) | = D_G(p, q_i) ~ (0\le i<M)\) (see Fig. 3b). Hence, for any \(Q_\phi\), it holds that \(g(p, Q_\phi ) \ge G(p, Q_\phi )\), i.e., \(p.g_\phi \ge p.G_\phi\). If the condition in line 13 is not satisfied, i.e., if \(p.G_\phi > \hat{p}^*.g_\phi\), it holds that \(p.g_\phi > \hat{p}^*.g_\phi\), and therefore p can also be discarded safely.

In conclusion, considering all the aforementioned proofs together, the FANN-PHL algorithm in Algorithm 1 does not cause any false drop. \(\square\)

Algorithm 1 applies to the case where the number of FANN objects k is 1, and it can be extended to the general case of \(k \ge 1\) as follows. First, an array K is allocated to store k FANN result objects and initialized as \(K_i.g_\phi = \infty ~ (0 \le i < k)\). The FANN candidate objects in K are always ordered by their respective \(K_i.g_\phi\) values. In lines 7, 8, 13, and 14 in Algorithm 1, comparisons are made with \(K_{k-1}.g_\phi\) instead of \(\hat{p}^*.g_\phi\). When the condition in line 14 is satisfied, a new object p is inserted into K, and the previous object \(K_{k-1}\) is removed. Finally, the array K is returned in line 19.
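The extension above amounts to maintaining a fixed-size sorted candidate array whose last element supplies the pruning threshold \(K_{k-1}.g_\phi\). A minimal sketch of such a structure (class and method names are ours):

```python
import bisect

class TopK:
    """Sorted array of the k best (smallest-distance) FANN candidates."""
    def __init__(self, k):
        self.k = k
        self.items = [(float("inf"), None)] * k  # (g_phi, object), ascending

    def threshold(self):
        return self.items[-1][0]  # K_{k-1}.g_phi, the current pruning bound

    def insert(self, g_phi, obj):
        if g_phi < self.threshold():
            bisect.insort(self.items, (g_phi, obj))
            self.items.pop()  # drop the previous K_{k-1}

topk = TopK(2)
for g, o in [(5.0, "a"), (3.0, "b"), (9.0, "c"), (1.0, "d")]:
    topk.insert(g, o)
print([o for _, o in topk.items])  # ['d', 'b']
```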

4 Experimental evaluation

In this section, we compare the search performance of our FANN-PHL algorithm with that of the IER-kNN algorithm [33] through a series of experiments using real road network datasets. The platform is a workstation with an AMD 3970X CPU, 128GB of memory, and a 1.2TB SSD. We implemented both FANN-PHL and IER-kNN in C/C++. Yao et al. [33] compared several methods for finding the FANN distance between an object p and a set of query objects Q and showed experimentally that IER-PHL, the variant using PHL [4, 5], is the most efficient. In our experiments, we therefore compare our algorithm with the IER-PHL variant of IER-kNN.

The datasets used in the experiments are real road networks of five regions in the U.S. These datasets have been used in the 9th DIMACS Implementation Challenge − Shortest Paths and in many previous studies [2, 33]. Table 2 summarizes the datasets, each of which is a graph consisting of a set of vertices and a set of undirected edges. Each vertex represents a point (i.e., an object) in the road network, and each edge represents the road segment directly connecting two vertices. Since these datasets contain noise such as self-loop edges and disconnected graph segments [33], we pre-processed the data to remove them. To quickly obtain the shortest-path distance D between two objects (vertices), we used the original C/C++ source code written by the creators of the PHL algorithm. Table 3 summarizes the sizes of each road network dataset and the corresponding M-tree and R-tree, all in MB. The node size of all indexes was fixed at 4KB, which is the most widely used in diverse database applications. Table 4 summarizes the parameters considered in the experiments, with default values given in parentheses.

Table 2 Road network datasets
Table 3 Road network dataset and index sizes (in MB)
Table 4 Experiment parameters

In the first experiment, we compared the FANN search execution time and the number of index node accesses for all road network datasets listed in Table 2, with all other parameters set to the default values in Table 4. Figure 4 shows the results, averaged over 1000 randomly generated query sets. The results for the aggregate functions \(\mathcal {G}\) = max and sum are labeled by appending “MAX” and “SUM” to the algorithm names, respectively, e.g., FANN-PHL-MAX and FANN-PHL-SUM. As shown in the figure, both algorithms exhibited similar trends in execution time and the number of index node accesses. The number of objects included in a query region of the same size grew with the size of the road network; therefore, the number of distance calculations and the execution time also grew. In this experiment, FANN-PHL consistently outperformed IER-kNN, with an improvement ratio of up to 4.75 times for the W dataset and \(\mathcal {G}\) = max.

Fig. 4
figure 4

Comparison of FANN performance for various road network datasets (\(\mathcal {R}\))

From the second experiment onward, we use the NW (Northwest USA) dataset by default. As given in Table 4, the default values of M, k, \(\phi\), and C are 256, 1, 0.5, and 0.10, respectively. In the second experiment, we compared the FANN search performance while varying the number of nearest objects k; the results are shown in Fig. 5. For both FANN-PHL and IER-kNN, the pruning bound grows with k, so more index nodes were visited and the execution time increased. In this experiment as well, FANN-PHL consistently outperformed IER-kNN, with a performance improvement of up to 2.40 times for k = 1 and \(\mathcal {G}\) = max.

Fig. 5
figure 5

Comparison of FANN performance for various numbers of nearest neighbors (k)

In the third experiment, we compared the FANN search performance for various flexibility factor values \(\phi\); the results are shown in Fig. 6. As \(\phi\) increased, the execution time and the number of index node accesses of IER-kNN increased. This is because, for a higher \(\phi\), the FANN distance \(\hat{p}^*.g_\phi\) of the FANN candidate object \(\hat{p}^*\) becomes larger, and more R-tree nodes are visited. In contrast, for FANN-PHL, the execution time and the number of M-tree node accesses decreased as \(\phi\) increased. This is because, as \(\phi\) increases in line 8 of Algorithm 1, \(e'.g_\phi\) for an entry \(e'\) increases faster than \(\hat{p}^*.g_\phi\). When calculating \(e'.g_\phi\), \(Q_\phi\) is composed of the query objects closest to \(e'\) among those in Q, so for a smaller \(\phi\), more query objects \(q_i ~ (\in Q_\phi )\) are likely to lie inside the spherical region of \(e'\). Since \(D(e', q_i) = 0\) for such \(q_i\), as for \(q_2\) in Fig. 2, \(e'.g_\phi\) also becomes zero or very close to zero. For a larger \(\phi\), this probability decreases, and it becomes more likely that \(e'.g_\phi > \hat{p}^*.g_\phi\); therefore, fewer entries \(e'\) are added to H as \(\phi\) increases. In this experiment as well, FANN-PHL consistently outperformed IER-kNN, with a performance improvement of up to 6.92 times for \(\phi\) = 1.0 and \(\mathcal {G}\) = max.

Fig. 6
figure 6

Comparison of FANN performance for various flexibility factors (\(\phi\))

In the fourth experiment, we compared the performance of FANN search while changing the coverage ratio C of query objects, where C denotes the ratio of the minimum area including all query objects to the area occupied by all road network objects. Figure 7 shows the experimental results. For higher C, the number of index nodes included in the query object area increases, and the execution time becomes larger. In this experiment, FANN-PHL consistently performed better than IER-kNN with a performance improvement of up to 3.06 times for C = 0.2 and \(\mathcal {G}\) = max.

Fig. 7
figure 7

Comparison of FANN performance for various coverage ratios of query (C)

In the final experiment, we compared the performance of FANN search while changing the number of query objects M, and the results are shown in Fig. 8. For both algorithms, we found that, as M increased, the number of index node accesses remained almost constant while the execution time increased linearly. This is because, even though M increases, there are no noticeable variations in \(\hat{p}^*.g_\phi\) and \(e'.g_\phi\) since the area of query objects remains similar. The number of M-tree nodes accessed by FANN-PHL was much smaller than the number of R-tree nodes accessed by IER-kNN (Fig. 8b). Meanwhile, as M increased, the number of calculations of distance D increased linearly for both algorithms (Fig. 8c). This is because the actual distance D to all M query objects \(q_i\) should be calculated to obtain \(\hat{p}^*.g_\phi\) and \(e'.g_\phi\). Owing to these two factors, the execution time of both algorithms increased linearly with M. In this experiment as well, FANN-PHL consistently outperformed IER-kNN with a performance improvement of up to 2.67 times for M = 64 and \(\mathcal {G}\) = max.

Fig. 8
figure 8

Comparison of FANN performance for various numbers of query objects (M)

5 Conclusions

This study proposed the FANN-PHL algorithm for efficient exact k-FANN search using the M-tree [13]. The state-of-the-art IER-kNN algorithm [33] uses the R-tree [24] and prunes index nodes that are unlikely to include the final result object based on Euclidean distances. However, since Euclidean distances differ significantly from the actual shortest-path distances between objects in road networks, IER-kNN makes many unnecessary index node accesses and thus performs many shortest-path distance calculations for the objects those nodes contain. Our FANN-PHL algorithm prunes index nodes more accurately than IER-kNN by using the M-tree, which is constructed from the actual distances between objects, and thereby dramatically reduces the number of shortest-path distance calculations. To the best of our knowledge, FANN-PHL is the first exact k-FANN algorithm that uses the M-tree. We proved that our algorithm does not cause any false drop. Through a series of experiments using various real road network datasets, we demonstrated that FANN-PHL consistently outperformed IER-kNN for all datasets and parameters, with a performance improvement of up to 6.92 times.