1 Introduction

This study proposes an efficient k-flexible aggregate nearest neighbor (FANN) search algorithm \((k \ge 1)\). The FANN search extends the aggregate nearest neighbor (ANN) search, which in turn extends the traditional nearest neighbor (NN) search. The NN search, which finds the object closest to a given query object q among the objects in a dataset \(\mathcal {D}\), is a fundamental problem in applications across many domains [2, 6, 16, 30]. The ANN search [15, 25, 34] extends the NN search by introducing a query set Q of \(M ~ (\ge 1)\) query objects \(q_j ~ (0 \le j < M)\) and finds the object \(p^*\) that satisfies the following Eq. (1):

$$\begin{aligned} p^* = {\mathop {\mathrm{argmin}}\limits _{p_i \in \mathcal {D}}} \left\{ \mathcal {G}\left\{ d(p_i, q_j), q_j \in Q \right\} \right\} , \end{aligned}$$
(1)

where \(\mathcal {G}\) denotes an aggregate function such as max and sum, and d() denotes the distance between two objects. A typical application of the ANN search is finding an optimal meeting place for M members.

The FANN search [21, 22, 33], which extends the ANN search by introducing a flexibility factor \(\phi ~ (0 < \phi \le 1)\), finds the object \(p^*\) that satisfies the following Eq. (2):

$$\begin{aligned} p^* = {\mathop {\mathrm{argmin}}\limits _{p_i \in \mathcal {D}}} \left\{ \mathcal {G}\left\{ d(p_i, q_j), q_j \in Q_\phi \right\} \right\} , \end{aligned}$$
(2)

where \(Q_\phi\) denotes any subset of Q of size \(\phi M\). An example of the FANN search is finding an optimal place for a meeting attended by only \(\phi M\) members, i.e., a minimum quorum of the M members. The FANN search cannot be solved simply by running an ANN search algorithm for every possible \(Q_\phi\): for M = 256 and \(\phi\) = 0.5, the ANN search would have to be performed as many as \(5.769 \times 10^{75}\) times. In this study, the target of the FANN search is the objects in a points-of-interest (POI) set \(P ~ (\subseteq \mathcal {D})\), e.g., hospitals and restaurants, instead of the whole dataset \(\mathcal {D}\).
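The figure \(5.769 \times 10^{75}\) above is simply the binomial coefficient \(\binom{256}{128}\), the number of distinct subsets \(Q_\phi\). A quick sketch to verify it (variable names are ours, not from the paper):

```python
from math import comb

# Number of distinct query subsets Q_phi of size phi*M drawn from M query objects.
M, phi = 256, 0.5
subset_count = comb(M, int(phi * M))  # C(256, 128)

print(f"{subset_count:.3e}")  # ~5.769e+75, matching the figure in the text
```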

The ANN and FANN searches can be used in various applications [15, 21, 22, 33, 34]. Consider a meeting of M members \(q_0, \dots , q_{M-1}\). If we want the meeting to convene as early as possible, we search for the place \(p^*~(\in P)\) that satisfies Eq. (1); if we want to minimize the total cost of all members gathering at the meeting place, we search with \(\mathcal {G}\) = sum. If a meeting of only \(\phi M\) members suffices instead of all M members, we search for the \(p^*\) that satisfies Eq. (2) instead of Eq. (1). The ANN and FANN searches are also applicable to multimedia databases [21, 22]: to find objects (e.g., images) similar to all of M given objects, we can perform an ANN search. However, if some of the M given objects differ markedly from the rest, the ANN search may return unexpected results; in this case, accurate results can be obtained by performing a FANN search with an appropriately adjusted \(\phi\) that excludes such outliers. Finally, the ANN and FANN searches can be used for patent analysis [1, 10, 28, 29]: rather than searching for each patent individually, a more comprehensive and efficient analysis is possible by searching for multiple related cases simultaneously.

The existing ANN and FANN search algorithms have been studied separately for Euclidean spaces and for road networks. A road network is represented as a graph, and the distance between two objects is defined as the shortest-path distance between them [2, 33, 34]. Since computing a shortest-path distance is far more expensive than computing a Euclidean distance [12, 17, 37], ANN and FANN search algorithms in road networks should minimize the number of shortest-path distance calculations. Yao et al. [33] proposed several exact FANN search algorithms for road networks, among which the IER-kNN algorithm showed the highest performance. IER-kNN uses the R-tree [24] and prunes the nodes that are unlikely to include the final result objects, thereby avoiding shortest-path distance calculations for the objects in the pruned nodes. Nevertheless, because its pruning decisions are based on Euclidean distances, which can differ significantly from the actual shortest-path distances, IER-kNN accesses many unnecessary nodes and performs many shortest-path distance calculations for the objects they contain.

This study proposes an efficient exact k-FANN search algorithm using the M-tree [13] and proves that the proposed algorithm does not cause any false drop. While the R-tree is an index structure for objects in a Euclidean space, the M-tree is constructed for a dataset in a metric space, where a distance function between objects is given instead of their actual coordinates. A road network can be mapped into a metric space [15, 33], and the M-tree is constructed using the actual shortest-path distances between objects in the road network. Therefore, our algorithm can prune index nodes more accurately than the state-of-the-art IER-kNN algorithm and can dramatically reduce the number of shortest-path distance calculations. To the best of our knowledge, our algorithm is the first exact FANN algorithm that uses the M-tree. We compare the performance of our algorithm with that of IER-kNN using various real road network datasets. The experimental results demonstrate that our algorithm consistently outperformed IER-kNN for all datasets and parameters, with a performance improvement of up to 6.92 times.

This paper is organized as follows. Section 2 briefly explains the structure of the M-tree and the existing FANN search algorithms. Section 3 describes our algorithm in detail. Section 4 compares the search performance for various real road network datasets and parameters. Finally, Sect. 5 concludes this study.

2 Related work

In this section, we review previous NN, ANN, and FANN algorithms. With the spread of ubiquitous mobile devices, the demand for efficient k-NN search in road networks has increased. Abeywickrama et al. [2] evaluated the performance of various existing k-NN algorithms, including Incremental Network Expansion (INE) [27], Incremental Euclidean Restriction (IER) [27], Route Overlay and Association Directory (ROAD) [18], and G-tree [36]. Their experiments on synthetic and real road network datasets demonstrated that the best performance was achieved by combining the previously neglected IER algorithm with the pruned highway labeling (PHL) algorithm [4, 5]. Shaw et al. [30] presented an approximate k-NN algorithm using Road Network Embedding (RNE), which maps objects on a road network into a p-dimensional Euclidean space. Their algorithm stores the mapped objects in the M-tree and showed search performance superior to the existing INE algorithm [27].

Gao et al. [14] dealt with the reverse k-NN (RkNN) problem in road networks. They presented an algorithm based on a heuristic filter-and-refinement framework that simultaneously considers spatial and textual information and demonstrated its efficiency using synthetic and real datasets. Zhao et al. [35] dealt with the diversified top-k geo-social keyword (DkGSK) query, which considers spatial, social, and textual constraints between the query and data objects; to enhance result quality, they considered not only the relevance but also the diversity of the query results. They showed that the problem is NP-hard and proposed an exact algorithm based on several heuristics as well as an approximate algorithm, whose efficiency was demonstrated using real datasets.

Ouyang et al. [26] proposed a parameter-free tree index for k-NN queries in road networks, whose size is bounded by O(nh), where n is the number of vertices and h is the height of the tree. Using the index, they proposed a progressive search algorithm that returns the i-th NN within O(ih) time \((1 \le i \le k)\) and showed that it outperformed two existing state-of-the-art algorithms. Barrientos et al. [8] proposed a k-NN algorithm for a multi-node/multi-GPU environment and achieved a speed-up of up to 1,843 times over the existing state-of-the-art GPU-based algorithm on a 5-node/20-GPU platform.

Ioup et al. [15] proposed an ANN search algorithm for road networks using the M-tree [13]. However, this algorithm returns only an approximate result, and the error ratio of the result is unknown. Miao et al. [25] dealt with the continuous k-ANN (CAkNN) problem in dynamic road networks, where the locations of data and query objects and the edge weights change over time. They defined a partial distance matrix data structure that contains only the data objects closer than a safe distance r to each query object, where r is not greater than the aggregate distance of the k-th candidate ANN object. Through experiments using real road network datasets, they showed that their algorithm is superior to the existing algorithm, which assumes static query objects.

Abeywickrama et al. [3] introduced the Subgraph-Landmark Tree (SL-Tree) and the Compacted Object-Landmark Tree (COLT), which are based on a lower bound of the distance between two objects in a road network; the lower bound is obtained using m randomly chosen “landmarks.” They proposed a heuristic k-ANN search algorithm using these indexes and achieved an improvement of up to two to three orders of magnitude. Li et al. [23] dealt with the k-ANN problem in an environment where object locations in a road network are uncertain. They represented the probabilistic regions of uncertain objects with a Voronoi diagram and proposed the Probabilistic Threshold k-ANN (PTkANN) algorithm based on it. Lee and Park [19] dealt with the optimal network location for travel planning (ONLTP) query, which minimizes the sum of the ANN distance and \(d(p^*, v)\), where \(p^*\) is the ANN object and v is a travel destination.

Li et al. [21, 22] addressed the FANN search problem in Euclidean spaces and proposed algorithms using the R-tree and a list data structure. The R-tree-based algorithm estimates the FANN distance using the Euclidean distances between the MBR of each R-tree node and the \(\phi M\) query objects nearest to it, and decides whether to prune the node based on the estimated distance. The list-based algorithm finds the final FANN object while gradually constructing a nearest-object list for each query object \(q_i\). Their experiments showed that the R-tree-based algorithm always achieved the higher search performance.

The FANN search problem in road networks was addressed by Yao et al. [33]. They proposed a Dijkstra-based algorithm, an R-List algorithm, and the R-tree-based IER-kNN algorithm. In addition, they presented an exact algorithm that requires no index for \(\mathcal {G}\) = max. They experimentally showed that IER-kNN performed best for all parameters and road network datasets; the index-free algorithms showed much lower search performance than those using an index. However, IER-kNN accesses many unnecessary R-tree nodes and performs many needless shortest-path distance calculations for the objects they contain. Chen et al. [11] addressed a FANN search problem in road networks that also takes keyword similarity into account. They defined a new distance function based on both the aggregate of the distances to the query objects \(q_i ~ (\in Q_\phi )\) and the keyword similarity, and presented algorithms (denoted as KFANN) extending the Dijkstra-based algorithm, the R-List algorithm, and IER-kNN of Yao et al. [33].

Chung and Loh [12] proposed an efficient \(\alpha\)-probabilistic FANN algorithm named IER-LMDS based on landmark multidimensional scaling (LMDS) [31, 32]. LMDS converts objects in a road network into objects in a Euclidean space such that the relative distances between them are preserved as much as possible; the converted objects can then be stored in an R-tree for efficient FANN search. However, IER-LMDS only returns search results satisfying a given search probability \(\alpha ~(0<\alpha <1)\) and may cause false positives and/or false negatives. In contrast, IER-kNN [33] and the algorithm proposed in this study are exact k-FANN algorithms, so we do not compare against IER-LMDS.

Bachman [7] proposed the SuperM-tree for approximate subsequence or subset search. Examples of subsequences or subsets include partial gene sequences or movie scenes. Čech et al. [9] proposed a k-NN similarity join algorithm for large high-dimensional datasets. They presented distributed algorithms on the Apache Hadoop and Spark platforms based on the metric space similarity defined using reference data objects (pivots). Lee et al. [20] presented a top-k subgraph search algorithm for a given set of query keywords Q. The algorithm is an enhancement of the threshold algorithm (TA), which was previously applied to non-graph structures. The efficiency of the algorithm was verified through experiments using actual and synthetic datasets.

3 FANN-PHL: proposed k-FANN algorithm

The M-tree [13] is a balanced tree index structure similar to the R-tree [24]. While the region of an R-tree node is the minimum bounding rectangle (MBR) enclosing all entries in the node, the region of an M-tree node is a sphere defined by a central object (or parent object) and a radius. Figure 1a shows the structure of an entry in an M-tree leaf node. A leaf entry corresponds to an object in a dataset. In Fig. 1a, \(O_i\) is an object, \(oid(O_i)\) is the object ID of \(O_i\), and \(d(O_i,O_p)\) is the distance between \(O_i\) and the parent object \(O_p\). The parent object \(O_p\) is a central object representing a leaf node L; among all the objects \(O_i\) in L, \(O_p\) is chosen to satisfy the following Eq. (3):

$$\begin{aligned} O_p = {\mathop {\mathrm{argmin}}\limits _{O_i \in L}} \left\{ \max \left\{ d(O_i, O_j), O_j \in L \right\} \right\} . \end{aligned}$$
(3)
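Eq. (3) is a minimax choice: the parent is the object whose worst-case distance to the other objects in the leaf is smallest. A minimal sketch under an arbitrary metric (function and variable names are ours, not from the paper):

```python
def choose_parent(objects, dist):
    """Pick the leaf-node parent object per Eq. (3):
    the object minimizing its maximum distance to all others."""
    return min(objects, key=lambda o_i: max(dist(o_i, o_j) for o_j in objects))

# Toy 1-D example with absolute difference as the metric:
points = [0.0, 1.0, 2.0, 10.0]
parent = choose_parent(points, lambda a, b: abs(a - b))
print(parent)  # 2.0: its max distance (8.0) is smaller than any other candidate's
```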

Figure 1b shows the structure of an entry in an M-tree non-leaf node N. A non-leaf entry corresponds to a sub-node n of N. In Fig. 1b, \(O_r\) is called the routing object and set as the parent object of n. \(r(O_r)\) is the radius of the spherical region of n, \(T(O_r)\) is a pointer to the subtree rooted by n, and \(d(O_r, O_p)\) is the distance between \(O_r\) and \(O_p\) of N. The parent object \(O_p\) is chosen as the routing object \(e_p.O_r\) of the entry \(e_p\) such that it satisfies the following Eq. (4) among the entries \(e_i\) in N:

$$\begin{aligned}&e_p = {\mathop {\mathrm{argmin}}\limits _{e_i \in N}} \left\{ \max \left\{ d(e_i, e_j), e_j \in N \right\} \right\} , \end{aligned}$$
(4)
$$\begin{aligned}&d(e_i, e_j) = d(e_i.O_r, e_j.O_r) + e_i.r(O_r) + e_j.r(O_r). \end{aligned}$$
(5)
Fig. 1
figure 1

Structures of M-tree node entries

As a condition for using the M-tree, the shortest-path distance function d() between two objects must satisfy the triangle inequality [13]; that is, for any objects \(o_1, o_2\), and \(o_3\), the inequality \(d(o_1, o_2) \le d(o_1, o_3) + d(o_3, o_2)\) must hold. This can be proved simply as follows: if \(d(o_1, o_2) > d(o_1, o_3) + d(o_3, o_2)\), then the shortest-path distance between \(o_1\) and \(o_2\) would be longer than that of the path through \(o_3\), which contradicts d being the shortest-path distance. Therefore, the inequality holds.
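The argument above can be illustrated on a toy road network: compute all-pairs shortest-path distances with Dijkstra's algorithm and check the inequality for every triple of vertices. A self-contained sketch (the graph and all names are hypothetical):

```python
import heapq
from itertools import permutations

def dijkstra(adj, src):
    """Shortest-path distances from src in an undirected weighted graph."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: vertex -> [(neighbor, edge_weight), ...]
adj = {0: [(1, 2.0), (2, 9.0)], 1: [(0, 2.0), (2, 3.0)], 2: [(0, 9.0), (1, 3.0)]}
D = {u: dijkstra(adj, u) for u in adj}

# d(o1, o2) <= d(o1, o3) + d(o3, o2) must hold for every triple of vertices.
assert all(D[a][b] <= D[a][c] + D[c][b] for a, b, c in permutations(adj, 3))
```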

In this section, we explain our exact k-FANN search algorithm that uses the M-tree constructed using the actual shortest-path distance D between two objects in a road network. To obtain the distance D between two objects, we used the PHL algorithm [4, 5], which is known as the fastest algorithm to obtain D [2, 33]. Our algorithm is referred to as FANN-PHL hereafter. Table 1 summarizes the notations used in this study.

Table 1 Summary of notations

We assume that all objects (POIs) are located on the vertices of the road network, as in previous studies [15, 33, 34]. If an object is located in the middle of an edge, the edge can be split into two edges at the object’s location. If an object lies outside the edges, a new edge connecting the object to the nearest vertex can be added to the road network.

Algorithm 1 describes the FANN-PHL algorithm, which has an overall structure similar to that of the previous FANN algorithms [21, 22, 33]. The input of the algorithm consists of a road network \(\mathcal {R}\), a POI set \(P ~ (\subseteq \mathcal {D})\), a query object set Q, a flexibility factor \(\phi\), an aggregate function \(\mathcal {G}\) (= max or sum), and an M-tree T. The algorithm returns the FANN object \(p^*\), a query subset \(Q^*_\phi\), and the FANN distance \(g(p^*, Q^*_\phi )\), where \(g(p^*, Q^*_\phi ) = \mathcal {G}\{ d(p^*, q_j), q_j \in Q^*_\phi \}\). Algorithm 1 is for the case in which the number of FANN objects k is 1, and the natural extension for the general case of \(k \ge 1\) will be described later in this section.

figure a

We explain each line of Algorithm 1 in detail. In line 1, \(\hat{p}^*\) denotes the best FANN candidate object found so far during the execution of FANN-PHL, and its FANN distance \(\hat{p}^*.g_\phi\) is initialized to \(\infty\). H is a priority queue that holds M-tree non-leaf node entries. In line 2, all entries of the root node of the M-tree are inserted into H. The while loop in lines 3 \(\sim\) 18 is repeated until no entry remains in H. In line 4, the entry with the highest priority in H, i.e., the entry most likely to contain the final FANN object, is extracted. The priority of a specific entry e is estimated by its FANN distance \(e.g_\phi\), and the entry with the smallest \(e.g_\phi\) among those in H is extracted. The \(e.g_\phi\) distance is obtained using Eq. (6) below.

In line 5, e.n is the sub-node for entry e, i.e., the root node of the subtree pointed by \(e.T(O_r)\) in Fig. 1b. If the node e.n is a non-leaf node, in line 8, the possibility of including the final FANN object is estimated for each entry \(e'\) in e.n; if there exists any possibility, \(e'\) is inserted into H. To estimate the possibility, the FANN distance \(e'.g_\phi\) for each entry \(e'\) is calculated as follows:

$$\begin{aligned}&e'.g_\phi = \min \left\{ g(e', Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(6)
$$\begin{aligned}&g(e', Q_\phi ) = \mathcal {G}\{ D(e', q_i), q_i \in Q_\phi \}, \end{aligned}$$
(7)

where \(D(e', q_i)\) is the distance between the spherical region of the node \(e'.n\) and a query object \(q_i\), defined as \(D(e', q_i) = \max \{ D(e'.O_r, q_i) - e'.r(O_r), 0 \}\). Figure 2 shows \(D(e', q_i)\) for two query objects \(q_1\) and \(q_2\). The distance D for a query object located inside the spherical region, such as \(q_2\), is defined as zero.
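Note that evaluating Eq. (6) does not require enumerating all subsets \(Q_\phi\): for \(\mathcal {G}\) = max or sum, the minimizing subset always consists of the \(\phi M\) query objects nearest to \(e'\), since both aggregates are monotone in each distance. A minimal sketch of this computation (the function name is ours, not from Algorithm 1):

```python
def fann_distance(dists, phi, agg):
    """Minimum aggregate distance over all subsets Q_phi of size phi*M (Eqs. 6-7).
    For agg in {"max", "sum"} the optimal subset is the phi*M nearest
    query objects, so no subset enumeration is needed."""
    m = max(1, int(phi * len(dists)))
    nearest = sorted(dists)[:m]          # distances to the phi*M closest queries
    return max(nearest) if agg == "max" else sum(nearest)

# Distances D(e', q_i) from an entry's sphere to four query objects:
d = [1.0, 4.0, 2.0, 7.0]
print(fann_distance(d, 0.5, "max"))  # 2.0 (max of the two nearest: 1.0, 2.0)
print(fann_distance(d, 0.5, "sum"))  # 3.0 (sum of the two nearest)
```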

Fig. 2
figure 2

Distances D between an entry \(e'\) and query objects \(q_1\) and \(q_2\)

In Algorithm 1, e is an entry in a non-leaf node N. Every non-leaf node N in the M-tree has multiple sub-nodes n, and a corresponding entry e for each sub-node n is contained in N. That is, e.n is one of the sub-nodes of a non-leaf node N. \(e'\) is an entry in the sub-node e.n. That is, the nodes containing entries e and \(e'\) have the relationship of directly connected parent and child nodes.

In line 8, if the FANN distance \(e'.g_\phi\) of a specific entry \(e'\) is smaller than the FANN distance \(\hat{p}^*.g_\phi\) of the FANN candidate object \(\hat{p}^*\) found so far, \(e'\) is inserted into H together with \(e'.g_\phi\). The distance \(D(e'.O_r, q_i)\) required to obtain \(e'.g_\phi\) is the shortest-path distance between the two objects \(e'.O_r\) and \(q_i\), and its calculation is expensive, as explained above. Hence, in line 7, the entries that cannot contain the final FANN object are first pruned off at a lower cost. For each entry \(e'\), \(e'.G_\phi\) is calculated as follows:

$$\begin{aligned}&e'.G_\phi = \min \left\{ G(e', Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(8)
$$\begin{aligned}&G(e', Q_\phi ) = \mathcal {G}\{ D_G(e', q_i), q_i \in Q_\phi \}, \end{aligned}$$
(9)

where \(D_G(e', q_i)\) is the distance between the spherical region for a node \(e'.n\) and a query object \(q_i\) and is defined as \(D_G(e', q_i) = | D(e.O_r, q_i) - D(e.O_r, e'.O_r) | - e'.r(O_r)\) (see Fig. 3a). The only difference from Eqs. (6) and (7) is that D is used in Eqs. (6) and (7) whereas \(D_G\) is used in Eqs. (8) and (9). Since \(e.O_r\) is the parent object in node n, which includes \(e'\), \(D(e.O_r, e'.O_r) = D(e'.O_p, e'.O_r)\) and is already stored in \(e'\) together with \(e'.r(O_r)\) (see Fig. 1b). The distance \(D(e.O_r, q_i)\) can be used commonly for every \(e'\) once it is calculated; therefore, it can reduce the calculations of D distances.
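The point of \(D_G\) is cost: once \(D(e.O_r, q_i)\) has been computed, the bound for every child entry \(e'\) needs only the values already stored in the entry, with no new shortest-path computation. A sketch of the per-entry bound (names are hypothetical):

```python
def lower_bound_DG(d_parent_to_q, d_parent_to_child, child_radius):
    """Triangle-inequality lower bound D_G(e', q_i) of Eqs. (8)-(9):
    reuses the precomputed D(e.O_r, q_i) and the stored D(e.O_r, e'.O_r)
    and e'.r(O_r), avoiding a fresh shortest-path computation per entry."""
    return abs(d_parent_to_q - d_parent_to_child) - child_radius

# The bound never exceeds the true sphere-to-query distance D(e', q_i),
# so pruning with it (line 7 of Algorithm 1) cannot cause a false drop.
print(lower_bound_DG(10.0, 3.0, 2.0))  # 5.0, a lower bound on D(e', q_i)
```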

In line 5, if e.n is a leaf node, in line 14, the FANN distance \(p.g_\phi\) is calculated as follows for each object p in e.n:

$$\begin{aligned}&p.g_\phi = \min \left\{ g(p, Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(10)
$$\begin{aligned}&g(p, Q_\phi ) = \mathcal {G}\{ D(p, q_i), q_i \in Q_\phi \}. \end{aligned}$$
(11)

Here, it should be checked whether the object p belongs to the POI set P. If the FANN distance of p is smaller than that of the FANN candidate object \(\hat{p}^*\), p becomes the new FANN candidate object. Calculating the FANN distance of an object p is very expensive since the distance D between p and every query object \(q_i\) must be obtained. Hence, in line 13, as in line 7, the objects that cannot be FANN objects are first pruned off at a lower cost. That is, \(p.G_\phi\) is calculated for each object p as follows:

$$\begin{aligned}&p.G_\phi = \min \left\{ G(p, Q_\phi ), Q_\phi \subseteq Q \right\} , \end{aligned}$$
(12)
$$\begin{aligned}&G(p, Q_\phi ) = \mathcal {G}\{ D_G(p, q_i), q_i \in Q_\phi \}, \end{aligned}$$
(13)

where \(D_G(p, q_i) = | D(e.O_r, q_i) - D(e.O_r, p) |\) (see Fig. 3b). The only difference from Eqs. (10) and (11) is that D is used in Eqs. (10) and (11) whereas \(D_G\) is used in Eqs. (12) and (13). Since \(e.O_r\) is the parent object in node n, which includes p, \(D(e.O_r, p) = D(O_p, p)\) and is already stored in the leaf node entry for p (see Fig. 1a). The calculations of D distances can be reduced since \(D(e.O_r, q_i)\) is commonly used for every p once it is calculated. In line 19, the FANN candidate object \(\hat{p}^*\) is returned as the final FANN object \(p^*\). The following Lemma 1 proves that the FANN-PHL algorithm is correct.

Fig. 3
figure 3

Finding entries/objects to prune in FANN-PHL

Lemma 1

The FANN-PHL algorithm does not cause any false drop.

Proof

In line 8, since it holds that \(D(p, q_i) \ge D(e', q_i) ~ (0\le i<M)\) for any object p included in the spherical region for \(e'\), it holds that \(g(p, Q_\phi ) \ge g(e', Q_\phi )\), i.e., \(p.g_\phi \ge e'.g_\phi\) for any \(Q_\phi\) (see Fig. 3a). If the condition in line 8 is not satisfied for the FANN candidate object \(\hat{p}^*\), i.e., if \(e'.g_\phi > \hat{p}^*.g_\phi\), it holds that \(p.g_\phi > \hat{p}^*.g_\phi\) for any object p in \(e'\). Therefore, \(e'\) can be safely discarded.

In line 7, it is always true that \(D(e', q_i) + e'.r(O_r) \ge | D(e.O_r, q_i) - D(e.O_r, e'.O_r) |\), i.e., \(D(e', q_i) \ge | D(e.O_r, q_i) - D(e.O_r, e'.O_r) | - e'.r(O_r) = D_G(e', q_i) ~ (0\le i<M)\) (see Fig. 3a). Hence, it holds that \(g(e', Q_\phi ) \ge G(e', Q_\phi )\), i.e., \(e'.g_\phi \ge e'.G_\phi\) for any \(Q_\phi\). If the condition in line 7 is not satisfied, i.e., if \(e'.G_\phi > \hat{p}^*.g_\phi\), it holds that \(e'.g_\phi > \hat{p}^*.g_\phi\). Therefore, \(e'\) can be discarded safely based on the proof for line 8.

In line 13, it is always true that \(D(p, q_i) \ge | D(e.O_r, q_i) - D(e.O_r, p) | = D_G(p, q_i) ~ (0\le i<M)\) (see Fig. 3b). Hence, for any \(Q_\phi\), it holds that \(g(p, Q_\phi ) \ge G(p, Q_\phi )\), i.e., \(p.g_\phi \ge p.G_\phi\). If the condition in line 13 is not satisfied, i.e., if \(p.G_\phi > \hat{p}^*.g_\phi\), it holds that \(p.g_\phi > \hat{p}^*.g_\phi\), and therefore p can also be discarded safely.

In conclusion, considering all the aforementioned proofs together, the FANN-PHL algorithm in Algorithm 1 does not cause any false drop. \(\square\)

Algorithm 1 applies to the case where the number of FANN objects k is 1, and it can be extended to the general case of \(k \ge 1\) as follows. First, an array K is allocated to store k FANN result objects and initialized as \(K_i.g_\phi = \infty ~ (0 \le i < k)\). The FANN candidate objects in K are always ordered by their respective \(K_i.g_\phi\) values. In lines 7, 8, 13, and 14 in Algorithm 1, comparisons are made with \(K_{k-1}.g_\phi\) instead of \(\hat{p}^*.g_\phi\). When the condition in line 14 is satisfied, a new object p is inserted into K, and the previous object \(K_{k-1}\) is removed. Finally, the array K is returned in line 19.
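The extension above amounts to maintaining a fixed-size sorted candidate array whose last element supplies the pruning threshold \(K_{k-1}.g_\phi\). A minimal sketch of such a structure (class and method names are ours):

```python
import bisect

class TopK:
    """Sorted array of the k best (smallest-distance) FANN candidates."""
    def __init__(self, k):
        self.k = k
        self.items = [(float("inf"), None)] * k  # (g_phi, object), ascending

    def threshold(self):
        return self.items[-1][0]  # K_{k-1}.g_phi, the current pruning bound

    def insert(self, g_phi, obj):
        if g_phi < self.threshold():
            bisect.insort(self.items, (g_phi, obj))
            self.items.pop()  # drop the previous K_{k-1}

topk = TopK(2)
for g, o in [(5.0, "a"), (3.0, "b"), (9.0, "c"), (1.0, "d")]:
    topk.insert(g, o)
print([o for _, o in topk.items])  # ['d', 'b']
```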

4 Experimental evaluation

In this section, we compare the search performance of our FANN-PHL algorithm with that of the IER-kNN algorithm [33] through a series of experiments using real road network datasets. The platform is a workstation with an AMD 3970X CPU, 128GB of memory, and a 1.2TB SSD. We implemented both FANN-PHL and IER-kNN in C/C++. Yao et al. [33] compared several methods for finding the FANN distance between an object p and a set of query objects Q and showed experimentally that IER-PHL, the variant using PHL [4, 5], is the most efficient. In our experiments, we therefore compare our algorithm with the IER-PHL variant of IER-kNN.

The datasets used in the experiments are real road networks of five regions in the U.S. These datasets have been used in the 9th DIMACS Implementation Challenge − Shortest Paths and in many previous studies [2, 33]. Table 2 summarizes the datasets, each of which is a graph consisting of a set of vertices and a set of undirected edges. Each vertex represents a point (i.e., an object) in the road network, and each edge represents the road segment directly connecting two vertices. Since these datasets contain noise such as self-loop edges and disconnected graph segments [33], we pre-processed the data to remove them. To quickly obtain the shortest-path distance D between two objects (vertices), we used the original C/C++ source code written by the creators of the PHL algorithm. Table 3 summarizes the sizes of each road network dataset and the corresponding M-tree and R-tree, all in MB. The node size of all indexes was fixed at 4KB, which is the most widely used in diverse database applications. Table 4 summarizes the parameters considered in the experiments, with default values given in parentheses.

Table 2 Road network datasets
Table 3 Road network dataset and index sizes (in MB)
Table 4 Experiment parameters

In the first experiment, we compared the FANN search execution time and the number of index node accesses for all road network datasets listed in Table 2, with all other parameters set to the default values in Table 4. Figure 4 shows the results, averaged over 1000 randomly generated query sets. The results for the aggregate functions \(\mathcal {G}\) = max and sum are labeled by appending “MAX” and “SUM” to the algorithm names, respectively, e.g., FANN-PHL-MAX and FANN-PHL-SUM. As shown in the figure, both algorithms exhibited similar trends in execution time and the number of index node accesses. The number of objects included in a query region of the same size grew with the size of the road network; therefore, the number of distance calculations and the execution time also grew. In this experiment, FANN-PHL consistently outperformed IER-kNN, with an improvement ratio of up to 4.75 times for the W dataset and \(\mathcal {G}\) = max.

Fig. 4
figure 4

Comparison of FANN performance for various road network datasets (\(\mathcal {R}\))

From the second experiment onward, we use the NW (Northwest USA) dataset by default. As given in Table 4, the default values of M, k, \(\phi\), and C are 256, 1, 0.5, and 0.10, respectively. In the second experiment, we compared the FANN search performance while varying the number of nearest objects k; the results are shown in Fig. 5. For both FANN-PHL and IER-kNN, the pruning bound grows with k, so more index nodes were visited and the execution time increased. In this experiment as well, FANN-PHL consistently outperformed IER-kNN, with a performance improvement of up to 2.40 times for k = 1 and \(\mathcal {G}\) = max.

Fig. 5
figure 5

Comparison of FANN performance for various numbers of nearest neighbors (k)

In the third experiment, we compared the FANN search performance for various flexibility factor values \(\phi\); the results are shown in Fig. 6. As \(\phi\) increased, the execution time and the number of index node accesses of IER-kNN increased. This is because, for a higher \(\phi\), the FANN distance \(\hat{p}^*.g_\phi\) of the FANN candidate object \(\hat{p}^*\) becomes larger, and more R-tree nodes are visited. In contrast, for FANN-PHL, the execution time and the number of M-tree node accesses decreased as \(\phi\) increased. This is because, as \(\phi\) increases in line 8 of Algorithm 1, \(e'.g_\phi\) for an entry \(e'\) increases faster than \(\hat{p}^*.g_\phi\). When calculating \(e'.g_\phi\), \(Q_\phi\) is composed of the query objects closest to \(e'\) among those in Q, so for a smaller \(\phi\), more query objects \(q_i ~ (\in Q_\phi )\) are likely to lie inside the spherical region of \(e'\). Since \(D(e', q_i) = 0\) for such \(q_i\), as for \(q_2\) in Fig. 2, \(e'.g_\phi\) also becomes zero or very close to zero. For a larger \(\phi\), this probability decreases, and it becomes more likely that \(e'.g_\phi > \hat{p}^*.g_\phi\); therefore, fewer entries \(e'\) are added to H as \(\phi\) increases. In this experiment as well, FANN-PHL consistently outperformed IER-kNN, with a performance improvement of up to 6.92 times for \(\phi\) = 1.0 and \(\mathcal {G}\) = max.

Fig. 6
figure 6

Comparison of FANN performance for various flexibility factors (\(\phi\))

In the fourth experiment, we compared the performance of FANN search while changing the coverage ratio C of query objects, where C denotes the ratio of the minimum area including all query objects to the area occupied by all road network objects. Figure 7 shows the experimental results. For higher C, the number of index nodes included in the query object area increases, and the execution time becomes larger. In this experiment, FANN-PHL consistently performed better than IER-kNN with a performance improvement of up to 3.06 times for C = 0.2 and \(\mathcal {G}\) = max.

Fig. 7
figure 7

Comparison of FANN performance for various coverage ratios of query (C)

In the final experiment, we compared the performance of FANN search while changing the number of query objects M, and the results are shown in Fig. 8. For both algorithms, we found that, as M increased, the number of index node accesses remained almost constant while the execution time increased linearly. This is because, even though M increases, there are no noticeable variations in \(\hat{p}^*.g_\phi\) and \(e'.g_\phi\) since the area of query objects remains similar. The number of M-tree nodes accessed by FANN-PHL was much smaller than the number of R-tree nodes accessed by IER-kNN (Fig. 8b). Meanwhile, as M increased, the number of calculations of distance D increased linearly for both algorithms (Fig. 8c). This is because the actual distance D to all M query objects \(q_i\) should be calculated to obtain \(\hat{p}^*.g_\phi\) and \(e'.g_\phi\). Owing to these two factors, the execution time of both algorithms increased linearly with M. In this experiment as well, FANN-PHL consistently outperformed IER-kNN with a performance improvement of up to 2.67 times for M = 64 and \(\mathcal {G}\) = max.

Fig. 8
figure 8

Comparison of FANN performance for various numbers of query objects (M)

5 Conclusions

This study proposed the FANN-PHL algorithm for efficient exact k-FANN search using the M-tree [13]. The state-of-the-art IER-kNN algorithm [33] uses the R-tree [24] and prunes index nodes that are unlikely to include the final result object based on Euclidean distances. However, since Euclidean distances differ significantly from the actual shortest-path distances between objects in road networks, IER-kNN makes many unnecessary index node accesses and thus performs many shortest-path distance calculations for the objects those nodes contain. Our FANN-PHL algorithm prunes index nodes more accurately than IER-kNN by using the M-tree, which is constructed from the actual distances between objects, and thereby dramatically reduces the number of shortest-path distance calculations. To the best of our knowledge, FANN-PHL is the first exact k-FANN algorithm that uses the M-tree. We proved that our algorithm does not cause any false drop. Through a series of experiments using various real road network datasets, we demonstrated that FANN-PHL consistently outperformed IER-kNN for all datasets and parameters, with a performance improvement of up to 6.92 times.