1 Introduction

Given a graph G, can the k nearest neighbor (kNN) nodes of a user-specified query node be found efficiently? In addition, if each node in the graph has a set of node attributes, how can we handle them in kNN queries? This study presents a fast graph indexing algorithm that efficiently finds the attributed kNN nodes of a user-specified query node in large complex networks.

As social applications advance, complex networks (or graphs) are becoming increasingly important for representing complicated data (Shiokawa et al. 2019; Shiokawa 2021). To handle such networks, kNN queries (Zhong et al. 2015; Li et al. 2019; Samet et al. 2008; Lee et al. 2012; Goldberg and Harrelson 2005) are essential building blocks for various applications (Chen et al. 2020; Alom et al. 2018; Boahen et al. 2020; Kesarwani et al. 2020; Ni et al. 2016; Mei et al. 2014; Li et al. 2015; Koçak et al. 2016; Zhang et al. 2018; Song and Park 2016; Cheema 2018; Chen et al. 2020). Given a query node in a graph, a kNN query explores the set of nodes with the top-k shortest path distances from the query. Unlike traditional distance-based queries (Bast et al. 2006; Geisberger et al. 2008; Jing et al. 1998; Jung and Pramanik 2002; Samet et al. 2008; Sankaranarayanan et al. 2009), kNN queries can return a result within a short time because they do not traverse the entire graph. Owing to this feature, kNN queries have been employed in various social applications, as described below:

Ride-sharing services: Ride-sharing services are a popular application of kNN queries and are used to alleviate travel costs, such as fares, energy consumption, and traffic pressure. Various rider-driver matching algorithms have recently been proposed to increase the cost efficiency of ride-sharing services (Chen et al. 2011; Ma et al. 2013; Asghari et al. 2016; Ta et al. 2018; Chen et al. 2020). For example, Chen et al. proposed DEFC, a kNN-based traveling cost estimation model for rider-driver matching networks Chen et al. (2020). By finding the top-k similar drivers in the network, DEFC effectively predicts the numbers of passengers and vehicles. Chen et al. experimentally confirmed that their kNN-based approach achieved better cost reductions than competing approaches in real-world applications.

Spam and fake news detection: In recent social networking services, the total number of user accounts has increased significantly. In these situations, some malicious users and fake accounts intentionally spread fake news and misinformation on social applications for financial gain or political attention. Therefore, spam and fake news detection methods are required to protect legitimate users from malicious ones (Alom et al. 2018; Boahen et al. 2020; Kesarwani et al. 2020). Recently, Alom et al. proposed a spam detection model for Twitter Alom et al. (2018). To effectively find malicious user accounts, Alom et al. extracted a set of user features using graph-based kNN queries, and they reported that their model outperformed other state-of-the-art approaches. Analogously, Kesarwani et al. proposed a similar kNN-based model to detect fake news widely spread on social media Kesarwani et al. (2020). They reported that their model achieved more than 79% accuracy on a Facebook news post dataset.

In addition to the above applications, graph-based kNN queries can also be used in other applications, such as malware detection Ni et al. (2016), smart city applications Mei et al. (2014), healthcare data management (Li et al. 2015; Koçak et al. 2016; Zhang et al. 2018), and location-based social networking systems (Song and Park 2016; Cheema 2018; Chen et al. 2020).

Although kNN queries are useful in many applications, they have serious drawbacks when handling real-world complex networks. Specifically, traditional kNN search algorithms (Zhong et al. 2015; Li et al. 2019; Samet et al. 2008; Lee et al. 2012; Goldberg and Harrelson 2005) are computationally expensive if the given graph is large. Historically, kNN queries have been applied to small graphs, such as ego networks and road networks, which have at most a few thousand nodes. By contrast, recent social networking applications must handle large-scale complex networks with a few million nodes Shiokawa (2021). Consequently, applications suffer from extensive computation times when traditional algorithms process kNN queries. Thus, these algorithms fail to find kNN nodes in large-scale real-world complex networks.

1.1 Existing approaches and challenges

Various approaches have been proposed to overcome the expensive costs of kNN queries. Graph indexing methods are the most successful to date (Zhong et al. 2015; Li et al. 2019; Abeywickrama and Cheema 2017; Huang et al. 2006; Samet et al. 2008). Examples include G-Tree Zhong et al. (2015) and ILBR Abeywickrama and Cheema (2017). To achieve fast kNN queries on a graph, these methods construct an index by partitioning the graph and pre-computing the shortest path distances among the nodes before running a kNN query. For instance, G-Tree Zhong et al. (2015) partitions a graph into disjoint subgraphs using Metis Karypis and Kumar (1995) and pre-computes the distances among all the nodes included in each subgraph. Similarly, ILBR Abeywickrama and Cheema (2017) selects several landmark nodes from the graph and constructs a Voronoi subgraph Okabe et al. (2000) using those landmarks. Subsequently, it pre-computes the distance from the landmarks to the nodes in each Voronoi subgraph. By searching for kNN nodes on the index, these methods avoid computing unnecessary nodes and edges, thereby achieving fast kNN queries.

Although indexing methods improve the efficiency of kNN queries, they cannot handle large complex networks for three reasons. First, indexing complex networks is expensive because these methods are designed under the assumption that most networks are planar graphs Wang et al. (2020). Their partitioning methods are efficient for planar graphs, such as road networks, but their assumptions are not suitable for complex networks with diverse structures Onizuka et al. (2017). Second, a kNN query on complex networks requires a long running time because of the same assumption. Regardless of indexing, these methods must compute many distances for complex networks at query time because the indexed subgraphs do not cover many of the edges in complex networks. Finally, the use of node attributes in complex networks has not been studied in existing works (Zhong et al. 2015; Li et al. 2019; Abeywickrama and Cheema 2017), although current networks (e.g., social networks) maintain various metadata for every node (Matsugu et al. 2020; 2021). Thus, a fast graph indexing algorithm for efficient attributed kNN queries remains elusive.

In summary, there are three main challenges in the graph indexing methods.

  • How can we achieve fast indexing when a graph is non-planar, as is typical of complex networks (e.g., scale-free and small-world graphs)?

  • On such complex networks, how can we efficiently find kNN nodes against a user-specified query node by using the graph-based index?

  • If nodes in complex networks have several node attributes, how does the graph indexing method handle them for efficient kNN queries?

1.2 Our approaches

Our goal is to achieve fast indexing to efficiently find the exact kNN nodes on large-scale attributed complex networks. Thus, we present a fast graph indexing method, core-tree-aware (CT) index, based on the well-known topological property of complex networks called the core-tree property Shavitt and Tankel (2008). Real-world complex networks typically comprise two parts: core and trees. The core is a small dense subgraph whose nodes are highly connected within the subgraph. By contrast, a tree is a long stretched sparse subgraph. As reported in Benson and Kleinberg (2019), the vast majority of nodes are included in trees, and only 0.6–9.3% of the nodes are contained in the core for a typical complex network.

Based on the above property, our algorithm, CT index, reduces indexing and kNN query costs by indexing the core and trees separately. First, it extracts trees from a graph and generates a tree-index by computing the distances between the root node and each leaf node in a tree. It then constructs a core-index from the non-tree nodes by updating the edge weights of the graph. In the kNN query process, our algorithm explores the nodes on the core-index while pruning unnecessary computations for trees based on the tree-index.

In addition to the above indexing approach, we further propose efficient techniques to maintain node attributes in kNN queries when each node in a complex network is attached to a set of attribute labels. Specifically, we first design a novel kNN query problem that handles node attributes (denoted as attributed kNN) and then provide the bipartite attribute graph (BAG) index, a novel index for maintaining node attributes. By integrating the core- and tree-indices with the BAG index, we extend the above kNN query algorithm so that it efficiently explores attributed kNN nodes whose attributes are similar to those of the user-specified query.

1.3 Contributions

Our contribution is summarized as follows:

  (1) A Novel Indexing Technique. We develop a novel graph indexing method, namely CT index, for efficient kNN queries based on the well-known topological properties of complex networks. Specifically, we construct the core-index and tree-index for the core and tree parts included in a complex network by handling the core-tree property (Sect. 3.2).

  (2) A Fast kNN Query Processing. We propose a fast kNN query processing algorithm using CT index. Our algorithm reduces the number of traversed nodes and distance computations during a kNN query by exploiting the core-tree property via the core- and tree-indices (Sect. 3.3).

  (3) Extending Our Indexing Methods to Attributed Graphs. We show that our graph indexing method also works for attributed graphs, where each node has one or more node attributes. Specifically, we extend CT index, allowing it to handle node attributes by using BAG index within a short index construction time (Sect. 4.1).

  (4) A Fast Attributed kNN Query Processing. By following the above extension, we propose a fast attribute-aware kNN query processing algorithm using both CT and BAG indices. Our algorithm avoids unnecessary traversals and computations using our core-tree-based indexing techniques, even if node attributes are available (Sect. 4.2).

  (5) Empirical Analysis. We conduct empirical evaluations using several real-world datasets. Our techniques achieve faster indexing than other state-of-the-art methods proposed in the last few years, outperforming them by up to four orders of magnitude in terms of index construction time (Sect. 5.2.1). In addition, our proposed methods are up to 146 times faster than the state-of-the-art methods in terms of kNN query time (Sect. 5.2.2).

  (6) Correctness of Search Results. We prove that our indexing and kNN query processing methods theoretically guarantee that the same kNN nodes are identified, even though they reduce the number of traversed and computed nodes (Theorems 1 and 3). That is, our proposed methods inherit the effectiveness of kNN queries in various applications while improving the running time of both index construction and query processing.

Our algorithm is the first to achieve fast indexing and kNN queries on large-scale attributed complex networks. For instance, our methods generate an index for a social network with 1.6 million nodes within 5 seconds and find the exact kNN nodes within 1 second. Note that contributions (3), (4), (5), and (6) listed above are new material compared to the conference version of this paper Kobayashi et al. (2021). Although kNN queries are useful in real-life applications, applying them to large attributed graphs is difficult because of the challenges described in Sect. 1.1. With our fast approaches, graph kNN queries will not only further enhance the quality of various applications but also expand their range of applications.

1.4 Organization

The rest of this paper is organized as follows: Sect. 2 provides key definitions and presents the problem statements addressed in this study. Based on the topological properties of complex networks, we first propose novel indexing and kNN query processing techniques for non-attributed graphs in Sect. 3. Our new indexing and kNN query processing techniques, aimed at handling node attributes, are presented in Sect. 4. Section 5 presents our empirical analysis using real-world complex networks. A brief overview of related work is provided in Sect. 6. Finally, Sect. 7 concludes this paper.

Table 1 Definitions of main symbols

2 Preliminary

We first briefly define the notations and then provide the formal problem definitions addressed in this paper. Table 1 summarizes the notations used in this paper for convenience.

This study focuses on a weighted undirected graph \(G = (V, E, W, A)\), where V, E, W, and A are the sets of nodes, edges, edge-weight values, and node attributes, respectively. If graph G has an edge between nodes \(u, v \in V\), it is denoted by \(e(u, v) \in E\). For each edge \(e(u, v) \in E\), an edge-weight value \(w(u, v) \in W\) must be defined, such that \(w(u, v) \in \mathbb {N}\). The node attributes in A are discrete and non-numerical labels that characterize the properties of each node in V. If \(A \ne \emptyset \), each node \(u \in V\) has at least one node attribute; otherwise, the node has no attributes. We denote a set of node attributes of node u as \(\textit{attr}(u)\), which indicates that \(A = \bigcup _{u \in V} \textit{attr}(u)\). Henceforth, for ease of presentation, we refer to a graph as \(G = (V, E, W)\) if \(A = \emptyset \).
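As a concrete illustration, the data model above can be sketched as an adjacency map plus per-node attribute sets; the class name and field layout below are our own, not the paper's:

```python
from collections import defaultdict

class AttributedGraph:
    """Weighted undirected graph G = (V, E, W, A); a hypothetical in-memory layout."""
    def __init__(self):
        self.adj = defaultdict(dict)   # adj[u][v] = w(u, v), stored symmetrically
        self.attr = defaultdict(set)   # attr[u] = set of discrete attribute labels

    def add_edge(self, u, v, w=1):
        self.adj[u][v] = w
        self.adj[v][u] = w             # undirected: mirror the edge

g = AttributedGraph()
g.add_edge("u", "v", 3)
g.attr["u"] = {"tokyo", "student"}
```

When `attr` is left empty for every node, this reduces to the unattributed case \(G = (V, E, W)\).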

2.1 kNN query processing problem

In this section, we provide a formal problem definition for the kNN query processing. We first introduce the following definition on a graph G.

Definition 1

(Shortest path distance) Let a node path \(u = u_0 \rightarrow u_1 \rightarrow u_2 \rightarrow \dots \rightarrow u_l = v\) in G be the shortest path between the nodes \(u, v \in V\). We denote the shortest path distance between nodes u and v as \(\textit{dist}(u, v)\), which is defined as follows:

$$\begin{aligned} \textit{dist}(u, v) = \sum _{i=0}^{l-1} w(u_i, u_{i+1}). \end{aligned}$$

For convenience, we use \(\textit{dist}_{k}(q, V)\) as the k-th smallest distance in \(\{\textit{dist}(q, v) \mid v \in V\}\).

Based on Definition 1, we define the kNN query processing problem as follows:

Problem 1

(kNN query processing) Given a graph \(G = (V, E, W)\), query node \(q \in V\), and integer \(k \in \mathbb {N}\), the kNN query is the problem of finding a set of nodes \(V_{k}(q)\) from G, which is defined as follows:

$$\begin{aligned} V_{k}(q) = \left\{ v \in V \mid \textit{dist}(q, v) \le \textit{dist}_{k}(q, V) \right\} . \end{aligned}$$
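For reference, Problem 1 can be solved without any index by running Dijkstra's algorithm from q and stopping once k nodes (other than q) are settled. The sketch below is our own baseline, not the paper's indexed algorithm:

```python
import heapq

def knn(adj, q, k):
    """Exact kNN of q via plain Dijkstra: settle nodes in distance order and
    stop after k nodes besides q. `adj` maps u -> {v: w(u, v)}."""
    dist = {q: 0}
    heap = [(0, q)]
    result = []                         # (node, distance) pairs, nearest first
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                    # stale queue entry
        if u != q:
            result.append((u, d))
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result
```

This baseline visits nodes far beyond the kNN frontier on dense cores, which is exactly the cost the indices of Sect. 3 are designed to avoid.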

2.2 Attributed kNN query processing problem

Next, we introduce the attributed kNN query processing problem, which handles node attributes in kNN queries. To handle the attributes effectively, we first define a similarity function for the node attributes as follows:

Definition 2

(Attribute similarity function) Given two nodes \(u, v \in V\), we denote the attribute similarity between nodes u and v by \(\textit{sim}(u, v)\), defined as follows:

$$\begin{aligned} \textit{sim}(u, v) = \frac{|\textit{attr}(u) \cap \textit{attr}(v)|}{\sqrt{|\textit{attr}(u)|\cdot |\textit{attr}(v)|}}. \end{aligned}$$

As shown in Definition 2, \(\textit{sim}(u, v)\) takes a real value between zero and one since it is based on the structural similarity Xu et al. (2007). If \(\textit{sim}(u, v) = 1\), nodes u and v have the same node attributes, i.e., \(\textit{attr}(u) = \textit{attr}(v)\). In contrast, these nodes share no attributes if \(\textit{sim}(u, v) = 0\). Note that the approaches introduced in the following sections are not limited to Definition 2, although we use the above function based on the structural similarity. Other similarity measures, such as the Dice similarity and Jaccard similarity, are also applicable to our method.
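Definition 2 translates directly into code over attribute sets; a minimal sketch (we return 0 for empty attribute sets, an edge case the definition leaves implicit):

```python
from math import sqrt

def sim(attr_u, attr_v):
    """Attribute similarity of Definition 2 for two attribute sets."""
    if not attr_u or not attr_v:
        return 0.0                      # assumed convention for attribute-less nodes
    return len(attr_u & attr_v) / sqrt(len(attr_u) * len(attr_v))
```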

Based on the above definitions, we finally formalize the attributed kNN query processing problem as follows:

Problem 2

(Attributed kNN query processing) Given a graph \(G = (V, E, W, A)\), query node \(q \in V\) with node attributes \( \textit{attr}(q)\), integer \(k \in \mathbb {N}\), and similarity threshold \( \theta \in \mathbb {R}\), the attributed kNN query is the problem of finding a set of nodes \(V_{k, \theta }\) from G, which is defined as follows:

$$\begin{aligned} V_{k,\theta } = \left\{ v \in V_{\theta }(q) \mid \textit{dist}(q, v) \le \textit{dist}_{k}(q, V_{\theta }(q)) \right\} , \end{aligned}$$

where \(V_{\theta }(q)\) is the set of nodes whose attribute similarity to q is at least \(\theta \), i.e., \(V_{\theta }(q) = \{ v \in V \mid \textit{sim}(q, v) \ge \theta \}\).

That is, Problem 2 is the task of finding the kNN nodes whose attribute similarity to the user-specified query node q is at least \(\theta \).

3 Core-tree-aware graph indexing

Here, we present CT index, a core-tree-aware indexing method that efficiently solves the kNN query processing problem (Problem 1) in large complex networks. We provide the basic idea underlying CT index in Sect. 3.1 and then detail our indexing and kNN query algorithms in Sects. 3.2 and 3.3, respectively.

3.1 Basic ideas

The goal of our proposed method is to achieve fast indexing and efficient kNN queries on large-scale complex networks. As described in Sect. 1, existing graph partitioning-based indexing approaches, such as G-Tree Zhong et al. (2015), have high computational costs when constructing an index. This is because partitioning approaches assume that graphs can be modeled as planar graphs, such as road networks. However, this assumption is not suitable for real-world complex networks, which generally have diverse topological structures. For example, (Leskovec and Krevl 2014; Shiokawa et al. 2015) reported that complex networks have very diverse structures that are not restricted to the planar graph model. Furthermore, the small-world property of complex networks yields many short-cut paths between two nodes, which causes overheads in index construction and kNN queries. For this reason, it is hard for partitioning approaches to obtain good partitions from complex networks Onizuka et al. (2017). Thus, the existing approaches suffer from high computational costs on large-scale complex networks.

In this study, we focus on the structural properties of complex networks to overcome the high costs of graph indexing for kNN queries. To achieve faster approaches, we design a core-tree-aware index, called CT index, that exploits the core-tree property of complex networks Shavitt and Tankel (2008). As described in Sect. 1, real-world complex networks can be divided into two parts: a small core and a set of trees. The core part is known to be very similar to an expander graph Maehara et al. (2014). Thus, the core part requires exhaustive shortest-path distance computations to determine the kNN nodes in the core. In contrast, we do not need to compute the distances for the trees because, in many cases, they can be trivially obtained from the tree height. From these observations, we propose CT index, comprising a core-index and a tree-index for the core and tree parts, respectively. In kNN queries, our algorithm explores nodes on the core-index while pruning unnecessary computations for trees based on the tree-index.

3.2 CT index: core- and tree-indexing algorithm

Based on the core-tree property, our algorithm generates a CT index \(\mathcal {I} = \langle \mathcal {T}, \mathcal {C} \rangle \), where \(\mathcal {T}\) and \(\mathcal {C}\) are the tree-index and core-index, respectively. At the beginning of graph indexing, it first extracts trees from a graph and generates a tree index, as defined below, by computing the distances between the root node and each leaf node in a tree.

Definition 3

(Tree-index \(\mathcal {T}\)) Let \(T_1, T_2, \dots , T_i\) be the trees included in G and \(r_i\) be the root node of \(T_i\), we denote \(D_i\) as a set of distances between \(r_i\) and each node \(v \in T_i\), i.e., \(D_i = \bigcup _{v \in T_i}\{\textit{dist}(r_i, v)\}\). The tree-index is defined as \(\mathcal {T} = (\mathbb {T}, \mathbb {D})\), where \(\mathbb {T}\) and \(\mathbb {D}\) are the sets of \(T_i\) and \(D_i\), respectively, i.e., \(\mathbb {T} = \{T_1, T_2, \dots , T_i \}\), and \(\mathbb {D} = \{D_1, D_2, \dots , D_i\}\).

As shown in Definition 3, a tree-index \(\mathcal {T}\) is a set of trees in G along with the shortest path distance between the tree root and each leaf node. After the tree-index construction, our proposed algorithm generates a core-index from the non-tree nodes. Formally, core-index is defined as follows:
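Computing \(D_i\) for one tree is a single root-to-leaves traversal that accumulates edge weights; a sketch, where the `children`/`weight` layout is our assumption about how an extracted tree might be stored:

```python
from collections import deque

def tree_distances(children, weight, root):
    """Distances from a tree's root to every node in it (D_i of Definition 3).
    `children[u]` lists u's children; `weight[(u, v)]` is the edge weight."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children.get(u, []):
            dist[v] = dist[u] + weight[(u, v)]   # tree: unique path from the root
            queue.append(v)
    return dist
```

Because each tree edge is visited once, building \(\mathbb {D}\) over all trees is linear in the number of tree nodes, matching the cost analysis given later.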

Definition 4

(Core-index \(\mathcal {C}\)) Let \(V_c\) be a set of nodes included in a core of G. The core-index is defined as \(\mathcal {C} = (V_c, E_c, W_c)\), where \(E_c = \{ e(u, v) \in E \mid u, v \in V_c \}\), and \(W_c = \{ \textit{dist}(u, v) \mid e(u, v) \in E_c \}\).

Note that in the core-index, the edge-weight values are updated to the shortest path distances among the core nodes \(V_c\). Unlike the tree-index, which maintains \(\mathbb {D}\) as shown in Definition 3, two nodes in \(V_{c}\) might be reachable via multiple paths with different distances. To avoid redundant search overhead at query time, we therefore replace the weight values with the shortest path distances. To this end, we update the weights using Dijkstra's algorithm.

[Algorithm 1 (pseudocode figure)]

Algorithm: Here, we present the concrete index construction procedure. To simplify the presentation, we define the label function \(f_l\), which identifies whether a node belongs to the core- or the tree-index.

Definition 5

(Label function \(f_l\)) The label function \(f_l\) is defined as \(f_{l}(u) = \textit{tree}\), if \(u \in \mathcal {T}\). Otherwise, \(f_{l}(u) = \textit{core}\).

Algorithm 1 shows the pseudocode of our algorithm for generating \(\mathcal {I} = \langle \mathcal {T}, \mathcal {C} \rangle \). This algorithm has two components: tree-indexing (lines 1-7) and core-indexing (lines 8-14). Given a graph G, the goal of the tree-indexing step is to construct the tree-index \(\mathcal {T}\) described in Definition 3. First, the algorithm extracts all trees \(\mathbb {T} = \{T_1, T_2, \dots , T_i \}\) in G. To obtain \(\mathbb {T}\), we run the \(\textsc {ExtractTrees}\) function (line 1), which employs the incremental aggregation method Shiokawa et al. (2013). This aggregation method is summarized as follows:

  1. Select a node \(u \in V\) whose degree is one.

  2. Aggregate u into its adjacent nodes.

  3. Repeat steps 1 and 2 until all nodes have at least two neighbor nodes.

  4. Output the aggregated nodes as a set of trees \(\mathbb {T}\).

Once a node \(u \in V\) is aggregated into its neighbor node \(v \in V\) by steps 1 and 2, the degree of v decreases. For instance, the degree of v becomes 1 after aggregating u into v if the degrees of u and v are 1 and 2 in G, respectively. Hence, the function can find a set of trees \(\mathbb {T} = \{T_1, T_2, \dots , T_i\}\) by repeating steps 1 and 2 until all nodes are adjacent to at least two nodes. As discussed in Shiokawa et al. (2013), the cost of this aggregation method is linear in the number of tree nodes. Thus, letting \(\alpha \) be the fraction of nodes in the trees, the function incurs \(O(\alpha |V|)\) time.
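The degree-1 peeling described above can be sketched as follows; this is our own reconstruction of the idea, not the code of Shiokawa et al. (2013). It returns the set of tree nodes, whose complement (plus the tree roots) forms the core:

```python
from collections import deque

def extract_trees(adj):
    """Iteratively peel degree-1 nodes; `adj` maps u -> set of neighbors.
    Returns the set of nodes that belong to pendant trees."""
    deg = {u: len(vs) for u, vs in adj.items()}
    queue = deque(u for u, d in deg.items() if d == 1)
    tree_nodes = set()
    while queue:
        u = queue.popleft()
        if u in tree_nodes:
            continue
        tree_nodes.add(u)
        for v in adj[u]:
            if v in tree_nodes:
                continue
            deg[v] -= 1                 # peeling u exposes v
            if deg[v] == 1:
                queue.append(v)         # v has become a degree-1 node
    return tree_nodes
```

Every node enters the queue at most a constant number of times, so the cost is linear in the number of tree nodes, i.e., \(O(\alpha |V|)\).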

We now have a set of trees \(\mathbb {T} = \{ T_1, T_2, \dots , T_i \}\), each of which is a set of nodes included in a tree. Algorithm 1 computes \(D_i\) in Definition 3 for each tree \(T_i \in \mathbb {T}\) (lines 2–6). Finally, it outputs tree-index \(\mathcal {T} = (\mathbb {T}, \mathbb {D})\) (line 7).

Next, the core-indexing step generates the core-index \(\mathcal {C}\) (lines 8-14). As shown in Definition 4, \(\mathcal {C}\) comprises \(V_c\), \(E_c\), and \(W_c\). In lines 8-9, the algorithm constructs a set of nodes \(V_c\) that includes all non-tree nodes and all root nodes in \(\mathbb {T}\). Algorithm 1 then links the nodes in \(V_c\) if they have edges in E (line 10). Finally, the algorithm updates the edge weights (lines 11-12). As shown in line 12, each weight is set to the shortest path distance between two adjacent core nodes.
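The weight update of lines 11-12 can be illustrated naively by running one Dijkstra per core node and rewriting each core edge's weight to the core shortest-path distance between its endpoints. This quadratic-cost sketch is ours and only illustrates the invariant of Definition 4, not the paper's implementation:

```python
import heapq

def update_core_weights(core_adj):
    """Return a copy of `core_adj` in which each edge weight w(u, v) is replaced
    by dist(u, v) computed over the core (Definition 4)."""
    def dijkstra(s):
        dist = {s: 0}
        heap = [(0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                # stale entry
            for v, w in core_adj[u].items():
                nd = d + w
                if v not in dist or nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist
    new_adj = {u: dict(vs) for u, vs in core_adj.items()}
    for u in core_adj:
        d = dijkstra(u)
        for v in core_adj[u]:
            new_adj[u][v] = d[v]        # edge weight := shortest-path distance
    return new_adj
```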

In Algorithm 1, letting \(\alpha \) be the fraction of tree nodes in G, the tree-index \(\mathcal {T}\) has \(\alpha |V|\) nodes and at most \(\alpha |V|\) edges. That is, lines 2-7 of Algorithm 1 incur \(O(\alpha |V|)\) time. Since the \(\textsc {ExtractTrees}\) function also incurs \(O(\alpha |V|)\) time, the time complexity of the tree-indexing is \(O(\alpha |V|)\). Similarly, the core-index \(\mathcal {C}\) contains \((1-\alpha )|V|\) nodes and \(|E|-\alpha |V|\) edges. The core-indexing invokes Dijkstra's algorithm to update the edge-weight values. Thus, the core-indexing incurs \(O( (|E|-\alpha |V|) + (1-\alpha )|V| \log \{ (1-\alpha )|V|\} )\) time if Dijkstra's algorithm employs a Fibonacci heap. Therefore, Algorithm 1 requires \(O( |E| + (1-\alpha )|V| \log \{ (1-\alpha )|V|\} )\) time in total. As reported in Benson and Kleinberg (2019), the vast majority of real-world graphs yield \(\alpha \ge 0.9\), which implies that the time complexity of Algorithm 1 is \(O( |E| + (1-\alpha )|V| \log \{ (1-\alpha )|V|\} ) \approx O(|E|)\) in practice. In Sect. 5.2.1, we further discuss the practical indexing costs using various real-world graphs.

[Algorithm 2 (pseudocode figure)]

3.3 kNN query processing algorithm

This section explains how a kNN query is computed on \(\mathcal {I} = \langle \mathcal {T}, \mathcal {C} \rangle \). To reduce the query time, our algorithm attempts to avoid computing the tree nodes. The algorithm starts a kNN search from a core node in \(\mathcal {C}\). Once it reaches the root node r of a tree \(T_i\), it examines whether computing \(T_i\) can be skipped using the tree-index \(\mathcal {T}\). If the nodes of \(T_i\) are guaranteed to be kNN nodes, the algorithm adds them to \(V_{k}(q)\) without computing \(T_i\).

Algorithm: Algorithm 2 shows the pseudocode of the kNN query using \(\mathcal {I}\). Algorithm 2 comprises two components: the main search algorithm (lines 1-17) and a subroutine TreePruning (lines 18-29). The main search algorithm explores the kNN nodes for q using index \(\mathcal {I}\). At the beginning of the algorithm, a priority queue Q is initialized, in which the nodes are prioritized by their distance from q. In the first step (lines 1-7), our algorithm initializes Q based on label function \(f_{l}(q)\). If \(f_{l}(q) = \textit{core}\), then it simply inserts q into Q (lines 6-7). Otherwise, it invokes the subroutine \(\textsc {TreePruning}\) (lines 2-5).

Given a root node r of \(T_i\), \(\textsc {TreePruning}\) examines whether \(T_i\) is included in \(V_{k}(q)\) without computing the nodes in \(T_i\) (lines 18-29). First, we introduce the following definition:

Definition 6

(Upper bound \(\overline{d}\)) Let r be a root node of \(T_i\); the upper bound of the distances \(\overline{d}(r)\) is defined as follows:

$$\begin{aligned} \overline{d}(r) = {\left\{ \begin{array}{ll} \textit{dist}_{\text {max}}(T_i) - \textit{dist}(q, r) &{} (q \in T_i) \\ \textit{dist}(q, r) + \textit{dist}_{\text {max}}(T_i) &{} (\textit{Otherwise}) \end{array}\right. } \end{aligned}$$

where \(\textit{dist}_{\text {max}}(T_i) = \max \{ d \mid d \in D_i \}\).

Definition 6 indicates that \(\overline{d}(r)\) is an upper bound on the distance between the query node q and any node in \(T_i\). From Definition 6, we have the following property:
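Once \(\textit{dist}(q, r)\) and \(\textit{dist}_{\text {max}}(T_i)\) are known, Definition 6 is a constant-time check; a literal transcription (function and parameter names are ours):

```python
def upper_bound(dist_q_r, dist_max_Ti, q_in_tree):
    """Upper bound d(r) of Definition 6 on the distance from q to any node of T_i."""
    if q_in_tree:
        return dist_max_Ti - dist_q_r
    return dist_q_r + dist_max_Ti
```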

Lemma 1

Let r be the root of \(T_i\) and \(d_{\text {min}} = \min \{\textit{dist}(q, v)\}\), where \(v \in Q \cup \{ v \mid e(r, v) \in E_c, v \notin V_{k}(q) \}\). If \(|V_{k}(q)|+|T_i| \le k\) and \(\overline{d}(r) \le d_{\text {min}}\), then \(T_i \subseteq V_{k}(q)\) holds.

Proof

If \(q \in T_i\), Lemma 1 trivially holds. Thus, we prove the lemma for \(q \notin T_i\) by contradiction. Assume \(u \in T_i\) but \(u \notin V_{k}(q)\). Because \(|V_{k}(q)| + |T_i| \le k\) and \(u \notin V_{k}(q)\), there must be at least one node \(u^{\prime } \in V\backslash (V_{k}(q) \cup T_i)\) such that \(\textit{dist}(q, u^{\prime }) < \overline{d}(r)\). This contradicts \(\overline{d}(r) \le d_{\text {min}}\). Hence, Lemma 1 holds. \(\square \)

Lemma 1 implies that we can prune the computation of trees satisfying the conditions shown in the lemma. Our method (lines 20–25) prunes \(T_i\) without computing \(T_i\) using Lemma 1. Otherwise, it regards \(T_i\) as the core nodes (lines 26–29), which are computed in the subsequent procedure (lines 12–16).

Finally, Algorithm 2 runs the kNN search step (lines 8-17). Once \(\langle u, \textit{dist}(q, u) \rangle \) is obtained from Q (line 9), the kNN search continues until \(V_{k}(q)\) includes all nodes whose distance is smaller than \(\textit{dist}_{k+1}(q, V)\). If \(f_{l}(u) = \textit{tree}\), the algorithm invokes \(\textsc {TreePruning}\) (lines 10-11), as in the initialization step. Otherwise, it traverses the core nodes by updating their distances (lines 12-16).

Correctness of kNN Search Results: By Lemma 1, the kNN query processing method shown in Algorithm 2 satisfies the following property after termination.

Theorem 1

(Correctness) \(V_{k}(q)\) obtained using Algorithm 2 is equivalent to the kNN nodes searched on G.

Proof

From Lemma 1, TreePruning prunes \(T_i\) only if all the nodes in \(T_i\) are included in \(V_{k}(q)\). Otherwise, the nodes are labeled as \(\textit{core}\) by Algorithm 2 (line 28). Theorem 1 holds because the algorithm traverses all core nodes until \(V_{k}(q)\) includes all nodes whose distance is smaller than \(\textit{dist}_{k+1}(q, V)\). \(\square \)

Theorem 1 indicates that our indexing and kNN query processing approaches do not sacrifice the kNN query quality.

4 Extension for attributed kNN queries

We now extend the CT index described in Sect. 3 to the attributed kNN query processing problem (Problem 2). In the following sections, we propose BAG index, an extension of CT index, in Sect. 4.1, and introduce the attributed kNN query algorithm using the CT and BAG indices in Sect. 4.2.

4.1 BAG Index

First, we extend the CT index shown in Sect. 3 to efficiently compute attributed kNN queries on attributed complex networks.

To handle attributed kNN queries, we need to identify the set of nodes \(V_{\theta }(q)\) shown in Problem 2, i.e., the nodes whose attribute similarity to a query node q is at least the user-specified threshold \(\theta \). For simplicity, we call \(V_{\theta }(q)\) the candidate nodes of q. A naïve approach to finding all candidate nodes is to compute the attribute similarity of Definition 2 between q and every node in V at query time. However, this approach is clearly time consuming because it requires at least O(|V|) time for each query. Hence, we propose an additional index for node attributes alongside the CT index.

In this section, we propose the bipartite attribute graph (BAG) index to efficiently obtain candidate nodes during query processing. The BAG index is a simple bipartite graph representing the relationship between a set of nodes V and a set of node attributes A, which is formally defined as follows:

Definition 7

(BAG index) Given a graph \(G = (V, E, W, A)\), a BAG index \(\mathcal {B} = (V_{\mathcal {B}}, E_{\mathcal {B}})\) is a bipartite graph, where \(V_{\mathcal {B}}\) and \(E_{\mathcal {B}}\) are sets of nodes and edges in the bipartite graph, respectively. \(V_{\mathcal {B}}\) is partitioned into V and A such that \(V_{\mathcal {B}} = V \cup A\), and \(E_{\mathcal {B}} = \{ e(u, a) \mid u \in V, a \in A, a \in \textit{attr}(u) \}\).

Although Definition 7 is quite simple, it effectively enhances the applicability of our core-tree-aware indexing. In Sect. 4.2, we explain how the attributed kNN queries can be solved using both \(\mathcal {I}\) and \(\mathcal {B}\).
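As a minimal sketch, assuming each node's attribute set is available as a Python mapping, the BAG index of Definition 7 can be materialized as plain adjacency sets (node and attribute identifiers are assumed to be distinguishable keys):

```python
from collections import defaultdict

def build_bag_index(attr):
    """Build the BAG index of Definition 7 as adjacency sets.
    `attr` maps each node u in V to its attribute set attr(u).
    Returns neighbor sets Gamma for both sides of the bipartite
    graph: Gamma[u] = attr(u) for u in V, and Gamma[a] = the set
    of nodes carrying attribute a."""
    gamma = defaultdict(set)
    for u, attrs in attr.items():
        for a in attrs:
            gamma[u].add(a)   # edge e(u, a) seen from the node side
            gamma[a].add(u)   # ... and from the attribute side
    return gamma
```

With this layout, \(|\Gamma (u)| = |\textit{attr}(u)|\) holds by construction, which is exactly the fact used later in the proof of Lemma 2.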


Algorithm: Finally, Algorithm 3 shows the pseudocode for the BAG index construction. In conjunction with the CT index construction shown in Algorithm 1, our proposed method invokes Algorithm 3 only if \(A \ne \emptyset \).

figure c

As shown in Algorithm 3, we must traverse at least |V| nodes and |A| node attributes to construct the BAG index \(\mathcal {B}\) from G. In other words, the BAG index construction costs \(O(|V|+|A|) \approx O(|V|)\) time as an additional overhead for index construction; as shown in Sect. 5, \(|A| \ll |V|\) in practical real-world graphs. To further examine this overhead, we experimentally evaluate the impact of Algorithm 3 on the index construction time in Sect. 5 using several real-world attributed graphs.

figure d

4.2 Attributed kNN query processing algorithm

We present an efficient algorithm for the attributed kNN query processing problem using the CT index \(\mathcal {I}\) and the BAG index \(\mathcal {B}\) simultaneously. Once a query node q and parameters k and \(\theta \) are specified, the proposed algorithm finds the attributed kNN nodes \(V_{k,\theta }(q)\) defined in Problem 2 by avoiding unnecessary similarity and distance computations. Specifically, our algorithm comprises two steps: (1) extracting candidate nodes and (2) querying kNN candidate nodes. In the following sections, we present detailed algorithms for these steps.

4.2.1 Extracting candidate nodes

In the first step, our algorithm extracts the candidate nodes \(V_{\theta }(q)\) from G. As described in Sect. 4.1, the naïve approach imposes at least O(|V|) time to obtain the candidate nodes \(V_{\theta }(q)\) for each query. To avoid this high cost, we propose a simple but efficient candidate node extraction algorithm based on the BAG index \(\mathcal {B}\).

Algorithm 4 shows the pseudo-code for candidate node extraction using the BAG index \(\mathcal {B}\). In this algorithm, we denote the set of neighbor nodes of node u in the BAG index \(\mathcal {B}\) as \(\Gamma (u)\), that is, \(\Gamma (u) = \{ v \in V_{\mathcal {B}} \mid e(u, v) \in E_{\mathcal {B}} \}\).

Once a query node q is specified, Algorithm 4 extracts the candidate nodes \(V_{\theta }(q)\) from the BAG index \(\mathcal {B}\), starting from node q (lines 3-12). The algorithm first obtains the set of nodes in V that share at least one node attribute with node q by traversing the two-hop-away neighbor nodes of q (lines 3-4). Let v be such a two-hop-away node. If v is not yet included in the candidate nodes, Algorithm 4 increments \(\text {count}[v]\) (line 6) and verifies whether node v can be a member of \(V_{\theta }(q)\) (lines 7-8). Specifically, node v becomes a candidate node in \(V_{\theta }(q)\) when it satisfies \(\text {count}[v] \ge \theta \sqrt{|\Gamma (q)|\cdot |\Gamma (v)|}\), because Algorithm 4 always satisfies the following lemma.

Lemma 2

If \(\text {count}[v] \ge \theta \sqrt{|\textit{attr}(q)| \cdot |\textit{attr}(v)|}\), then \(\textit{sim}(q, v) \ge \theta \) always holds.

Proof

As shown in Algorithm 4, \(\text {count}[v] \le |\textit{attr}(q) \cap \textit{attr}(v)|\) clearly holds. Recall that \(|\Gamma (q)| = |\textit{attr}(q)|\) and \(|\Gamma (v)| = |\textit{attr}(v)|\) clearly hold from Definition 7. Thus, if \(\text {count}[v] \ge \theta \sqrt{|\textit{attr}(q)| \cdot |\textit{attr}(v)|}\), then we always have \(\textit{sim}(q, v) \ge \theta \). \(\square \)

Lemma 2 indicates that Algorithm 4 adds node v to \(V_{\theta }(q)\) only if \(\textit{sim}(q, v) \ge \theta \) holds (lines 7-8). Algorithm 4 continues the above procedures for all two-hop-away nodes of node q, and finally outputs the candidate nodes \(V_{\theta }(q)\).
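The extraction step described above can be sketched as follows, assuming the BAG index is stored as neighbor sets `gamma` (so that `gamma[q]` plays the role of \(\Gamma (q)\)); the function name and the choice to skip the query node itself are our own illustrative assumptions:

```python
import math

def extract_candidates(gamma, q, theta):
    """Candidate extraction over the BAG index (the idea of
    Algorithm 4). `gamma` gives neighbor sets in the bipartite
    graph, so gamma[q] = attr(q). Each two-hop-away node v has its
    shared-attribute count incremented and is kept once count[v]
    reaches the bound theta * sqrt(|Gamma(q)| * |Gamma(v)|) of
    Lemma 2. The query node itself is skipped here (an assumption)."""
    count = {}
    candidates = set()
    for a in gamma[q]:            # attributes of the query node
        for v in gamma[a]:        # two-hop-away nodes sharing attribute a
            if v == q or v in candidates:
                continue
            count[v] = count.get(v, 0) + 1
            if count[v] >= theta * math.sqrt(len(gamma[q]) * len(gamma[v])):
                candidates.add(v)
    return candidates
```

Because the bound uses the fixed set sizes \(|\Gamma (q)|\) and \(|\Gamma (v)|\) while the count only grows, a node can enter the set only when Lemma 2 already guarantees \(\textit{sim}(q, v) \ge \theta \).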

After termination, Algorithm 4 satisfies the following property.

Theorem 2

\(V_{\theta }(q)\) obtained by Algorithm 4 is equivalent to the candidate nodes obtained by computing all attribute similarities among the nodes in V, i.e., \(\{ v \in V \mid \textit{sim}(q, v) \ge \theta \}\).

Proof

From Definition 2, \(\textit{sim}(q, v) > 0\) if and only if node v shares at least one node attribute with q. That is, if \(\textit{sim}(q, v) > 0\), node v must be included in the set of two-hop-away nodes of node q. In addition, as shown in Lemma 2, Algorithm 4 does not add any node v with \(\textit{sim}(q, v) < \theta \) to \(V_{\theta }(q)\). Hence, Algorithm 4 always outputs the exact candidate nodes after termination. \(\square \)

figure e

4.2.2 Querying kNN candidate nodes

In the second step, the algorithm explores the attributed kNN nodes \(V_{k,\theta }(q)\) included in the candidate nodes \(V_{\theta }(q)\) using \(\mathcal {I} = \langle \mathcal {T}, \mathcal {C} \rangle \), and \(\mathcal {B} = (V_{\mathcal {B}}, E_{\mathcal {B}})\). To inherit the efficiency of kNN queries on \(\mathcal {I}\) described in Sect. 3.3, we propose a simple extension of Algorithm 2 to find kNN nodes only from candidate nodes \(V_{\theta }(q)\) obtained from Algorithm 4.

Algorithm 5 shows the pseudocode for the attributed kNN query processing. In the algorithm, we highlight in gray the lines extended from the corresponding lines in Algorithm 2. As we can see, most parts are the same as in Algorithm 2, except for the modifications in the \(V_{k,\theta }(q)\) construction processes. Specifically, in line 8, the algorithm verifies whether core node u yields a similarity of at least \(\theta \). If node u is included in the candidate nodes, it is merged into the attributed kNN nodes \(V_{k,\theta }(q)\); otherwise, it is discarded. Analogously, Algorithm 5 performs the same verification for the trees in line 23 to preserve all attributed kNN nodes included in \(V_{\theta }(q)\).

The above verification imposes additional search costs for kNN queries; i.e., Algorithm 5 requires at most O(k) time for the verification in each attributed kNN query. Although the algorithm incurs the aforementioned overhead, it can inherit the tree-pruning techniques of Algorithm 2. That is, Algorithm 5 can skip exploring trees if no candidate nodes are included in the trees. In the experimental analysis in Sect. 5, we discuss the practical impact of this verification overhead on query processing time using several real-world datasets.
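As a sketch under stated assumptions, the added verification can be viewed as filtering a distance-ordered stream of kNN results through the candidate set; `filter_knn` and `knn_stream` below are illustrative names of our own, not the paper's notation:

```python
def filter_knn(knn_stream, candidates, k):
    """Sketch of the verification step added in Algorithm 5: nodes
    arriving in non-decreasing distance order from the kNN search
    machinery are kept only if they belong to the candidate set
    V_theta(q); all other nodes are discarded. `knn_stream` is any
    iterable of (node, distance) pairs sorted by distance."""
    result = []
    for u, d in knn_stream:
        if u in candidates:          # membership test, O(1) with a set
            result.append((u, d))
            if len(result) == k:     # k attributed NN found: stop early
                break
    return result
```

Since the stream arrives in distance order, the first k surviving nodes are exactly the attributed kNN nodes, and the loop can terminate as soon as they are found.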

Finally, we discuss theoretical aspects of Algorithm 5. From Theorems 1 and 2, we can obtain the following property for Algorithm 5:

Theorem 3


(Correctness of Attributed kNN Query) The attributed kNN nodes \(V_{k,\theta }(q)\) obtained by Algorithm 5 are exactly the same as those exhaustively searched on G.

Proof

From Theorem 2, Algorithm 5 does not miss any node that can be included in the attributed kNN nodes \(V_{\theta }(q)\). In addition, as proven in Theorem 1, Algorithm 2 finds exactly the same kNN nodes from the search space V as those explored on G. As shown in Algorithm 5, the search space V is replaced with \(V_{\theta }(q)\) in lines 8 and 23. In other words, Algorithm 5 does not miss any possible kNN node included in \(V_{\theta }(q)\), which completes the proof of Theorem 3. \(\square \)

Theorem 3 indicates that our algorithm does not sacrifice the attributed kNN query quality, although it dynamically prunes unnecessary computations by using \(\mathcal {I}\) and \(\mathcal {B}\).

5 Experimental analysis

Here, we experimentally discuss the efficiency of our proposed algorithms in terms of the indexing time and kNN querying time.

5.1 Experimental setting

Methods: We experimentally compared our methods with baseline methods, including state-of-the-art indexing algorithms. Here, we give an overview of the methods evaluated in this section for Problems 1 and 2.


Methods for kNN query processing (Problem 1):

  • CT: Our proposed graph indexing method based on the core-tree property of complex networks (Sect. 3). It first constructs an index using Algorithm 1 and then performs Algorithm 2 for kNN queries.

  • G-Tree: The state-of-the-art graph indexing method for kNN queries on a graph (Zhong et al. 2015; Li et al. 2019). To construct an index, G-Tree partitions a graph into hierarchical subgraphs by using Metis Karypis and Kumar (1995).

  • ILBR: A landmark-based graph indexing method for kNN queries Abeywickrama and Cheema (2017). ILBR exploits the nested Voronoi diagram graph Okabe et al. (2000) as a graph index.

Methods for attributed kNN query processing (Problem 2):

  • CT\(+\)BAG: Our proposed indexing methods using the CT and BAG index simultaneously (Sect. 4). Before the attributed kNN queries, CT\(+\)BAG constructs both CT and BAG indices based on Algorithms 1 and 3. It then explores the attributed kNN nodes by following Algorithm 5.

  • CT\(+\)Naïve: A naïve indexing approach using only the CT index. It constructs the CT index from a given graph but does not use the BAG index presented in Sect. 4. At query time, this method first constructs the candidate nodes \(V_{\theta }(q)\) by computing all attribute similarities between a query node q and every node \(v \in V\), instead of using the BAG index. It then performs Algorithm 5 on the candidate nodes \(V_{\theta }(q)\) in the same way as CT\(+\)BAG to obtain the attributed kNN nodes.

  • G-Tree\(+\)BAG: A baseline indexing algorithm based on the state-of-the-art method G-Tree (Zhong et al. 2015; Li et al. 2019). This method constructs G-Tree along with the BAG index to handle node attributes during kNN queries.

Unless otherwise stated, we used the parameter settings for G-Tree and ILBR recommended in the original papers (Zhong et al. 2015; Li et al. 2019; Abeywickrama and Cheema 2017). We also used \(k = 0.01 \times |V|\) as the default setting of k. All algorithms were implemented in C++ and compiled with gcc 9.2.0 using the -O2 option. All experiments were conducted on a server with an Intel Xeon CPU (2.60 GHz) and 128 GiB of RAM. We randomly selected 30 query nodes for each dataset and report the average query time.


Datasets: For the experimental analysis of kNN queries (Problem 1), we employed eight real-world graphs published in previous studies (Zhong et al. 2015; Li et al. 2019) and several public repositories (Demetrescu 2010; Leskovec and Krevl 2014). Table 2 shows the statistics of the real-world graphs. Note that the graphs shown in Table 2 have no attributes, since the kNN query problem does not require node attributes. To experimentally discuss the effectiveness of the CT index, we employed two types of real-world graphs: road networks and social networks. In Table 2, CAL, NY, and FLA are road networks, whereas the remaining graphs are social networks.

Next, we used four real-world attributed graphs derived from the MovieLens 25M Dataset Harper and Konstan (2015) to evaluate the effectiveness of our attributed kNN query approaches. The MovieLens 25M Dataset is a collection of 5-star movie rating activities in the movie recommendation service MovieLens. This dataset contains the rating activities of 162,541 users across 62,423 movies, with 25,000,095 ratings and 18 movie genre tags; each user rates a movie from 0 to 5.0 in increments of 0.5 points, and each movie has a subset of the 18 movie genre tags.

Using the MovieLens 25M Dataset, we generated social graphs \(G=(V, E, W, A)\) that represent user relationships by the following steps. (Step 1) We first add randomly selected users to V and append an edge to E if two users rated the same movie. (Step 2) For each edge \(e(u, v) \in E\), we assign the number of commonly rated movies between users u and v as the edge weight \(w(u, v) \in W\). (Step 3) We then assign node attributes to each user \(u \in V\) from combinations of a movie genre tag and a movie rating given by the user. For example, if a movie with the genre "Horror" is rated 2.5 by user u, we add the attribute "Horror (2.5)" to \(\textit{attr}(u)\). Thus, our graphs have 180 attributes in total, i.e., \(|A|= |\bigcup _{u \in V} \textit{attr}(u)| = 180\). (Step 4) Finally, we drop the unpromising edges in E whose edge weights are not in the top 25% of W. Following these steps, we generated four attributed graphs: we varied the number of randomly selected users in (Step 1) as 1000, 5000, 10,000, and 50,000, and constructed the four graphs whose statistics are summarized in Table 3.
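Steps 1-4 can be sketched on toy data as follows; the function, its inputs, and the `top_frac` parameter are illustrative assumptions of ours, not the authors' data-generation code (for brevity, the sketch takes all users in the rating list rather than a random sample):

```python
from collections import defaultdict
from itertools import combinations

def build_movielens_graph(ratings, genres, top_frac=0.25):
    """Sketch of Steps 1-4 on toy data. `ratings` is a list of
    (user, movie, score) triples; `genres` maps a movie to its
    genre tags. All names are illustrative, not the authors' code."""
    # Steps 1-2: connect users who rated the same movie; the edge
    # weight is the number of commonly rated movies.
    by_movie = defaultdict(set)
    attr = defaultdict(set)
    for u, m, s in ratings:
        by_movie[m].add(u)
        for g in genres[m]:
            attr[u].add(f"{g} ({s})")   # Step 3: genre + rating attribute
    weight = defaultdict(int)
    for users in by_movie.values():
        for u, v in combinations(sorted(users), 2):
            weight[(u, v)] += 1
    # Step 4: keep only the edges whose weight is in the top fraction.
    ranked = sorted(weight.items(), key=lambda kv: -kv[1])
    kept = dict(ranked[:max(1, int(len(ranked) * top_frac))])
    return kept, attr
```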

Table 2 Statistics of real-world graphs for kNN queries (Problem 1)
Table 3 Statistics of real-world attributed graphs for attributed kNN queries (Problem 2)

5.2 Evaluating efficiency for kNN queries

We first evaluate the running time efficiency of our proposed methods for kNN queries defined in Problem 1.

5.2.1 Index construction time

Figure 1 shows the indexing time on the real-world datasets in Table 2. The results of G-Tree are omitted for GV, NS, AT, and SP, as indexing on these datasets did not finish within one hour. Our algorithm significantly outperformed G-Tree and ILBR under all examined conditions, demonstrating indexing efficiency up to four orders of magnitude higher than that of the aforementioned state-of-the-art methods. For instance, our method was 18,074 times faster than G-Tree on TV. A comparison of the indexing time by graph type (i.e., road or social networks) shows that our algorithm is better suited for social networks than for road networks. This is because Algorithm 1 can reduce its running time if a graph contains many trees: each tree \(T_i\) requires only \(O(|T_i|)\) time for distance pre-computation. Social networks inherit the core-tree property of complex networks, and most subgraphs of social networks are trees. Hence, our algorithm can significantly reduce the indexing time for social networks.

Fig. 1: Indexing time for kNN queries

Fig. 2: kNN query time (\(k = 0.01 \times |V|\))

5.2.2 kNN query time

Figure 2 shows the kNN query time when \(k = 0.01 \times |V|\). Results are omitted if a kNN query did not finish within 1 minute or if the corresponding index was not available in Fig. 1. Our algorithm significantly outperformed the other methods on social networks, whereas its improvements were relatively small on road networks. This is because our algorithm can skip searching trees by Lemma 1. As a result, it improved query efficiency by up to two orders of magnitude on social networks compared with the aforementioned state-of-the-art methods. Our algorithm achieved quick kNN query processing on NS, AT, and SP, whereas the other methods failed to compute the query.

Figure 3 shows the impact of k on the query time. We varied k as \(0.001 \times |V|\), \(0.01 \times |V|\), and \(0.1 \times |V|\). We report results only for NY, FLA, TV, and SP because the results on the other datasets showed similar trends. Results are omitted if the kNN query did not finish within 1 minute. Our method was significantly faster than the others on social networks regardless of k, whereas its improvements were small on road networks. If a graph has many trees, as social networks do, our algorithm can drastically prune the trees without computing their nodes using Lemma 1. Consequently, our method is better suited for large-scale complex networks than the aforementioned state-of-the-art methods.

Fig. 3: Effect of k for kNN query time

5.3 Evaluating efficiency for attributed kNN queries

We then evaluate the running time efficiency of our proposed methods for attributed kNN queries defined in Problem 2.

5.4 Indexing time

Figure 4 shows the indexing time on the real-world attributed graphs shown in Table 3. The results of G-Tree\(+\)BAG are omitted for MovieLens-10K and MovieLens-50K since its indexing process did not finish within 24 hours.

As shown in the figure, our proposed method, CT\(+\)BAG, significantly outperforms G-Tree\(+\)BAG under all examined conditions. Specifically, CT\(+\)BAG is up to four orders of magnitude faster than G-Tree\(+\)BAG in terms of indexing time. As shown in Table 3, the attributed graphs used in the experiments have dense and complicated structures; the partitioning methods used in G-Tree therefore incur expensive costs to generate the hierarchical index, and G-Tree\(+\)BAG requires a long indexing time on the attributed graphs. By contrast, owing to the core-tree properties of the graphs, our proposed methods successfully reduce the running costs of indexing.

A comparison between CT\(+\)BAG and CT\(+\)Naïve in Fig. 4 indicates that the impact of BAG index construction is negligible. Although the BAG index additionally requires \(O(|V|+|A|)\) time compared with CT\(+\)Naïve (Sect. 4.1), Table 3 shows that \(|V|+|A| \ll |E|\) in practical attributed graphs. Hence, CT\(+\)BAG achieves almost the same index construction time as CT\(+\)Naïve, which constructs only the CT index.

Fig. 4: Indexing time for attributed kNN queries

Fig. 5: Attributed kNN query time (\(k=10\) and \(\theta =0.3\))

5.5 Attributed kNN query time

We experimentally evaluate the attributed kNN query time on the attributed graphs shown in Table 3. Figure 5 shows the attributed kNN query time on the attributed graphs with \(k = 10\) and \(\theta = 0.3\). Recall that G-Tree\(+\)BAG did not complete indexing on MovieLens-10K and MovieLens-50K, as shown in Sect. 5.4; thus, we omitted the results of G-Tree\(+\)BAG for those graphs.

The figure demonstrates that the BAG index effectively reduces the attributed kNN query time; our proposed method CT\(+\)BAG achieves faster attributed kNN queries than CT\(+\)Naïve and G-Tree\(+\)BAG on all datasets. Owing to the core-tree property and the BAG index, our algorithm outperforms the other methods by up to three orders of magnitude. As described in Sect. 4.2, our proposed method requires O(k) time for the verification against the candidate nodes extracted by Algorithm 4. By contrast, CT\(+\)Naïve requires at least O(|V|) time because it must compute the attribute similarities from a query node to all nodes in V. Hence, CT\(+\)BAG efficiently finds the attributed kNN nodes since \(k \ll |V|\) in practice.

Fig. 6: Effect of k for attributed kNN query time

Fig. 7: Effect of \(\theta \) for attributed kNN query time

Fig. 8: Effects of node attributes in indexing time for attributed kNN query

Fig. 9: Effects of node attributes in kNN query time (\(k = 0.01 \times |V|\))

We then assess the impact of k on the attributed kNN query using the CT\(+\)BAG indices. For each dataset, we varied k as 10, 20, 30, 40, and 50 under various \(\theta \) settings. Figure 6 clearly demonstrates that CT\(+\)BAG significantly outperforms the other methods under all conditions examined in this evaluation. Although the attributed kNN query time of our algorithm gradually increases with k, CT\(+\)BAG is still significantly faster than CT\(+\)Naïve in all experimental settings. As shown in Algorithm 5, CT\(+\)BAG incurs additional costs (1) to extract the candidate nodes \(V_{\theta }(q)\) and (2) to verify whether a node is included in \(V_{\theta }(q)\). Nevertheless, our approach is computationally efficient compared with CT\(+\)Naïve, since the naïve approach incurs O(|V|) time for each query.

We also experimentally discuss the efficiency of our proposed methods by varying the user-specified threshold \(\theta \). In this evaluation, we varied \(\theta \) as 0.1, 0.3, 0.5, and 0.7 for all datasets. As shown in Fig. 7, our proposed method is still more efficient than CT\(+\)Naïve and G-Tree\(+\)BAG under all conditions examined in the experiments. When \(\theta = 0.7\), G-Tree\(+\)BAG reduces its running time, and the running time of CT\(+\)BAG approaches that of CT\(+\)Naïve. This is because \(|V_{\theta }(q)|\) becomes significantly small for larger \(\theta \), e.g., \(\theta \ge 0.7\); in such cases, our method cannot prune a sufficient number of trees using the CT index. In practice, the size of \(V_{\theta }(q)\) varies depending on where the attributed graphs are obtained. Thus, Fig. 7 implies that our methods are applicable to more diverse kNN use cases, since they achieve faster query processing than the others in all settings.

Finally, we assess the effect of adding node attributes on the kNN query. Using the MovieLens datasets, we compared the running times of CT\(+\)BAG and CT to discuss the impact of handling node attributes. In this evaluation, we ignored the node attributes only for CT; it uses the graphs shown in Table 3 with the node attributes excluded. Figures 8 and 9 show the indexing time and the kNN query time, respectively. Both figures demonstrate that CT\(+\)BAG requires slightly longer runtimes than CT if the graph is small, since CT\(+\)BAG incurs the additional overhead of the BAG index. In contrast, this overhead becomes negligible as the graph size increases, e.g., on MovieLens-50K. This is because the construction cost of the BAG index is sufficiently small compared with that of the CT index. For instance, as discussed in Sects. 3 and 4, the BAG index imposes O(|V|) time in addition to the O(|E|) time for constructing the CT index. As shown in Tables 2 and 3, practical complex networks yield \(|V| \ll |E|\), which significantly mitigates the overhead of adding node attributes.

6 Related work

The problem of finding kNN nodes on a graph has been studied in many fields in recent years, particularly in social network analysis and database communities (Zhong et al. 2015; Li et al. 2019; Abeywickrama and Cheema 2017; Lee et al. 2012; Samet et al. 2008; Goldberg and Harrelson 2005; Kobayashi et al. 2021). We categorize the related works as follows.

Pruning algorithms for road networks. Previous studies primarily addressed kNN queries on road networks (Bast et al. 2006; Geisberger et al. 2008; Jing et al. 1998; Jung and Pramanik 2002; Samet et al. 2008; Sankaranarayanan et al. 2009). TNR Bast et al. (2006) and CH Geisberger et al. (2008) are the most representative examples. The basic idea of these approaches is to find and exploit a special class of nodes, namely transit nodes, which are nodes included in many shortest paths of a road network. By pre-computing the shortest path distance between every node and the transit nodes, these methods can reconstruct exact distances from the pre-computed values and effectively prune unnecessary computations during query processing. However, they generally require a large memory footprint because they must store the pre-computed distances among all possible pairs of nodes and transit nodes. In addition, as reported in the literature Zhong et al. (2015), their pruning approaches do not work effectively for local queries, in which the two nodes are very close in the graph. Hence, it is still difficult for them to support kNN queries on large-scale complex networks, which have a large volume of nodes with complex edge connections.

Indexing algorithms for road networks. To alleviate the overhead of the aforementioned pruning algorithms, several indexing approaches have recently been proposed for road networks (Zhong et al. 2015; Li et al. 2019; Abeywickrama and Cheema 2017). Before processing kNN queries, these approaches pre-compute the shortest path distances among several pairs of nodes in a network and construct an index structure over the network. Once a user specifies a query node, they explore the index and efficiently return the kNN nodes for the query node using the pre-computed distances.

For example, G-Tree (Zhong et al. 2015; Li et al. 2019) generates a hierarchical partition of a road network as an index, such that each subgraph is equal in size and sparsely connected with the others, using Metis Karypis and Kumar (1995). Subsequently, G-Tree pre-computes the shortest path distances between the nodes within each subgraph. In the query processing phase, it starts traversing kNN nodes from the subgraph that includes the query node and skips unnecessary distance computations by using the pre-computed values in the index. Similarly, ILBR Abeywickrama and Cheema (2017) generates an index based on a nested Voronoi diagram Okabe et al. (2000) and explores kNN nodes in the same manner.

However, as described in Sect. 1, these indexing methods suffer from three limitations in handling real-world graphs. First, they are not designed to handle complex networks with scale-free and small-world effects; hence, they require very expensive running costs to construct an index. For example, as demonstrated in Sect. 5, G-Tree consumes more than 24 h for indexing, whereas our proposed method constructs an index within a few seconds. Second, the complex structures of such networks impose a longer kNN query time than on road networks. This is because their indices effectively reduce the number of distance computations only if the network is close to a planar graph; in other words, they need to recompute distances if the network has complex edge connections, even though they pre-compute distances within the subgraphs. Finally, these methods are not designed to handle the node attributes attached to real-world complex networks. For these reasons, fast indexing and kNN querying algorithms, such as our proposed methods, are required to further enhance the quality of applications using real-world attributed complex networks.

7 Conclusion

In this paper, we proposed a novel fast indexing algorithm to efficiently compute kNN queries on large-scale attributed complex networks. Our algorithm separately constructs indexes for the core and the trees, reducing both the index construction time and the kNN query time. Consequently, the computations for the trees are dynamically pruned while ensuring that the returned results are the same as those of a kNN search on the graph. Furthermore, we extended our indexing method so that it can handle the node attributes of each node during kNN queries. In the experiments conducted in this study, our algorithm outperformed state-of-the-art methods by up to four orders of magnitude in terms of indexing and query processing time. These results indicate that our core-tree-aware indexing approaches are effective in reducing the expensive costs of graph-based kNN queries on complex networks.