1 Introduction

Recently, optimal location queries (OLQs) have been of intense interest in the spatial database community. Given a set of clients and a set of facilities, an optimal location query (OLQ) finds a location such that a certain cost function is optimized when a new facility is set up. Due to the widespread use of mobile devices and social network services, it is possible to collect the location information for each user. The increasing amount of location data can provide great opportunities to support market analysis and facility location decision-making problems; therefore, researchers have focused on the OLQ processing technique, such as the maximum influential location selection (Max-inf) problem. Specifically, the Max-inf problem in a road network is a real-life domain problem, so researchers have intensely studied such problems. The Max-inf problem in a road network [1,2,3] is a traditional problem that identifies the most influential object in a given network database. When potential objects P, existing facilities F, and client locations C are given, the Max-inf determines the location that attracts the most clients based on the assumption that each client uses the nearest facility, as depicted in Fig. 1. The influence of each facility is calculated using the following equations.

$$\begin{aligned}&\text {MaxINF(P)}={{\,\mathrm{arg\,max}\,}}_{p\in P}\sum _{c\in IS(p)} w(c), \end{aligned}$$
(1)
$$\begin{aligned}&\text {IS(p)}=\{c\in C|\forall f\in F, d(c,p)\le d(c,f)\}. \end{aligned}$$
(2)
Fig. 1
figure 1

Example of a Max-inf problem in a road network

In (2), d(, ) is a network distance function, and every client \(c\in C\) is associated with a positive weight w(c) that captures the importance of the client. Generally, every client has the same importance, so we set w(c) to 1 for all clients in this study. As mentioned above, the Max-inf is useful for market analysis, so it is beneficial in several real-life applications. For example, there are eight existing convenience stores \(\{f_1,f_2,\ldots ,f_8\}\), and eight clients, \(\{c_1,c_2,\ldots ,c_8\}\) in Fig. 1. We occasionally use the words “user,” “client,” and “customer” interchangeably. We assume that the query requester who is the owner of a new convenience store wants to know the best location to attract the most clients. Therefore, we aim to find the optimal location between four different potential locations, \(\{p_1,p_2,p_3,p_4\}\), to establish a new facility. As shown in Fig. 1, \(f_1\) is the nearest facility to \(\{c_8\}\), \(f_2\) is the nearest facility to \(\{c_1\}\), and \(f_3\) is the nearest facility to \(\{c_2, c_3\}\). If a new facility is established at \(p_1\), then it becomes the nearest facility to \(\{c_1,c_2\}\). In addition, if we select location \(p_2\), then we attract clients \(\{c_2,c_3\}\). However, if we select location \(p_3\), then only \(\{c_5\}\) is interested in the new facility. Nobody would be interested in \(p_4\). Therefore, \(p_1\) or \(p_2\) is the most attractive location.

Fig. 2
figure 2

Influence regions and the influence region overlapping problem

To solve this example, conventional approaches use computational geometry techniques with the exact locations of customers. In general, these techniques do not consider the location privacy problem. They assume that customers trust data analysts and agree to collect their exact location. However, people do not want to disclose their exact location, and location privacy has become an emerging issue in the spatial database community. To resolve such an issue, we propose a method to apply differential privacy to the Max-inf problem, that is a de-facto standard privacy preserving technique [4]. Differential privacy [5] exploits a randomized mechanism to add controlled noise in statistical query results. A naïve approach to applying differential privacy to a Max-inf problem is to add Laplace noise to (1). We call this value a noisy count. Then, we can find the best location by computing the potential location with the maximum noisy count. However, our previous work [6] discovered that this naïve approach suffers from poor accuracy while applying differential private techniques to the Max-inf problem in Euclidean space. In this paper, we show that the same problem occurs in a road network environment.

To resolve the Max-inf problem in a differentially private manner, we first invoke reverse nearest neighbor (RNN) queries in a road network for each potential location \(p_i\in P\) and identify clients whose nearest neighbor is \(p_i\). Then, we add Laplace noise to the number of clients and select the most influential location. However, if the potential locations are close to each other, then the client location information is disclosed recursively. For instance, the RNN clients of \(p_1\) and \(p_2\) are \(\{c_1,c_2\}\) and \(\{c_1,c_2,c_3\}\), respectively, as shown in Fig. 1. The intersection of the two RNN clients is \(\{c_1,c_2\}\), which indicates that their location is leaked twice by \(p_1\) and \(p_2\). This is because the influence regions of \(p_1\) and \(p_2\) overlap each other. The influence region is the network Voronoi region [7] of the corresponding potential location. The influence regions of \(p_1\) and \(p_2\) are described in Fig. 2. The red line is influence regions of \(p_1\), and the blue one is for \(p_3\). As shown in Fig. 2, these influence regions overlap each other, and clients, \(\{c_1,c_2\}\) are located in the overlapping region. This problem is called sequential composition [5] in differential privacy and requires dividing the privacy budget \(\epsilon \) by the number of potential locations. Thus, we should insert more noise proportional to the cardinality of P, which degrades the query accuracy exponentially due to the property of differential privacy. To remedy this issue, we propose a network Voronoi region partitioning method (NVPM) that partitions the influence regions of each potential locations to alter the problem from a sequential composition to parallel composition [8]. Although the NVPM mitigates the degradation of accuracy, it suffers from expensive computational cost. Therefore, we also propose a pruning technique called the network Voronoi envelope-based filtering method (NVEM), which precomputes the upper bound of the noisy count of the influence regions to reduce the search space.

In summary, our contributions are the following. First, we present a Max-inf problem in a road network with a differentially private user location. To the best of our knowledge, this study is the first attempt to apply differential privacy directly to the query processing of the Max-inf problem in a road network environment. Second, we present the influence region overlapping problem while applying differential privacy to the Max-inf problem in a road network. Third, we propose two algorithms to resolve the influence region overlapping problem, namely the NVPM and NVEM. Finally, we empirically study the properties of the proposed methods on actual road network datasets. The remainder of the paper is organized as follows. Section 2 reviews studies relevant to the Max-inf problem in a road network and differential privacy, and Sect. 3 formalizes the problem definitions and presents system models. Sections 4 and 5 propose two query-processing techniques, the NVPM and NVEM, respectively, and Sect. 6 provides the experimental results from actual datasets. Finally, Sect. 7 concludes the paper and presents directions for future works.

2 Related works and backgrounds

In this section, we briefly review previous studies of the Max-inf problem and differential privacy. The Max-inf problem is one of the OLQ, and it is closely relevant to RNN queries. Therefore, we first review RNN studies.

2.1 RNN queries

Korn et al. [9] first introduced RNN queries and proposed an r-tree-based query-processing technique to solve the RNN problem. Their approach is as follows. First, they computed the distance between each data object and its nearest neighbor, and they generated the circle with the nearest neighbor distance. Then, each point p is the RNN for query point q if its circle contains q. Yiu et al. [10] first studied the RNN queries in a road network environment. They proposed a filter and refinement approach called the eager algorithm, which is based on the network expansion technique. It expands the network from the query point q, like Dijkstra’s algorithm [11]. For each node n retrieved, the eager algorithm performs a range-NN query in the range of d(nq). If the data object p is retrieved, then the expansion does not proceed further from node n. As a result, it can prune node \(n'\), which is the shortest path between q and \(n'\) passes through n. Gao et al. [12] investigated reverse top-k Boolean spatial keyword queries (RkBSK) on road networks. An RkBSK retrieves the points that have query point q as one of their top-k Boolean spatial keyword query on the road network. The authors proposed the filter and refinement-based query-processing method with a connectivity-clustered access method (CCAM) [13]. Zhao et al. [14] proposed using reverse top-k geo-social keyword queries (RkGSK) in road networks. An RkGSK does not only consider spatial and textual information, but it also takes into account social data to identify prospective customers. The authors assume that customers are influenced by their friends, and it is important as much as the proximity and the textual similarity of objects. To support the RkGSK, they introduced a hybrid index structure the GIM-tree, which consists of the location, keywords, and social information for users and objects. They also proposed pruning strategies using the GIM-tree.

2.2 Max-inf problems

Cabello et al. [15] first introduced the Max-inf problem. As mentioned, the Max-inf finds the facility, o, that has the maximum influence. The influence refers to the number of clients whose nearest neighbor facility is o. The authors proposed the query-processing algorithm based on the nearest location circle (NLC). If the facilities are established outside of the NLC of c, it cannot be an RNN of c. Therefore, the optimal solution can be obtained by finding regions enclosed by the largest number of NLCs. Wong et al. [16] introduced the first polynomial-time algorithm, called the MaxOverlap, to support the Max-inf query. They also exploited NLCs to determine the optimal area in Euclidean space. They constructed the overlap table, which is an inverted list of each client. The overlap table consists of the pointer to each client c and the information for the other clients whose NLC overlaps with the NLC of c. Then, they determined the optimal region using the influence-based pruning technique. Yan et al. [17] presented a relaxation version of the Max-inf problem. They introduced the concept, the relaxed location circle (RNLC) of client c, which is the NLC with a buffer. They designed a kd-tree-based approach called fast influential location miner (FILM), which returns a small grid cell where all locations have an influence guarantee. Contrary to previous studies, the FILM finds a nearly optimal location in much less time. Zhang et al. [18] introduced the Min-dist location selection problem, which finds a location in a given rectangular query region Q that minimizes the average nearest neighbor distance. This query can be applied to the market analysis of a franchise restaurant because it returns the position that maximizes the benefit of the customers. To solve this problem, the authors proposed a progressive algorithm that divides the query region gradually until an answer set is found. Qi et al. [19] proposed the maximum nearest facility circle distance (MND) method to resolve the Min-dist problem efficiently. The MND is a variation of the minimum bounding rectangle, which is combined with the NLC. They exploited the r-tree with MND to solve the Min-dist problem. Moreover, Xiao et al. [1] and Yao et al. [2] first presented an optimal location selection query in a road network environment. They proposed the attraction set that records clients that are closer to each vertex than to their nearest facilities. They also exploited the graph partitioning technique to prune the unnecessary road segments with an attraction set. Chen et al. [20] solved the Min-max query in a road network. The Min-max query minimizes the maximum distance from a client to the closest facility. The authors introduced the nearest location component, which is a variant of NLC in a road network. They also proposed a piece-wise linear function-based algorithm with the nearest location component. Liu et al. [21] resolved the Min-max problem using multiple new servers. It is a more generalized version of the Min-max problem that finds locations of k servers to minimize the maximum distance of clients. The authors also employed the nearest location component and proposed the client cost line (CCL). The CCL is the piece-wise linear function of distance from a point to each client, and the intersection points of CCLs are potential server points. Chen et al. [3] presented kNN-based optimal location querying (KOLQ), which is a generalized Max-inf problem. The KOLQ assumes that each client can use the kNN facilities. It is more generalized than the nearest neighbor assumption. They extended the nearest location component to the k nearest location components and proposed a pruning technique by computing the upper bound count of each edge.

2.3 Differential privacy

Differential privacy was first introduced by Dwork [5], and its purpose is masking the evidence for a single individual. Its definition and properties are as follows.

Definition 1

(Differential privacy) Given two neighboring databases \(D_1\),\(D_2\), such that \(\Vert D_1-D_2\Vert _1\le 1\), a randomized mechanism M is \(\epsilon \)-differential private if the following condition holds for all \(S\subseteq Range(M)\):

$$\begin{aligned} Pr[M(D_1)\in S]\le exp(\epsilon )Pr[M(D_2)\in S]. \end{aligned}$$
(3)

where \(\epsilon \) is the privacy budget, and \(\Vert \cdot \Vert _1\) is the norm of a vector. The Laplace mechanism can achieve \(\epsilon \)-differential privacy that simply adds sampled Laplace noise to the query results. In this paper, we use the Laplace mechanism, but any other method can be applied, if it is \(\epsilon \)-differentially private. Differential privacy satisfies simple composition properties, which are called the sequential composition and parallel composition, as follows.

Definition 2

(Sequential composition) If we let \(M_1\) and \(M_2\) be two differential private mechanisms and their privacy budgets be \(\epsilon _1\) and \(\epsilon _2\), respectively, then their combination, \(M_{1,2}\), is \((\epsilon _1+\epsilon _2)\)-differentially private. Therefore, the composition of multiple differentially private mechanisms leads to a linear increase in the privacy budget or an increase in noise to maintain a fixed \(\epsilon \) total privacy budget.

Definition 3

(Parallel composition) Let \(M_i\) provide \(\epsilon _i\)-differential privacy and \(D_i\) be an arbitrary disjoint subset of database D. Then, the sequence of \(M_i(X\cap D_i)\) provides \(\max {(\epsilon _i)}\)-differential privacy.

The private spatial decomposition technique is the most related to our work among existing studies on differential privacy. Cormode et al. [22] first proposed differentially private spatial decomposition methods, which exploits the dataset partitioning technique to apply differential privacy to spatial data query processing. They instantiated a hierarchical tree structure, such as the kd-tree, to decompose a geometric space from large to small areas. Then, they added Laplace noise to the count for each node. Qardaji et al. [23] introduced the adaptive grid method to minimize the error of Laplace noise. They also identified the selection of partition granularity to balance errors from the noise error and uniformity error. Li et al. [24] proposed a two-stage mechanism for answering sets of range queries under differential privacy. The mechanism first partitions the domain into approximately uniform regions and then derives a bucket count using the workload queries. Zhang et al. [25] exploited a quadtree for spatial datasets. They proposed a threshold to determine the minimum of a subdomain and another threshold to limit the height of a quadtree. The amount of noise added to the quadtree can be limited to a constant using the two thresholds.

2.4 Limitation of existing studies

Most existing research on the Max-inf directly uses the information on clients to resolve the problem efficiently. If these techniques are applied to our problem, then the client information would be leaked recursively during query processing. Therefore, it is inevitable to insert more noise that is proportional to O(n|C|) where n is the number of iterations. This method leads to significant degradation of query accuracy. Meanwhile, previous studies of differential privacy have focused on the private spatial decomposition method to preserve spatial data privacy. However, private spatial decomposition is inadequate to handle the Max-inf problem due to data skewness which is called non-uniformity error. In other works [26,27,28,29] related to our problem, there are no studies have proposed a method to resolve the non-uniformity error. Generally, the facilities and users are not evenly distributed in a road network; therefore, it increases the non-uniformity error. In conclusion, existing studies are not suitable for solving the Max-inf problem in a differentially private manner; thus, we use the network Voronoi region-based approach and propose a pruning technique to remedy these problems.

Table 1 Frequently used symbols

3 Problem definition

We formally define the problems. Table 1 summarizes the notation frequently used in this paper. All data objects are represented by points in the road network. Thus, we define a road network first.

Definition 4

A road network is represented by graph \(G=(V,E)\) where V is a set of vertices, such as terminal points or road junctions, and E is a set of edges representing the network edges where each edge connects two vertices. Each edge \(e \in E\) is denoted as \(e(v_i,v_j)\) where \(v_i\) and \(v_j\) are starting and ending vertices, respectively. Any object point, o, including a potential location, an existing facility, and a client is located on an edge e.

In this paper, we assume that a road network is a weighted undirected graph because the direction is not important in our problem, and it is easily extensible to a directed graph. To resolve the Max-inf problem, we should compute the influence region of each potential location; therefore, we define the influence region in a road network as follows.

Definition 5

(Influence region in a road network) For any object o in the road network environment, the influence region of the potential location \(p\in P\) on the set \(F \cup \{p\}\) is a set of points in a road network that is denoted as V(p):

$$\begin{aligned} V(p) = \{o|d(p,o)\le d(p',o) , \forall p'\in F \cup \{p\} \wedge p\ne p'\}. \end{aligned}$$
(4)

The influence region of p encloses all of the client in IS(p), and only the clients in IS(p) are affected by p. We used the influence region to quickly identify IS(p) by redefining (2) as \(IS(p)=\{c|c\in C \wedge c\in V(p)\}\). Finally, the Max-inf problem in a road network using differentially private user locations is defined as follows.

Definition 6

(Max-inf in a road network with differentially private user locations (DP-Max-inf)) Given existing facilities F and client locations C, the DP-Max-inf determines a location that maximizes the influence with differentially private user locations between potential locations P, which are query-requester defined points on a road network.

$$\begin{aligned}&ncount(IS(p)) = \sum _{c\in IS(p)}w(c)+Lap(1/\epsilon '), \end{aligned}$$
(5)
$$\begin{aligned}&DPMaxINF(P) = {{\,\mathrm{arg\,max}\,}}_{p\in P}ncount(IS(p)). \end{aligned}$$
(6)

The DP-Max-inf simply changes the objective function from (1) to (5) by adding Laplace noise, and we denote this as a noisy count. To achieve \(\epsilon \)-differential privacy, \(\epsilon '\) must be \(\epsilon /|P|\) because of the influence region overlapping problem. However, it suffers from significantly poor performance accuracy, so we present an enhanced version of the sequential composition. First, it generates the influence regions of each potential location. Then, it determines the overlapping pairs of influence regions. Next, it counts the number of potential locations whose influence region overlap. Finally, it calculates (5) for each influence region with overlapping potential locations. This improves the accuracy by providing tighter noise bounds than the naïve approach because not all influential regions of the potential location overlap. However, it still suffers from a decrease in query accuracy as the number of potential locations increases. To remedy this problem, we propose a method that partitions an influence region with overlapping conditions.

4 Network Voronoi region-partitioning method

As mentioned in the last paragraph of the previous section, the sequential composition approach suffers from poor potential location scalability on the query accuracy. Therefore, we propose the NVPM to further improve accuracy.

4.1 Concept of the network Voronoi region-partitioning method

The NVPM exploits a parallel composition to mitigate the degradation of accuracy rather than using sequential composition. First, the NVPM constructs an influence region for each potential location. Next, it partitions each influence region using combinations of overlapping regions. Then, it computes the noisy count for each partitioned region and determines the nearly optimal location. As shown in Fig. 3, the influence regions of potential locations are partitioned using combinations of overlapping regions. The red line represents the influence region of \(p_1\), whereas the blue and green lines denote the influence regions of \(p_2\) and \(p_3\), respectively. These influence regions overlap each other, as depicted in Fig. 3. Therefore, the NVPM partitions them, so Region A consists only of the sub-region of \(V(p_3)\), which does not overlap with other influence regions. In contrasts, Region B is the overlapping region of \(V(p_1)\) and \(V(p_3)\). In addition, Region C is the overlapping region of \(V(p_1)\), \(V(p_2)\), and \(V(p_3)\). After partitioning the influence regions, the NVPM computes the noisy count for each partitioned region. For example, we can calculate the noisy count of Region A as \(ncount(\{o|o \in IS(p_1) \wedge o \notin IS(p_2) \wedge o \notin IS(p_3) \})\). Then, we can compute the total noisy count of \(IS(p_1)\) by summing the noisy counts of Regions A, B, and C.

Fig. 3
figure 3

The partitioned regions of influence regions

The simplest way to apply the NVPM directly is to follow the aforementioned steps. The shortest path tree (SPT) algorithm can be used to construct influence regions. The first step is to build an SPT from each vertex by expanding the edges of a road network until it meets the nearest existing facility. Then, each vertex v is associated with its nearest existing facility and nearest neighbor distance. Afterward, the influence region is generated by the construction of the SPT from a given potential location p. Let an expended edge be \(e(v, v')\). Assume that v is closer to p than \(v'\) and that the shortest paths from p to v are included in the influence region of p. Then, if d(pv) is less than the nearest neighbor distance of v, \(e(v,v')\) is included in the influence region. Otherwise, a boundary point on e is computed which is between two end vertices, v and \(v'\). Finally, the influence region of p is a set of road segments consisting of vertices and boundary points. Next, the combinations of overlapping regions must be computed to partition the influence regions. It takes exponential time complexity to compute the combinations of overlapping regions in a naïve way. In the last step, the noisy count is calculated using traversing for each partitioned regions. This naïve approach is inefficient because it traverses the road network repeatedly. Therefore, we introduce a method to construct an influence region, partitioned region, and noisy count simultaneously.

4.2 Query processing for network Voronoi region partitioning

In real-life applications, the size of a road network is too large to process in the main memory, so we designed a disk-based storage model to support the algorithms. We exploited the CCAM [13] index, which is the most popular index structure for query processing in a road network environment. It consists of an adjacency file and an inverted file for points in the road network, as depicted in Fig. 4. The CCAM collects the points of interests (POIs) that are on the same edge to store in one group. For every group, it maintains the information of the edge where the POIs are located and the number of POIs. In our setting, we stored the existing facilities and client locations as POIs with their labels. We computed the distance from the vertex using smaller ID for each POI. For example, an existing facility \(f_1\) is located on \(e(v_1, v_2)\), so the distance is calculated from \(v_1\) as 1. A B+-tree is employed to facilitate efficient access to adjacency files, where the vertices are sorted by the identifiers.

Fig. 4
figure 4

An example of CCAM index structure

Table 2 The overlapped Voronoi segment list example

As previously explained, the NVPM consists of the influence region construction, partitioned region generation, and computation of the noisy count. These steps are based on the network expansion approach, so we can reduce computation time by performing them concurrently during one traversal. Algorithm 1 provides the overall query-processing steps of our proposed NVPM using CCAM index. We first define the data structures used in Algorithm 1. The overlapped Voronoi segment is for the overlapping region of influence regions and is defined as follows.

Definition 7

An overlapped Voronoi segment (OVS) is a sub-edge that contains the following attributes.

  • eid: the identifier of the corresponding edge,

  • plist: the list of corresponding potential locations,

  • start: the starting point of OVS on the edge,

  • end: the ending point of OVS on the edge, and

  • clist: the list of customers located on the OVS, where each customer contains identification and the location.

Without loss of generality, we assume that vertices are ordered by their identifiers. Therefore, the following condition holds if the OVS is on \(e(v_i, v_j)\) where \(i < j\).

$$\begin{aligned} 0 \le start \le end \le d(v_i, v_j) \end{aligned}$$
(7)

Moreover, the influence region is represented by a set of triplets (eidstartend), where the attributes are the same as those of the OVS. While traversing a new edge \(e_i\), the NVPM checks the OVS of \(e_i\). If the OVS of \(e_i\) does not exist, it generates a new OVS of \(e_i\) with the corresponding potential location, start, end, and clients. Otherwise, it finds the overlapping regions with the existing start and end points. In addition, it partitions the OVS with a new segment. For example, the influence region of \(p_1\) is \(\{(e_1, 2.5, 3), (e_3, 0, 4.5), (e_{12}, 0, 0.5)\}\) in Fig. 3. Then, the OVS of \(e_1\) is \((e_1, \{p_1\}, 2.5, 3, \{ \})\), the OVS of \(e_3\) is \((e_3, \{p_1\},\) \(0, 4.5, \{c_1, c_2\})\), and the OVS of \(e_{12}\) is \((e_{12}, \{p_1\}, 0, 0.5, \{\})\). Next, the NVPM constructs the influence region of \(p_2\), which is \(\{(e_3, 1, 5.5)\}\). Because the OVS of \(e_3\) already exists, the NVPM partitions it as \((e_3, \{p_1\}, 0, 1, \{c_1\})\), \((e_3, \{p_1, p_2\}, 1,\) \(4.5, \{c_2\})\) and \((e_3, \{p_2\}, 4.5, 5.5, \{c_3\})\) (assuming that the ties are broken). Then, the final OVSs are generated and listed in Table 2. After generating the influence regions for all potential locations, we can obtain the noisy count of each partitioned region by adding the Laplace noise to the number of clients in the OVS that has the same plist. For example, \(ov_1\), \(ov_2\), and \(ov_6\) have the same plist, \(\{p_1, p_3\}\); therefore, they are merged, and the noisy count is \(1 + Lap(1/\epsilon )\).

figure a

5 Network Voronoi envelope-based filtering

In this section, we propose an NVEM. The NVPM suffers of long query-processing time because it does not use any pruning strategies. Even if no overlapping regions, the NVPM requires O(|P||V|log|V|) time complexity to generate the influence regions of potential locations where |V|log|V| is for the SPT construction, and |P| is for the iteration of potential locations. Thus, the time complexity of the NVPM increases proportionally to the cardinality of the potential locations. However, we observe that customers and facilities are generally skewed in real-world problems. Thus, most potential locations are much less influential than the optimal location. Therefore, we can reduce query-processing time if we know the upper-bound noisy count of the potential location in advance. Motivated by this observation, we propose a network Voronoi envelope to compute the upper-bound noisy count and a pruning technique to filter out unnecessary potential locations.

5.1 Network Voronoi envelope

To compute the upper-bound noisy count, we first find the upper-bound region of each influence region. We define this region as the network Voronoi envelope.

Definition 8

A network Voronoi envelope (NVE) is a partitioned sub-graph of G, such that the terminal vertices are existing facilities.

Figure 5 illustrates the NVEs of the previous example. As in the definition of NVE, the road network is partitioned by eight cells (a to g) where each cell has the existing facilities as terminal vertices. Then, the potential locations correspond to each cell where they are located. For example, \(p_1\), \(p_2\), and \(p_3\) are in cell d, and only \(p_4\) is in cell f. Thus, cell d is an NVE of \(p_1\), \(p_2\) and \(p_3\), and cell f is an NVE of \(p_4\). Let \(NVE(p_i)\) be a corresponding NVE of the potential location \(p_i\). Then, the influence region of \(p_i\) is inside of the \(NVE(p_i)\), because the network expansion of \(p_i\) stops when it meets the terminal facilities of \(NVE(p_i)\). Then, the upper-bound noisy count of \(p_i\), which is denoted \(UBncount(p_i)\), is the number of clients in the \(NVE(p_i)\) with Laplace noise. To explain this easily, we present the example of the upper-bound noisy count without noise. There are four clients, \(\{c_1, c_2, c_3, c_8\}\) in cell d, whereas only two clients, \(\{c_4, c_5\}\) exist in cell f. The \(UBncount(p_4)\) is 2, whereas \(UBncount(p_1)\), \(UBncount(p_2)\), and \(UBncount(p_3)\) are 4. The upper-bound noisy count is used to prune unnecessary potential locations during query processing. Because the NVEM uses the information of clients twice, we divide \(\epsilon \) into \(\epsilon _1\) and \(\epsilon _2\). In addition, \(\epsilon _1\) is used to compute the upper-bound noisy count, and \(\epsilon _2\) is used for query processing the NVPM. Then, the overall query processing of NVEM is \((\epsilon _1 + \epsilon _2)\) - differentially private.

Fig. 5
figure 5

Network Voronoi envelopes example

Fig. 6
figure 6

A B+tree with NVEs

5.2 CCAM index with network Voronoi envelope

As mentioned, the NVEM can filter out potential locations whose upper-bound noisy count is less than the current best location during query processing. Thus, we modified the B+-tree of CCAM to hold identifiers of edges instead of vertices and to store aggregation values of upper-bound noisy counts as depicted in Fig. 6. The leaf nodes of the B+-tree are composed of pointers to edges and the upper-bound noisy count for each edge. Let NVE(e) be a set of NVEs that intersect with e. Then, the upper-bound noisy count of e is the maximum value of NVE(e).

$$\begin{aligned} UBncount(e) = \max _{NVE\in NVE(e)}UBncount(NVE) \end{aligned}$$
(8)

The intermediate node stores the maximum upper-bound noisy count of the children nodes. Let N be the node of B+-tree and N.children be its children nodes. Then, the upper-bound of the noisy count of N is calculated as follows:

  1. (i)

    N is an intermediate node (including the root node):

    $$\begin{aligned} UBncount(N) = \max _{e\in N.children}UBncount(e) \end{aligned}$$
    (9)
  2. (ii)

    Otherwise:

    $$\begin{aligned} UBncount(N) = \max _{N'\in N.children}UBncount(N') \end{aligned}$$
    (10)

The intermediate nodes of the B+-tree only use the noisy count of NVEs that are partitioned cells of the road network. Therefore, it does not require an additional privacy budget due to the post-processing property of differential privacy [5]. In conclusion, each node of our modified B+-tree has the following attributes:

  • keys: the separation values that divide its sub-trees,

  • UBncount: the upper-bound noisy count, and

  • children: the list of pointers to children nodes.

5.3 Query processing of the network Voronoi envelope-based filtering

The NVEM traverses the B+-tree in the best-first search approach to find the corresponding edge of potential locations. First, the NVEM groups the potential locations on the same edge. Then, the NVEM traverses from the root node of the B+-tree and classifies potential locations according to their corresponding edge IDs while visiting each node. During traversing B+-tree, NVEM prunes out unnecessary potential locations whose upper-bound noisy cost is greater than the current best cost. Algorithm 2 presents the overall query-processing steps of the NVEM. The NVEM maintains the triplet \((N_i,UBncount(N_i),plist(N_i))\) in a maxheap sorted by \(UBncount(N_i)\), where \(N_i\) is the node of B+-tree and \(plist(N_i)\) is a set of corresponding potential locations. Each p in \(plist(N_i)\) has the edge id where p is located and the flag whether p is already checked or not. Then, the NVEM dequeues the triplet \((N_i,UBncount(N_i),plist(N_i))\) of the maxheap at each step and performs the appropriate tasks according to the following cases.

  • \(N_i\) is a leaf node of the B+-tree:

    1. 1.

      Filter out any \(p \in plist(N_i)\) already checked.

    2. 2.

      For each \(p \in plist(N_i)\), compute the NVE(p) and find the potential locations \(P_{NVE(p)}\) inside of NVE(p).

    3. 3.

      Invoke NVPM with \(P_{NVE(p)}\) for each \(p \in plist(N_i)\).

  • \(N_i\) is an intermediate node of the B+-tree:

    1. 1.

      Partition \(plist(N_i)\) according to the key of child node \(N_c\).

    2. 2.

      Enqueue \((N_c,UBncount(N_c),plist(N_c))\) to the max-heap for each child node \(N_c\).

We explain the algorithm with previous example. First, the NVEM enqueues \((N_0, 4, \{p_1, p_2, p_3, p_4\})\) to the maxheap. Then, the NVEM dequeues \((N_0, 4, \{p_1, p_2, p_3, p_4\})\) at the next iteration. The NVEM partitions \(\{p_1, p_2, p_3, p_4\}\) according to their locations. As shown in Fig. 6, \(p_1\) and \(p_2\) are on e3, \(p_4\) is on e4, and \(p_3\) is on e10. Further, e3, e4 are on the sub-trees of \(N_1\), and e10 is a descendent of \(N_2\). Therefore, the NVEM partitions the group of potential locations as \(\{p_1, p_2, p_4\}\) and \(\{p_3\}\). Next, it enqueues \((N_1, 4, \{p_1, p_2, p_4\})\) and \((N_2, 4, \{p_3\})\) to the maxheap. Again, the NVEM dequeues \((N_1, 4, \{p_1, p_2, p_4\})\) from the maxheap. The NVEM does not partition the \(\{p_1, p_2, p_4\}\) because they are on the same node, \(N_4\). Therefore, the NVEM enqueues \((N_4, 4, \{p_1, p_2, p_4\})\). Next, the NVEM dequeues \((N_2, 4, \{p_3\})\), and the NVEM enqueues \((N_7, 2, \{p_3\})\) to the maxheap. Then, \((N_4, 4, \{p_1, p_2, p_4\})\) pops out, and the potential locations, \(\{p_1, p_2, p_4\}\), correspond to the same NVE. Therefore, it invokes the NVPM with \(\{p_1, p_2, p_4\}\) and computes the noisy count of each potential location. The noisy count of \(p_2\) is 3, and it is the current best location. Finally, the upper-bound noisy count of \((N_7, 2, \{p_3\})\) is less than the current best noisy count; therefore, the algorithm terminates and returns \(p_2\) as the nearly optimal location.

figure b
Table 3 Experimental settings

6 Experimental results

In this section, we report the experimental results of the proposed method. Section 6.2 studies the scalability of proposed methods with varying dataset sizes. Section 6.3 evaluates the performance of the methods with varying parameters. Section 6.4 reveals the index build time and maintenance cost.

6.1 Experimental environments

We evaluate our proposed methods, NVPM and NVEM using two real road networks, namely CAL [30] and NA [31]. No dataset was proper for our problem in road network settings, so we generated synthetic datasets based on CAL and NA. To evaluate the performance with generality, we generated 10 sub-datasets of each road network, and the experiments are conducted 10 times with varying parameters. The results of existing facility scalability are excluded because they had the opposite effect from those of potential locations. The CAL dataset is an actual road network dataset of California that contains 21,048 vertices, 21,693 edges, and 85,070 POIs. We divided the POIs into potential locations and existing facilities. Next, we randomly generated client locations based on the Zipfian distribution, varying the skewness of the data. We sampled the existing facility from the zipf distribution and computed the distance between it and nearest vertex. Then, we generated the locations of each user from the zipf distribution using the calculated distance. In contrasts, the NA dataset is an actual road network dataset for North America that contains 175,813 vertices and 179,179 edges. It has no POI information, so we generated the POIs from a uniform distribution. The dataset configuration is the same as that of the CAL dataset. Table 3 summarizes the experimental setup in which the bold text indicates the default settings. The experiments were conducted using an Ubuntu 14.04 operating system with an Intel(R) Xeon(R) CPU E5-2620 v2@2.10 GHz and 64 GB RAM. We compared the enhanced version of sequential composition method (SCM) we explained in Sect. 3, the private spatial decomposition method (PSD), and proposed methods (NVPM and NVEM). The PSD is based on PrivSem [32], which is the most recent private spatial decomposition method in a road network environment. To determine the nearly optimal location with PSD index, we constructed each influence region of the potential locations. Then, we found the overlapping areas of the influence region and PSD index. Next, we computed the noisy count of each potential location with the overlapping ratio. Finally, we determined the potential location using the highest noisy count. We measured mean absolute error (MAE), query accuracy, and query-processing time for the optimal location selection. If we let \(p_o(i)\) be an actual optimal location and \(p_m(i)\) be the query output of each method in the \(i_{th}\) iteration, then, the query error indicates the difference between the number of clients attracted by \(p_o(i)\) and \(p_m(i)\). The query accuracy is the number of times that \(p_m(i)\) is the same as \(p_o(i)\). With the number of test cases, \(n_t\), the MAE and query accuracy are calculated as follows.

$$\begin{aligned}&MAE = \sum _{i=1}^{n_t}\dfrac{|IS(p_o(i))|-|IS(p_m(i))|}{n_t}, \end{aligned}$$
(11)
$$\begin{aligned}&ACC = \sum _{i=1}^{n_t}\dfrac{I(p_m(i))}{n_t}, I(x) = {\left\{ \begin{array}{ll} 0 &{} x\ne p_o(i) \\ 1 &{} x = p_o(i) \end{array}\right. }. \end{aligned}$$
(12)
Fig. 7
figure 7

The scalability of potential locations

Fig. 8
figure 8

The scalability of clients

6.2 Scalability test

Figure 7 illustrates the results of the potential location scalability. We evaluated the accuracy, MAE, and query time by varying the number of potential locations. As depicted in the figures, all methods degrade the MAE and accuracy as the number of potential locations increases. However, the NVPM outperforms other methods in terms of MAE and accuracy. The accuracy of the NVEM is slightly worse than the NVPM, but the query-processing time is significantly faster than that of the others. The accuracy of the SCM is better than NVEM when the number of potential locations is small. However, it becomes worse than NVEM as the number of potential locations increases because the influence region overlapping problem becomes worse as |P| increases. The results of client scalability are described in Fig. 8, showing that the accuracy of all methods increases as the number of users increases. In contrast, the MAE decreases as the number of users increases because that the effect of Laplace noise on the query results decreases as the number of clients increases. The query-processing time increases in all methods as the number of users increases, except for the NVEM. Because the clients are skewed in a road network, the pruning probability increases as the number of clients increases. Moreover, the PSD is less affected than other methods in the query-processing time. The PSD just computes the overlapping areas of each influence region and the PrivSem index; thus, the query-processing time is independent of the number of clients.

Fig. 9
figure 9

The effect of \(\epsilon \)

Fig. 10
figure 10

The effect of \(\epsilon \) ratio

Fig. 11
figure 11

The effect of \(\theta \)

Fig. 12
figure 12

Index build time and storage

6.3 Parameter test

The second set of experiments demonstrates the performance evaluation using varying parameters. We evaluated the effectiveness of the total privacy budget \(\epsilon \), \(\epsilon \) ratio, and skewness parameter \(\theta \). As explained in Sect. 5, we divided the total privacy budget \(\epsilon \) into \(\epsilon _1\) and \(\epsilon _2\). The \(\epsilon \) ratio \((\alpha )\) indicates the budget amount used to construct the index. We evaluated the accuracy and MAE by varying the total privacy budget \(\epsilon \) from 0.25 to 4. The experimental results are indicated in Fig. 9. As expected, the accuracy increases in the overall methods as \(\epsilon \) increases, except for the PSD. As mentioned in Sect. 2, the PSD suffers from the uniformity error; thus, its accuracy does not improve as \(\epsilon \) increases. In general, the NVPM outperforms the other methods, and the NVEM performs second best in terms of accuracy and MAE. However, the query-processing time of NVEM is the best, and it is not affected by the privacy budget. Figure 10 displays the results of the experiments that vary the \(\epsilon \) ratio, \(\alpha \). They demonstrate that the accuracy and MAE of the NVEM decrease as \(\alpha \) increases, and they are the best when \(\alpha \) is 0.3 in CAL and 0.1 in NA. The noise added in the index construction and query processing is the same at \(\alpha = 0.3\) in CAL and \(\alpha = 0.1\) in NA. Moreover, the NVEM is second best in terms of MAE and accuracy except when the \(\alpha \) is 0.9. In addition, \(\alpha \) does not affect the query processing time of the NVEM. Figure 11 displays the results of the experiments at varying client skewness, \(\theta \). As \(\theta \) increases, the clients become more skewed. Every scheme does not have good accuracy when the clients are uniformly distributed, but the schemes become better as \(\theta \) increases. The query-processing time is not affected by \(\theta \) for most methods, but the NVEM becomes slightly faster as \(\theta \) increases.

6.4 Index build time and storage

Figure 12 describes the results of the last experiments for the index build time and storage. We compared the index build time of the NVEM with the CCAM index structures. Because the scalability of potential locations does not affect the build time of the NVEM, we conducted the experiments for the clients. The NVEM exhibits inferior performance in all datasets. However, the index build time is not on-demand service time, but precomputation time. In addition, the index size of NVEM is larger than that of CCAM, but the storage of NVEM is almost the same as the number of clients increases.

7 Conclusion and future works

In this paper, we presented the influence region overlapping problem while applying differential privacy to the Max-inf problem in a road network. To resolve the problem, it is inevitable to insert more noise due to the sequential composition of differential privacy. However, this naïve method suffers from poor query accuracy. Therefore, we proposed a novel method called the network Voronoi partition method to process the Max-inf query with differentially private user locations in a road network. We also proposed a Voronoi envelope-based filtering heuristic to significantly reduce the query-processing time with a slight degradation in accuracy. We conducted experiments with the real road network datasets and demonstrated that the NVEM is the best method in terms of accuracy and query-processing time. Compared to the NVMP, the NVEM had a less than 10% degradation of the accuracy, while query-processing time is reduced more than 90%. In conclusion, our proposed methods are applicable for determining an adequate location in a differentially private manner, such as the analysis of trade areas and establishment of public facilities, while preserving the individual locations of users. In the future, we aim to optimize the parameter of the \(\epsilon \) ratio and extend the proposed methods to other OLQs.