1 Introduction

Location-based services (LBS), such as point-of-interest (POI) search and route suggestion, have been increasingly used in recent years. Consequently, a large number of location traces (time-series location trails) are accumulated by LBS providers. An LBS provider can provide the location traces to a third party to perform various geo-data analyses such as finding popular POIs [54], modeling human mobility patterns [24], and semantic annotation of POIs [51]. However, the disclosure of the traces can lead to a serious privacy issue because they may include sensitive locations, e.g., homes and hospitals. In addition, several attack methods have been developed to identify users’ behaviors [27, 53] or to re-identify users from pseudonymized traces [12, 30, 31].

Many privacy-preserving location synthesizers have been proposed to address this privacy issue [3, 4, 13, 16, 28, 29]. These approaches first train a generative model from real location traces. Then they generate synthetic traces based on the trained generative model. Ideally, synthetic traces preserve various statistical features, such as a population distribution [54] and transition matrix [47], while strongly protecting user privacy. The preserved statistical features play a significant role in the geo-data analysis. Moreover, applications of synthetic traces are not limited to geo-data analysis. For example, they are useful for research purposes [20, 33] and competitions [28, 38].

Existing location synthesizers, however, do not consider friendship information between users. This information matters because friends tend to visit the same place at the same time [50]; such an event is called a co-location [35, 36]. Incorporating co-locations of friends thus makes synthetic traces more realistic. For example, a recent study [50] shows that there is a correlation between co-locations and friendships on Twitter, so synthetic traces including co-locations of friends can be used as a dataset to study the effectiveness of location-based friend suggestion algorithms.

In this paper, we propose a novel location synthesizer that synthesizes location traces including co-locations of friends. To preserve co-location information, our proposed method trains two parameters: a friendship probability and a co-location count matrix. The friendship probability is the probability that two users are friends. The co-location count matrix is composed of a co-location count for each time instant and each location. Thus, it models which locations friends are likely to visit together in a certain time period, e.g., amusement parks in the daytime and restaurants at night.

Our location synthesizer works as follows. First, we train the two parameters using a friendship (or social) graph and location traces, which we call the training graph and training traces, respectively. We train the friendship probability from the training graph and the co-location count matrix from the training traces. Then, we generate a synthetic graph using the friendship probability. We generate co-locations in synthetic traces using the synthetic graph and the co-location count matrix. Finally, we generate the remaining locations using an existing location synthesizer [4, 13, 16] based on the Markov chain model that provides differential privacy (DP).

One promising feature of our location synthesizer is that both parameters provide a strong privacy guarantee: DP [10], which is known as a gold standard for data privacy. The friendship probability provides \(\epsilon _1\)-(bounded) node DP [21], a strong type of DP on graphs, for the training graph. The co-location count matrix provides \(\epsilon _2\)-user-level DP [9], a strong type of DP on time-series data, for the training traces. The parameters \(\epsilon _1\) and \(\epsilon _2\) are non-negative real values called privacy budgets [10]. It is well known that DP strongly protects user privacy when they are small, e.g., smaller than 1 or 2 [18, 34].

We use the existing synthesizer providing \(\epsilon _3\)-user-level DP for the training traces [4, 13, 16] to generate locations other than co-locations. Then, by the composition theorem in DP [10], the entire synthetic traces provide \(\epsilon _1\)-node DP for the training graph and \((\epsilon _2 + \epsilon _3)\)-user-level DP for the training traces. As with [4, 13, 16], we use the privacy budget \(\epsilon _3\) to preserve statistical features, such as a population distribution [54] and transition matrix [47]. We additionally use \(\epsilon _1\) and \(\epsilon _2\) to preserve the information about co-locations, which has not been considered in the existing synthesizers. \(\epsilon _1\) and \(\epsilon _2\) are additional costs required to incorporate co-locations into synthetic traces. Table 1 summarizes the existing synthesizers [4, 13, 16] and our synthesizer.

Through comprehensive experiments, we show that our synthesizer preserves the information about co-locations and other statistical features (e.g., population distribution, transition matrix) with reasonable privacy budgets, e.g., \(\epsilon _1 = 0.2\) and \(\epsilon _2= \epsilon _3 = 1\).

Table 1 The existing and our location synthesizers

1.1 Our contributions

In summary, we provide the following contributions:

  • We propose a novel location synthesizer that generates location traces including co-locations of friends. To our knowledge, we are the first to synthesize traces including co-locations. Our synthesizer models the information about co-locations of friends with two parameters: friendship probability and co-location count matrix. The friendship probability provides node DP, and the co-location count matrix provides user-level DP.

  • We evaluate our synthetic traces using two real datasets: the Foursquare [50] and Gowalla [25] datasets. Our experimental results show that our synthetic traces preserve the information about co-locations and other statistical features (e.g., population distribution, transition matrix) while satisfying DP with reasonable privacy budgets, e.g., 0.2-node DP (\(\epsilon _1 = 0.2\)) for the training graph and 2-user-level DP (\(\epsilon _2= \epsilon _3 = 1\)) for the training traces.

This paper is a significant extension of the previously published conference paper [32]. The main enhancements are as follows:

  • In [32], we did not discuss the overall privacy guarantee for the entire dataset including the training graph and training traces. In this paper, we define total DP (Definition 3) to provide the overall privacy guarantee for the entire dataset and prove that our synthesizer provides total DP (Sect. 4.5).

  • In [32], we generated a synthetic graph based on the Erdös-Rényi (ER) model [5]. However, it is well known that the ER model does not reflect the properties of real-world graphs. Specifically, most real-world graphs have a power-law degree distribution [1], whereas the ER model does not. In this paper, we also generate a synthetic graph based on the Barabási-Albert (BA) model [1], which has a power-law degree distribution and is therefore much more realistic. Our BA graph also provides node DP. We show through experiments that our BA graph has a degree distribution close to that of the training graph (Sects. 4 and 5).

  • In [32], we evaluated our synthesizer using only one dataset: the Foursquare dataset. In this paper, we add the Gowalla dataset to make the evaluation more comprehensive (Sect. 5).

  • In [32], we did not show examples of co-locations preserved in our synthesizer. In this paper, we show ten pairs of locations and time instants where co-location events are most likely to occur for each dataset. Using these examples, we show how well our synthesizer preserves the information about co-locations (Sect. 5).

  • In [32], we evaluated the utility of the probability distributions (e.g., population distribution, transition matrix) using only the mean absolute error (MAE) and the mean squared error (MSE). In this paper, we also evaluate their utility using the Kullback–Leibler (KL) divergence and the Jensen–Shannon (JS) divergence. These utility metrics also make our evaluation more comprehensive (Sect. 5).

  • In [32], we did not provide the details of how to apply Privelet [49] in our location synthesizer. In this paper, we provide the details of applying Privelet and provide a proof that Privelet provides user-level DP (Appendix 1).

1.2 Paper organization

The rest of this paper is organized as follows. In Sect. 2, we review the previous work closely related to ours. In Sect. 3, we introduce some preliminaries for our work. In Sect. 4, we propose our location synthesizer. In Sect. 5, we show our experimental results. Finally, in Sect. 6, we conclude this paper with future directions of this work.

2 Related work

2.1 Co-locations

A co-location refers to an event in which two users are in the same place at the same time. In particular, we focus on co-locations of friends and generate synthetic traces on the basis of such co-locations.

Co-locations have been widely studied, especially regarding their impact on location privacy and their relationship with friendships. Olteanu et al. [35, 36] showed that co-location information improves the accuracy of location inference attacks. The users’ benefits of sharing co-locations and the impact of co-locations on location privacy were also studied in [37]. Yang et al. [50] showed that there is a correlation between co-location information and friendships on Twitter. However, to the best of the authors’ knowledge, there are no existing studies that use co-locations to synthesize location traces.

2.2 Location synthesizers

The generation of synthetic location traces has long been recognized as an important research subject; see [3, 29] for detailed surveys. Bindschaedler and Shokri [3] developed a synthetic location generation algorithm considering semantic features of locations. For example, most people tend to stay overnight at their homes, which are geographically different but semantically the same. Their synthesizer preserves this kind of information while satisfying their own privacy notion called plausible deniability. Bindschaedler et al. [4] also proposed a synthetic data generator for synthesizing various types of data with DP. This synthesizer can be applied to various data, including location traces [29]. In the case of location traces, the synthetic data generator in [4] trains a transition matrix common to all users as a generative model (see [29] for details). Some studies [13, 16] proposed more sophisticated algorithms for generating synthetic traces with DP using a transition matrix common to all users. Murakami et al. [29] proposed a method to generate synthetic traces with high utility based on the observation that there should be a small number of typical groups of users, e.g., those who often go to malls and those who frequently go to offices. Specifically, they clustered per-user transition matrices using tensor factorization. They also applied a modified version of their algorithm to a location anonymization contest [28]. The synthesizers in [3, 28, 29] do not provide DP, whereas the synthesizers in [4, 13, 16] do.

Thus, many studies have addressed the synthetic generation of location traces. However, to the best of the authors’ knowledge, no generation method using co-locations has been proposed so far. As explained above, it is shown in [50] that there is a correlation between co-locations and friendships, i.e., friends tend to be in the same place at the same time. Hence, to synthesize more realistic traces, it is important to take co-location information into account when generating location traces.

Finally, a recent study [48] empirically showed that synthetic data did not provide a better trade-off between privacy and utility for data analysis than a traditional anonymization (generalization and deletion) technique satisfying k-anonymity. In the case of geo-data analysis, the result in [48] indicates that location synthesizers might not provide a better empirical trade-off between privacy and utility than location obfuscation (e.g., generalization, deletion) methods providing k-anonymity [2, 6]. However, synthesizers are useful not only for data analysis but also for generating datasets for research [20, 33] or competitions [28, 38]. These important applications cannot be realized by generalization and deletion. We aim to generate a synthetic yet realistic dataset that is useful for research or competitions by preserving various statistical features, including co-locations of friends.

3 Preliminaries

In this section, we introduce some preliminaries for our work. In Sect. 3.1, we define basic notations used in this paper. In Sect. 3.2, we explain friendship graphs and location traces. In Sect. 3.3, we describe our threat model and review differential privacy.

3.1 Basic notations

Below, we define the basic notations used in this paper. Let \(\mathbb {R}\), \(\mathbb {R}_{\ge 0}\), \(\mathbb {N}\), and \(\mathbb {Z}_{\ge 0}\) be the set of real numbers, non-negative real numbers, natural numbers, and non-negative integers, respectively. For a finite set \({\mathcal {Z}}\), let \({\mathcal {Z}}^*\) be the set of all finite sequences of elements of \({\mathcal {Z}}\). Let \({\mathcal {P}}({\mathcal {Z}})\) be the power set of \({\mathcal {Z}}\). For \(a\in \mathbb {N}\), let \([a] = \{1, 2, \ldots , a\}\). We represent a matrix as a bold capital letter, such as \(\textbf{M}\). We denote the i-th row of the matrix \(\textbf{M}\) by \(\textbf{M}_i\) and the (i, j)-th element of \(\textbf{M}\) by \(\textbf{M}_{ij}\).

Table 2 Basic notations

We follow the notations in [29] to define users, locations, and time. Specifically, let \({\mathcal {U}}\) be a finite set of users in training data. Let \(n \in \mathbb {N}\) be the number of users, i.e., \(n = |{\mathcal {U}}|\). Let \(u_i \in {\mathcal {U}}\) be the i-th user, i.e., \({\mathcal {U}} = \{u_1, \ldots , u_n\}\). We consider discrete locations. For example, we can divide an area of interest into some regions or extract some POIs. Let \({\mathcal {X}}\) be a finite set of locations. Let \(x_{i}\) be the i-th location. We also consider a discrete version of time, called time instant. For example, we can round down minutes to a multiple of 20. Let \({\mathcal {T}}\) be a finite set of time instants. Let \(t_i \in {\mathcal {T}}\) be the i-th time instant.

The basic notations used in this paper are shown in Table 2. Symbols that are not explained in Sect. 3.1 will be explained in Sect. 3.2.

3.2 Friendship graphs and location traces

Friendship graphs  A friendship (or social) graph includes friendship information between any pair of users. It is represented as an undirected graph, where a node represents a user and an edge represents that two users are friends. The friendship graph can also be represented as an adjacency matrix of size \(n \times n\). In the adjacency matrix, an element between two friends is set to 1, and an element between two users who are not friends is set to 0. Diagonal elements are set to 0 because users are not friends with themselves.

Figure 1 shows examples of the friendship graph and the corresponding adjacency matrix in training data. In this work, we call them the training graph and the training adjacency matrix. In this example, user \(u_1\) is friends with \(u_3\) and \(u_4\), and user \(u_6\) is friends with only \(u_4\).

Fig. 1 Examples of the training graph and the corresponding adjacency matrix (\(n=6\))

Formally, let \(\textbf{A}\in \{0,1\}^{n \times n}\) be a training adjacency matrix. \(\textbf{A}_i\) is the i-th row of \(\textbf{A}\). In Fig. 1, \(\textbf{A}_1 = (0,0,1,1,0,0)\), \(\ldots \), \(\textbf{A}_6 = (0,0,0,1,0,0)\).
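As a concrete illustration (our own sketch, not code from the paper), the following Python snippet builds an adjacency matrix from an edge list containing only the friendships stated above, i.e., (\(u_1, u_3\)), (\(u_1, u_4\)), and (\(u_4, u_6\)); the full graph in Fig. 1 contains additional edges:

```python
import numpy as np

def build_adjacency(n, friend_pairs):
    """Build a symmetric n x n adjacency matrix with a zero diagonal
    from a list of 1-indexed friend pairs (i, j)."""
    A = np.zeros((n, n), dtype=int)
    for i, j in friend_pairs:
        A[i - 1, j - 1] = 1
        A[j - 1, i - 1] = 1  # friendship is undirected
    return A

# Only the friendships explicitly mentioned in the text (Fig. 1 has more edges).
A = build_adjacency(6, [(1, 3), (1, 4), (4, 6)])
print(A[0])  # row A_1 = [0 0 1 1 0 0]
print(A[5])  # row A_6 = [0 0 0 1 0 0]
```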

Location traces  A location trace includes a location at each time instant. A pair of a location and a time instant is called an event [29, 45].

Figure 2 shows examples of the location traces in training data, which we call the training traces. Co-location events are marked in red. In this example, users \(u_1\) and \(u_2\) have a co-location event at location \(x_4\) and time instant \(t_2\). Users \(u_2\) and \(u_3\) have a co-location event at \(x_1\) and \(t_3\).

Fig. 2 Examples of the training traces (\(n=6\), \(|{\mathcal {X}}|=5\), \(|{\mathcal {T}}|=4\))

Let \({\mathcal {E}} = {\mathcal {X}} \times {\mathcal {T}}\) be a finite set of events. Let \({\mathcal {R}} = {\mathcal {U}} \times {\mathcal {E}}^*\) be a finite set of traces. Let \({\mathcal {S}} \subseteq {\mathcal {R}}\) be a finite set of training traces. Let \(s_i \in {\mathcal {S}}\) be the i-th training trace. In Fig. 2, \(s_1 = (u_1, (x_2,t_1), (x_4,t_2),(x_5,t_3),(x_1,t_4))\), \(\ldots \), \(s_6 = (u_6, (x_5, t_1), (x_4, t_2), (x_3, t_3), (x_4, t_4))\), and \({\mathcal {S}} = \{s_1, s_2, s_3, s_4, s_5, s_6\}\). Note that although each trace includes four events in Fig. 2, we do not assume that the training traces have the same length for all users. In fact, the trace lengths differ across users in our experiments.
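The traces themselves can be stored as plain tuples. The sketch below (ours; names are illustrative) encodes \(s_1\) and \(s_6\) from Fig. 2 and lists the events shared by two traces, which is exactly the co-location condition of Sect. 2.1; note that our synthesizer later counts only shared events between friends.

```python
# A trace is (user, [(location, time instant), ...]); labels follow Fig. 2.
s1 = ("u1", [("x2", "t1"), ("x4", "t2"), ("x5", "t3"), ("x1", "t4")])
s6 = ("u6", [("x5", "t1"), ("x4", "t2"), ("x3", "t3"), ("x4", "t4")])

def shared_events(trace_a, trace_b):
    """Return the (location, time instant) events that two traces have in common."""
    _, events_a = trace_a
    _, events_b = trace_b
    return sorted(set(events_a) & set(events_b))

# u_1 and u_6 are both at x_4 at time t_2; whether this counts as a co-location
# of friends depends on the training graph (u_6 is friends with u_4 only).
print(shared_events(s1, s6))  # [('x4', 't2')]
```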

3.3 Threat model and differential privacy

Threat model  We use training data that includes a friendship graph and location traces to generate synthetic traces. We assume that the number n of users in the training data is public. We also assume that an adversary may have arbitrary background knowledge other than the training data. The adversary obtains the synthetic traces and attempts to violate user privacy in the training data on the basis of the synthetic traces and the background knowledge. For example, the adversary may perform a membership inference attack [39, 44], which infers whether a location trace of a specific user is included in the training data.

To strongly protect user privacy in the training data from such an adversary, we use differential privacy (DP) [8, 10] as a privacy metric, which protects users against adversaries with arbitrary background knowledge. Below, we explain DP for training graphs and training traces.

DP for training graphs  There are two types of DP on graphs: edge DP and node (or vertex) DP [21, 46]. Edge DP hides the existence of one edge, i.e., one friendship. In contrast, node DP hides the existence of all edges connected to one node. Therefore, node DP guarantees much stronger privacy than edge DP and is much more difficult to attain [15, 19, 41]. To strongly protect user privacy in the training graphs, we tackle this challenge and use node DP as a privacy metric.

The original definition of node DP [15] follows the direction of unbounded DP [22], where a neighboring graph is obtained by removing one node. Since we assume that n is public (as described in Sect. 3.3 “Threat Model”), we use node DP in [21] that follows the direction of bounded DP [22], where a neighboring graph is obtained by changing at most \(n-1\) edges of one node. Bounded node DP is also much stronger than edge DP because it hides all sensitive edges of a user.

Fig. 3 Example of neighboring adjacency matrices \(\textbf{A}\) and \(\textbf{A}'\) in (bounded) node DP [21] (\(n=6\))

Formally, (bounded) node DP in [21] considers two neighboring adjacency matrices \(\textbf{A}\) and \(\textbf{A}'\) such that \(\textbf{A}'\) is obtained by an arbitrary rewiring of edges connected to one node. In other words, \(\textbf{A}\) and \(\textbf{A}'\) differ in at most \(n-1\) edges of one user. Fig. 3 shows an example of two neighboring adjacency matrices \(\textbf{A}\) and \(\textbf{A}'\) (\(n=6\)). In this example, \(\textbf{A}\) and \(\textbf{A}'\) differ in \(n-1=5\) edges connected to \(u_3\).

Using the notion of neighboring adjacency matrices, (bounded) node DP is defined as follows.

Definition 1

[\(\epsilon _1\)-(bounded) node DP [21]] Let \(\epsilon _1 \in \mathbb {R}_{\ge 0}\). A randomized mechanism \({\mathcal {M}}_1\) with domain \(\{0,1\}^{n \times n}\) provides \(\epsilon _1\)-node DP if for any two neighboring adjacency matrices \(\textbf{A}, \textbf{A}' \in \{0,1\}^{n \times n}\) that differ in at most \(n-1\) edges of one user and any \(z \in \textrm{Range}({\mathcal {M}}_1)\),

$$\begin{aligned} \Pr [{\mathcal {M}}_1(\textbf{A}) = z] \le e^{\epsilon _1} \Pr [{\mathcal {M}}_1(\textbf{A}') = z]. \end{aligned}$$
(1)

By (1), if \(\epsilon _1\) is close to 0, then \(\textbf{A}\) and \(\textbf{A}'\) are almost equally likely. Thus, an adversary who obtains the output of \({\mathcal {M}}_1\) cannot determine whether it came from \(\textbf{A}\) or \(\textbf{A}'\). If the privacy budget \(\epsilon _1\) is small (e.g., smaller than 1 or 2 [18, 34]), each user’s privacy is strongly protected.

DP for training traces  For time-series data such as training traces, there are two types of DP: event-level DP and user-level DP [9]. Event-level DP protects one event in time-series data. In contrast, user-level DP protects the entire history (i.e., entire time-series data) of one user. Thus, user-level DP guarantees much stronger privacy than event-level DP and is much more difficult to attain. To strongly protect user privacy in training traces, we use user-level DP.

Formally, user-level DP for training traces considers two neighboring sets \({\mathcal {S}}\) and \({\mathcal {S}}'\) of traces such that \({\mathcal {S}}'\) is obtained by changing the entire trace of one user in \({\mathcal {S}}\). For example, consider a set \({\mathcal {S}}'\) of traces obtained by changing \(s_2\) in Fig. 2 to \(s'_2\) as follows: \(s'_2 = (u_2, (x_1,t_1),(x_1,t_2),(x_1,t_3))\) (note that the trace length can also be changed). In this example, \({\mathcal {S}}\) and \({\mathcal {S}}'\) are neighboring sets.

Using neighboring sets of traces, user-level DP is defined as follows.

Definition 2

[\(\epsilon _2\)-user-level DP] Let \(\epsilon _2 \in \mathbb {R}_{\ge 0}\). A randomized mechanism \({\mathcal {M}}_2\) with domain \({\mathcal {P}}({\mathcal {R}})\) provides \(\epsilon _2\)-user-level DP if for any two neighboring sets \({\mathcal {S}}, {\mathcal {S}}' \subseteq {\mathcal {R}}\) of traces that differ in the entire trace of one user and any \(z \in \textrm{Range}({\mathcal {M}}_2)\),

$$\begin{aligned} \Pr [{\mathcal {M}}_2({\mathcal {S}}) = z] \le e^{\epsilon _2} \Pr [{\mathcal {M}}_2({\mathcal {S}}') = z]. \end{aligned}$$
(2)

By (2), if \(\epsilon _2\) is close to 0, an adversary who obtains the output of \({\mathcal {M}}_2\) cannot determine whether it came from \({\mathcal {S}}\) or \({\mathcal {S}}'\). Thus, the privacy of each user is strongly protected when the privacy budget \(\epsilon _2\) is small.

Note that the neighboring sets \({\mathcal {S}}\) and \({\mathcal {S}}'\) have the same number of users. Thus, user-level DP follows the direction of bounded DP [22] in the same way as node DP in [21].

Total DP  In this work, we use a dataset that includes both the training graph (adjacency matrix) \(\textbf{A}\) and the training traces \({\mathcal {S}}\). Assume that we use an algorithm providing \(\epsilon _1\)-node DP for \(\textbf{A}\) and an algorithm providing \(\epsilon _2\)-user-level DP for \({\mathcal {S}}\). Then, a natural question would be: what is the total privacy guarantee of these algorithms for a single user? To answer this question, we define total DP.

The above dataset can be expressed as a tuple \((\textbf{A}, {\mathcal {S}})\). We consider two neighboring tuples \((\textbf{A}, {\mathcal {S}})\) and \((\textbf{A}', {\mathcal {S}}')\) such that \((\textbf{A}', {\mathcal {S}}')\) is obtained by changing the entire trace and at most \(n-1\) edges of one user, i.e., an arbitrary rewiring of all personal data of one user.

Using the neighboring tuples, we define total DP:

Definition 3

[\(\epsilon \)-total DP] Let \(\epsilon \in \mathbb {R}_{\ge 0}\). A randomized mechanism \({\mathcal {M}}\) with domain \(\{0,1\}^{n \times n} \times {\mathcal {P}}({\mathcal {R}})\) provides \(\epsilon \)-total DP if for any two neighboring tuples \((\textbf{A}, {\mathcal {S}})\) and \((\textbf{A}', {\mathcal {S}}')\) that differ in the entire trace and at most \(n-1\) edges of one user and any \(z \in \textrm{Range}({\mathcal {M}})\),

$$\begin{aligned} \Pr [{\mathcal {M}}(\textbf{A}, {\mathcal {S}}) = z] \le e^{\epsilon } \Pr [{\mathcal {M}}(\textbf{A}', {\mathcal {S}}') = z]. \end{aligned}$$
(3)

\(\epsilon \) is a total privacy budget over the entire dataset \((\textbf{A}, {\mathcal {S}})\). We can answer our question above by using total DP:

Proposition 1

Let \(\epsilon _1, \epsilon _2 \in \mathbb {R}_{\ge 0}\). Assume that randomized mechanisms \({\mathcal {M}}_1\) with domain \(\{0,1\}^{n \times n}\) and \({\mathcal {M}}_2\) with domain \({\mathcal {P}}({\mathcal {R}})\) provide \(\epsilon _1\)-node DP and \(\epsilon _2\)-user-level DP, respectively. In addition, assume that \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) are independently executed. Then, the independent execution of \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) provides \((\epsilon _1 + \epsilon _2)\)-total DP.

Proof

The randomness in \({\mathcal {M}}_1\) is independent of the randomness in \({\mathcal {M}}_2\). Thus, given inputs \(\textbf{A}\in \{0,1\}^{n \times n}\) and \({\mathcal {S}} \subseteq {\mathcal {R}}\), the outputs of \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) are independent of each other. Therefore, for any two neighboring tuples \((\textbf{A}, {\mathcal {S}})\) and \((\textbf{A}', {\mathcal {S}}')\) and for any \(z_1 \in \textrm{Range}({\mathcal {M}}_1)\) and \(z_2 \in \textrm{Range}({\mathcal {M}}_2)\), we have

$$\begin{aligned}&\Pr [({\mathcal {M}}_1(\textbf{A}), {\mathcal {M}}_2({\mathcal {S}})) = (z_1, z_2)] \\&\le e^{\epsilon _1} \Pr [({\mathcal {M}}_1(\textbf{A}'), {\mathcal {M}}_2({\mathcal {S}})) = (z_1, z_2)] ~~(\text {by }{(1)}) \\&\le e^{\epsilon _1 + \epsilon _2} \Pr [({\mathcal {M}}_1(\textbf{A}'), {\mathcal {M}}_2({\mathcal {S}}')) = (z_1, z_2)] ~~(\text {by }{(2)}), \end{aligned}$$

which proves Proposition 1. \(\square \)

Proposition 1 means that if we provide \(\epsilon _1\)-node DP for the training graph and \(\epsilon _2\)-user-level DP for the training traces, then the total privacy budget \(\epsilon \) over the entire dataset \((\textbf{A}, {\mathcal {S}})\) is the sum of \(\epsilon _1\) and \(\epsilon _2\), i.e., \(\epsilon = \epsilon _1 + \epsilon _2\).

Both node DP in [21] and user-level DP follow the direction of bounded DP, as explained above. Thus, total DP also follows the direction of bounded DP.

4 Proposed method

We propose a novel location synthesizer that generates synthetic traces including co-locations of friends. In Sect. 4.1, we describe the overview of our synthesizer. We explain the details of our synthesizer in the remaining subsections. In Sects. 4.2 and 4.3, we explain how to train the friendship probability and the co-location count matrix, respectively. In Sect. 4.4, we explain how to generate synthetic traces based on the friendship probability and the co-location count matrix. In Sect. 4.5, we provide end-to-end privacy analysis of our synthesizer.

4.1 Overview

Fig. 4 Overview of our location synthesizer. We first train the friendship probability \(p'\), the co-location count matrix \(\textbf{Q}{'}\), and an existing location synthesizer [4, 13, 16] based on the Markov chain model. Then, we calculate a synthetic graph based on \(p'\). Finally, we generate co-locations using the synthetic graph and \(\textbf{Q}{'}\) and other locations using the existing location synthesizer

Figure 4 shows the overview of our location synthesizer. The main feature of our synthesizer is that it generates synthetic traces including co-locations. The synthetic traces preserve a friendship probability (i.e., how likely two users will be friends) and a co-location count matrix (i.e., how likely a co-location event will happen at a certain location for each time instant). The two parameters in our synthesizer strongly protect user privacy; the friendship probability provides node DP and the co-location count matrix provides user-level DP.

Our location synthesizer uses a location dataset that includes both location traces and a friendship graph (e.g., Foursquare dataset [50] and Gowalla dataset [25]) as training data. Below, we briefly explain how to train parameters from the training data and how to generate synthetic traces from the parameters.

Training parameters  From a training graph, we first calculate a friendship probability \(p \in [0,1]\), which represents a probability that two users are friends. Then we add the Laplace noise [10] to p to obtain a noisy friendship probability \(p'\) providing node DP.

From training traces, we first calculate a co-location count matrix \(\textbf{Q}\in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\), which comprises a co-location count for each time instant and each location. Specifically, we calculate \(\textbf{Q}\) by simply counting co-locations between friends. Then we add noise to \(\textbf{Q}\) to obtain a noisy co-location count matrix \(\textbf{Q}{'}\) providing user-level DP.

The simplest approach to providing DP for \(\textbf{Q}{'}\) is to add the Laplace noise to each element in \(\textbf{Q}\). We refer to this approach as the Laplace mechanism.

Another approach is to apply Privelet (for one-dimensional nominal data) [49], a DP mechanism based on a wavelet transform, to each row of \(\textbf{Q}\). Privelet applies a nominal wavelet transform to a one-dimensional count vector and adds Laplace noise to each wavelet coefficient, i.e., each node in a tree structure. When a category (or tree structure) of locations is known, Privelet significantly reduces the amount of noise for each category. For example, categories (e.g., “travel & transport” and “shopping”) and subcategories (e.g., “train station”, “airport”, “bookstore”, and “discount store”) of POIs are available in the Foursquare dataset [7] and the Gowalla dataset [25]. Thus, we can use Privelet to provide DP for \(\textbf{Q}{'}\) with a much smaller amount of noise for each POI category.

Methods other than the Laplace mechanism and Privelet include the hierarchical method in [40] and the matrix mechanism in [23, 52]. Specifically, the hierarchical method in [40] finds the optimal branching factor in a tree that minimizes the mean square error of a range query. However, this optimization method cannot be applied to our setting, where a tree structure of locations (i.e., POI categories and subcategories) is given in advance. In addition, the matrix mechanism in [23, 52] is inefficient and provides worse utility than the hierarchical method, as described in [34]. Therefore, we focus on the Laplace mechanism and Privelet and evaluate these two mechanisms in our experiments.

Generating synthetic traces  Based on two parameters \(p'\) and \(\textbf{Q}{'}\), we generate synthetic traces including co-locations. First, we calculate a synthetic graph based on the friendship probability \(p'\). In this paper, we propose to calculate two types of graphs: a graph based on the Erdös-Rényi model (the ER graph) [5] and a graph based on the Barabási-Albert model [1] (the BA graph). Note that both the ER and BA graphs are generated using only a single friendship probability \(p'\) providing node DP. Thus, by the immunity to the post-processing [10], both the ER and BA graphs also provide node DP.

The BA graph is more realistic than the ER graph in that it has a power-law degree distribution. In our experiments, we show that the BA graph preserves statistical properties of the training graph, including the average degree (number of friends) and the degree distribution.

After synthesizing the friendship graph, we generate co-locations of friends at a specific location and a time instant based on the synthetic graph and \(\textbf{Q}{'}\). The generated co-locations preserve the information about co-locations in the training data, e.g., friends tend to meet at a restaurant from 7PM to 8PM. After generating co-locations, we generate other locations using an existing differentially private location synthesizer [4, 13, 16] based on the Markov chain model, which models human movement patterns as a transition matrix.

The existing synthesizers in [4, 13, 16] provide user-level DP for the training traces. Then, by the composition theorem [10], the entire synthetic traces provide node DP for the training graph and user-level DP for the training traces.

Remark  As shown in Fig. 4, we add DP noise to the friendship probability p and the co-location count matrix \(\textbf{Q}\) independently from one another. However, there might be a correlation between p and \(\textbf{Q}\), and it might be possible to add smaller noise by considering the correlation.

For example, assume that a training dataset is collected from students in a class. In this case, many users (students) tend to be friends, and co-locations tend to happen in the school. Suppose we publish \(\textbf{Q}{'}\) that includes a large count in the school. Then, it may suffice to add small noise to p because, given \(\textbf{Q}{'}\), it is highly unlikely that the friendship probability is small. Thus, the amount of noise might be reduced by considering the correlation between p and \(\textbf{Q}\).

We argue that this kind of improvement is extremely challenging in practice because the correlation information itself needs to satisfy DP, e.g., we may need other datasets to obtain differentially private correlation information. Therefore, we treat p and \(\textbf{Q}\) independently and leave the improvement of our algorithm using the correlation for future work.

4.2 Training the friendship probability \(p'\)

Training \(p'\)  Below, we explain how to train the noisy friendship probability \(p' \in [0,1]\) in detail.

We first calculate the friendship probability p as the proportion of edges in the training graph, i.e., the proportion of 1s among the off-diagonal elements of the training adjacency matrix \(\textbf{A}\). For example, we can calculate p as \(p=\frac{14}{6 \times 5} = 0.467\) in Fig. 1. If \(n=1\), then we set \(p=0\) (Footnote 1). After calculating p, we calculate \(p'\) by adding Laplace noise with mean 0 and scale \(\frac{2}{n \epsilon _1}\). For \(b\in \mathbb {R}_{\ge 0}\), let \(\text {Lap}(b)\) be the Laplace noise with mean 0 and scale b. Then we calculate \(p'\) as follows: \(p' = p + \text {Lap}\left( \frac{2}{n \epsilon _1}\right) \).
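A minimal sketch of this step (ours, using NumPy’s Laplace sampler; clipping the output to [0, 1] is our own post-processing choice to match the codomain used below):

```python
import numpy as np

def noisy_friendship_probability(A, epsilon1, rng=None):
    """Compute p' = p + Lap(2 / (n * epsilon1)) from an n x n training adjacency matrix A."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    # Proportion of 1s among the off-diagonal elements (p = 0 if n = 1).
    p = A.sum() / (n * (n - 1)) if n > 1 else 0.0
    p_noisy = p + rng.laplace(loc=0.0, scale=2.0 / (n * epsilon1))
    return float(np.clip(p_noisy, 0.0, 1.0))  # post-processing; does not affect DP

# Example: with the full adjacency matrix of Fig. 1 (p = 14/30) and epsilon1 = 0.2:
# p_prime = noisy_friendship_probability(A_fig1, epsilon1=0.2)   # A_fig1 is hypothetical
```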

DP of \(p'\)  Let \({\mathcal {M}}_1^{ \text {Lap}}: \{0,1\}^{n \times n} \rightarrow [0,1]\) be a randomized mechanism that takes a training adjacency matrix \(\textbf{A}\in \{0,1\}^{n \times n}\) as input and outputs \(p' \in [0,1]\). \({\mathcal {M}}_1^{ \text {Lap}}\) has the following privacy guarantee.

Theorem 1

\({\mathcal {M}}_1^{ \text {Lap}}\) provides \(\epsilon _1\)-node DP.

Proof

Let \(f: \{0,1\}^{n \times n} \rightarrow [0,1]\) be a function that takes a training adjacency matrix \(\textbf{A}\in \{0,1\}^{n \times n}\) as input and outputs the friendship probability \(p \in [0,1]\). Let \(\Delta f\) be the global sensitivity [10] of f given by

$$\begin{aligned} \Delta f = \max _{\textbf{A}\sim \textbf{A}'} |f(\textbf{A}) - f(\textbf{A}')|, \end{aligned}$$
(4)

where \(\textbf{A}\sim \textbf{A}'\) represents that \(\textbf{A}\) and \(\textbf{A}'\) are neighboring matrices that differ in at most \(n-1\) edges of one user.

Below, we upper bound the global sensitivity \(\Delta f\). Let \(d \in \mathbb {Z}_{\ge 0}\) be the number of 1s in \(\textbf{A}\). \(\Delta f\) takes its maximum value when \(n-1\) edges of a user are removed (or added). If \(n \ge 2\), we have

$$\begin{aligned}&|f(\textbf{A}) - f(\textbf{A}')| \nonumber \\&\le \textstyle {\frac{d}{n(n-1)} - \frac{d-2(n-1)}{n(n-1)} \le \frac{2(n-1)}{n(n-1)} = \frac{2}{n}.} \end{aligned}$$
(5)

Note that the denominator of \(f(\textbf{A}')\) is \(n(n-1)\) (rather than \((n-1)(n-2)\)) because we consider a bounded version of node DP [21] that does not remove a node to obtain a neighboring graph, as described in Sect. 3.3 (Footnote 2).

If \(n=1\), then \(|f(\textbf{A}) - f(\textbf{A}')| = |0 - 0| = 0\). Thus, for any \(n \in \mathbb {N}\), \(\Delta f\) in (4) can be upper bounded as \(\Delta f \le \frac{2}{n}\).

Adding the Laplace noise \(\text {Lap}\left( \frac{\Delta f}{\epsilon _1}\right) \) to p provides \(\epsilon _1\)-DP [10]. Therefore, the randomized mechanism \({\mathcal {M}}_1^{ \text {Lap}}\), which adds \(\text {Lap}\left( \frac{2}{n \epsilon _1}\right) \) to p, provides \(\epsilon _1\)-node DP.\(\square \)

4.3 Training the co-location count matrix \(\textbf{Q}{'}\)

Training \(\textbf{Q}{'}\)  Next, we explain how to train the co-location count matrix \(\textbf{Q}{'} \in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) in detail.

We first calculate the co-location count matrix \(\textbf{Q}\in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\), which includes the number of co-locations for each time instant and each location, from the training traces. Here, to upper bound the global sensitivity in DP, we introduce an upper limit \(c\in \mathbb {Z}_{\ge 0}\) on the number of co-locations per user. In other words, once the number of co-locations of a user reaches c, that user’s co-locations are not read anymore. This technique is called trimming in DP [26]. Figure 5 shows an example of training \(\textbf{Q}\) in the case where \(c=3\). In this example, the co-location of users \(u_2\) and \(u_4\) is not read because three co-locations of \(u_2\) have already been read.

Fig. 5 Overview of calculating the co-location count matrix \(\textbf{Q}\)

After calculating \(\textbf{Q}\), we add noise to \(\textbf{Q}\) to obtain \(\textbf{Q}{'}\). To add noise, we use the Laplace mechanism or apply Privelet (for one-dimensional nominal data) [49] to each row of \(\textbf{Q}\). The Laplace mechanism simply adds \(\text {Lap}(\frac{c}{\epsilon _2})\) to each element of \(\textbf{Q}\). Privelet applies the nominal wavelet transform to the tree structure of locations and then adds Laplace noise to the wavelet coefficient of each node in the tree.

For more details of the algorithm of Privelet, see Appendix 1.
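A sketch of the Laplace-mechanism variant is given below (ours, not the authors’ implementation; the exact reading order during trimming may differ, and the Privelet variant additionally performs the wavelet transform of Appendix 1):

```python
import numpy as np
from collections import defaultdict

def colocation_count_matrix(traces, friend_pairs, times, locations, c):
    """Count co-locations between friends, reading at most c co-locations per user
    (trimming), and return the |T| x |X| count matrix Q."""
    t_idx = {t: i for i, t in enumerate(times)}
    x_idx = {x: i for i, x in enumerate(locations)}
    Q = np.zeros((len(times), len(locations)), dtype=int)
    events = {user: set(evs) for user, evs in traces}   # events per user
    budget = defaultdict(lambda: c)                     # remaining co-locations per user
    for u, v in friend_pairs:
        for x, t in events.get(u, set()) & events.get(v, set()):
            if budget[u] > 0 and budget[v] > 0:
                budget[u] -= 1
                budget[v] -= 1
                Q[t_idx[t], x_idx[x]] += 1
    return Q

def laplace_mechanism(Q, c, epsilon2, rng=None):
    """Add Lap(c / epsilon2) noise to every element of Q."""
    rng = rng or np.random.default_rng()
    return Q + rng.laplace(scale=c / epsilon2, size=Q.shape)
```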

DP of \(\textbf{Q}{'}\)  Let \({\mathcal {M}}_2^{ \text {Lap}}: {\mathcal {P}}({\mathcal {R}}) \rightarrow \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) be the Laplace mechanism, which takes training traces \({\mathcal {S}} \subseteq {\mathcal {R}}\) as input and outputs \(\textbf{Q}{'} \in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) by adding \(\text {Lap}(\frac{c}{\epsilon _2})\) to each element of \(\textbf{Q}\). \({\mathcal {M}}_2^{ \text {Lap}}\) has the following privacy guarantee.

Theorem 2

\({\mathcal {M}}_2^{ \text {Lap}}\) provides \(\epsilon _2\)-user-level DP.

Proof

By the trimming, we read at most c co-locations per user from \({\mathcal {S}}\). Thus, changing the entire trace of one user in \({\mathcal {S}}\) will change each element of \(\textbf{Q}\) by at most c. Therefore, the global sensitivity of the co-location count in each element of \(\textbf{Q}\) is at most c. Since \({\mathcal {M}}_2^{ \text {Lap}}\) adds \(\text {Lap}(\frac{c}{\epsilon _2})\) to each element of \(\textbf{Q}\), it provides \(\epsilon _2\)-user-level DP.\(\square \)

Let \({\mathcal {M}}_2^{\text {Privelet}}: {\mathcal {P}}({\mathcal {R}}) \rightarrow \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) be Privelet. As with the Laplace mechanism, \({\mathcal {M}}_2^{\text {Privelet}}\) adds the Laplace noise based on the global sensitivity. Thus, \({\mathcal {M}}_2^{\text {Privelet}}\) has the following privacy guarantee:

Theorem 3

\({\mathcal {M}}_2^{ \text {Privelet}}\) provides \(\epsilon _2\)-user-level DP.

See Appendix 1 for the proof.

4.4 Generating synthetic traces

Figure 6 shows the generation of synthetic traces using our location synthesizer. After training the friendship probability \(p'\) and the co-location count matrix \(\textbf{Q}{'}\), our synthesizer generates a synthetic trace for each of n users as follows.

Fig. 6 Overview of generating synthetic traces in our proposed method

Fig. 7 Example of complementing locations using the synthesizer in [4]. We train a transition matrix providing user-level DP from training traces. Then we complement locations using the Viterbi algorithm, which finds the most likely sequence of locations, i.e., Viterbi path

Algorithm 1 Generating synthetic traces in our proposed method. Here, we represent synthetic traces \({\mathcal {S}}_{\textrm{syn}}\) as a set of triplets \((u_i,x_k,t_l)\) of user \(u_i\), location \(x_k\), and time instant \(t_l\) for ease of presentation.

  1. Generate a synthetic graph \(G'\) (the ER or BA graph) with n nodes from \(p'\). We explain how to generate the ER graph and the BA graph in detail at the end of Sect. 4.4.

  2. Calculate a co-location probability matrix \(\textbf{R}' \in [0,1]^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) from \(\textbf{Q}{'}\). Specifically, we first add the absolute value of the minimum value in \(\textbf{Q}{'}\) to all elements so that no element of \(\textbf{Q}{'}\) (and hence \(\textbf{R}'\)) is negative. Then we obtain \(\textbf{R}'\) by normalizing each row so that it sums to 1.

  3. Synthesize \(\theta \in \mathbb {N}\) co-locations of friends from \(G'\) and \(\textbf{R}'\). Specifically, we iterate the following three steps until we obtain \(\theta \) co-locations: (i) randomly select a pair of friends from \(G'\); (ii) randomly select a time instant from \({\mathcal {T}}\); (iii) randomly generate a co-location at the selected time instant using the corresponding row of \(\textbf{R}'\). In step (iii), if one of the two users already has a co-location at the selected time instant, we reuse that location for consistency with the previously generated co-location (see the sketch after this list).

  4. Synthesize the other locations in the n synthetic traces using a transition matrix of the existing DP location synthesizer [4, 13, 16] based on the Markov chain model. Specifically, we complement the remaining locations using the Viterbi algorithm [42].

    Fig. 7 shows an example of complementing the remaining locations using the synthesizer in [4]. In the case of location traces, the synthesizer in [4] trains a transition probability matrix \(\textbf{Z}\in [0,1]^{|{\mathcal {X}}| \times |{\mathcal {X}}|}\) common to all users from training traces and adds the Laplace noise to each element to provide \(\epsilon _3\)-user-level DP (see [29] for details). Based on the trained matrix, we complement the remaining locations using the Viterbi algorithm [42], which finds the most likely sequence of locations, i.e., Viterbi path. In our experiments, we use the synthesizer in [4] to complement the remaining locations.
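The following sketch (ours) illustrates steps (2) and (3): shifting and normalizing \(\textbf{Q}{'}\) into \(\textbf{R}'\), and then sampling \(\theta \) co-locations from a synthetic graph \(G'\) (here a networkx-style graph whose edges are the synthetic friend pairs); the uniform fallback for all-zero rows is our own safeguard.

```python
import numpy as np

def to_probability_matrix(Q_noisy):
    """Shift Q' so that all elements are non-negative, then normalize each row to sum to 1."""
    R = Q_noisy + abs(min(Q_noisy.min(), 0.0))
    row_sums = R.sum(axis=1, keepdims=True)
    R = np.where(row_sums > 0, R, 1.0)          # uniform fallback for all-zero rows
    return R / R.sum(axis=1, keepdims=True)

def synthesize_colocations(G, R, theta, rng=None):
    """Sample theta co-locations: pick a friend pair, a time instant, then a location."""
    rng = rng or np.random.default_rng()
    edges = list(G.edges())
    n_times, n_locs = R.shape
    placed = {}                                  # (user, time) -> location, for consistency
    colocations = []
    while len(colocations) < theta:
        u, v = edges[rng.integers(len(edges))]   # (i) random pair of friends
        t = rng.integers(n_times)                # (ii) random time instant
        x = placed.get((u, t), placed.get((v, t)))
        if x is None:
            x = rng.choice(n_locs, p=R[t])       # (iii) location drawn from row t of R'
        placed[(u, t)] = placed[(v, t)] = x
        colocations.append((u, v, int(x), t))
    return colocations
```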

The number \(\theta \) of co-locations is a parameter of our location synthesizer. In our experiments, we set \(\theta \) to various values. It is also possible to calculate the frequency of co-locations with DP noise from the training traces and set \(\theta \) based on the noisy co-location frequency.

Algorithm 1 shows the proposed algorithm when we use the synthesizer in [4] to complement locations other than co-locations. Line 1, lines 2 to 7, lines 8 to 22, and line 23 correspond to steps (1), (2), (3), and (4), respectively, in Fig. 6. In Appendix 1, we explain the ViterbiAlgorithm function (line 23) in detail. The GenerateGraph function (line 1) takes the number n of users and the noisy friendship probability \(p'\) as input and outputs a synthetic graph \(G'\). The ER model (with parameters n and \(p'\)) and the BA model (with parameters n and \(\frac{p'(n-1)}{2}\)) satisfy this requirement (Footnote 3), as explained below.

Generating the ER graph  The Erdös-Rényi (ER) model [5] has two parameters \(n \in \mathbb {N}\) and \(q \in [0,1]\) and is denoted by \({\mathcal {G}}(n, q)\). The ER model \({\mathcal {G}}(n, q)\) is a simple graph generation model that randomly and independently generates each edge between n nodes with probability q. Since the friendship probability \(p'\) represents the probability that two users are friends, we set \(q=p'\). In other words, we generate a synthetic graph \(G'\) based on \({\mathcal {G}}(n, p')\). Note that \(G'\) is generated using only a single friendship probability \(p'\).

Recall that we calculate \(p'\) by \(p' = p + \text {Lap}(\frac{2}{n \epsilon _1})\). Since the expectation of \(\text {Lap}(\frac{2}{n \epsilon _1})\) is 0, \(p'\) is an unbiased estimate of p. Thus, the expected number of edges in \(G'\) is equal to the number of edges in the training graph. In other words, our ER graph preserves the average degree (number of friends) of the training graph.

However, the ER model does not have a power-law degree distribution [1]. Therefore, our ER graph does not reflect an actual graph property well.

Generating the BA graph  The Barabási-Albert (BA) model [1] is a graph generation model that has a power-law degree distribution. It has two parameters \(n \in \mathbb {N}\) and \(\lambda \in \mathbb {Z}_{\ge 0}\) and is denoted by \({\mathcal {B}}(n, \lambda )\). The BA model \({\mathcal {B}}(n, \lambda )\) generates a graph with n nodes by sequentially attaching new nodes so that each new node is connected to \(\lambda \) existing nodes. An edge is connected to an existing node with probability proportional to its degree.

The BA graph with parameter \(\lambda \) has about \(n\lambda \) edges. In contrast, the training graph has \(\frac{pn(n-1)}{2}\) edges. These numbers coincide when \(\lambda = \frac{p(n-1)}{2}\). Therefore, we set \(\lambda =\frac{p'(n-1)}{2}\) (rounded to an integer) and synthesize \(G'\) based on the BA model \({\mathcal {B}}(n, \frac{p'(n-1)}{2})\). Again, note that \(G'\) is generated using only a single friendship probability \(p'\).

Since \(p'\) is an unbiased estimate of p, our BA graph preserves the average degree of the training graph. In addition, our BA graph has a power-law degree distribution. In our experiments, we show how well our BA graph preserves these statistical properties of the training graph.
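A sketch of the GenerateGraph function (ours), using the networkx implementations of the two models; the clipping of \(p'\) and the lower bound of 1 on \(\lambda \) are our own safeguards:

```python
import networkx as nx

def generate_graph(n, p_noisy, model="BA", seed=None):
    """Generate a synthetic friendship graph with n nodes from the noisy
    friendship probability p', using the ER or BA model."""
    p_noisy = min(max(p_noisy, 0.0), 1.0)       # p' may fall outside [0, 1] after adding noise
    if model == "ER":
        return nx.gnp_random_graph(n, p_noisy, seed=seed)     # G(n, p')
    lam = max(1, round(p_noisy * (n - 1) / 2))                # lambda = p'(n-1)/2, rounded
    return nx.barabasi_albert_graph(n, lam, seed=seed)        # B(n, lambda)
```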

DP of the ER and BA graphs  Let \({\mathcal {M}}_1^{\textrm{ER}}\) (resp. \({\mathcal {M}}_1^{\textrm{BA}}\))\(: \{0,1\}^{n \times n} \rightarrow \{0,1\}^{n \times n}\) be a randomized mechanism that takes a training adjacency matrix \(\textbf{A}\) as input and outputs an adjacency matrix of the ER (resp. BA) graph \(G'\). Then, we have the following privacy guarantees.

Theorem 4

Both \({\mathcal {M}}_1^{\textrm{ER}}\) and \({\mathcal {M}}_1^{\textrm{BA}}\) provide \(\epsilon _1\)-node DP.

Proof

By Theorem 1, the randomized mechanism \({\mathcal {M}}_1^{ \text {Lap}}\) that takes \(\textbf{A}\) as input and outputs \(p'\) provides \(\epsilon _1\)-node DP. In addition, both the ER and BA graphs are generated using only a single friendship probability \(p'\), as explained above. Thus, by the immunity to the post-processing [10], both \({\mathcal {M}}_1^{\textrm{ER}}\) and \({\mathcal {M}}_1^{\textrm{BA}}\) provide \(\epsilon _1\)-node DP. \(\square \)

Scalability  As shown in Fig. 6, our proposed method consists of the following steps: (1) synthesize a friendship graph, (2) normalize counts to probabilities, (3) synthesize co-locations, and (4) synthesize the other locations. The time complexities of steps (1), (2), (3), and (4) are \(O(n^2)\), \(O(|{\mathcal {X}}|^2)\), \(O(\theta (|G'|+|{\mathcal {T}}|+|{\mathcal {X}}|))\), and \(O(n|{\mathcal {T}}||{\mathcal {X}}|^2)\), respectively, where \(G'\) is a synthetic graph. In addition, the time complexities of calculating \(p'\) and \(\textbf{Q}{'}\) from the training dataset are O(|G|) and \(O(n|{\mathcal {T}}|+|{\mathcal {X}}|^2)\), respectively, where G is a training graph. Note that most training graphs are sparse, and \(|G|, |G'| \ll n^2\) in that case. Thus, the time complexity of our proposed method can be expressed as \(O(n^2+\theta (|G'|+|{\mathcal {T}}|+|{\mathcal {X}}|)+n|{\mathcal {T}}||{\mathcal {X}}|^2)\) in total. The factor of \(n|{\mathcal {T}}||{\mathcal {X}}|^2\) comes from the Viterbi algorithm in step (4). Although the Viterbi algorithm is efficient, the run time might still be large when n and \(|{\mathcal {X}}|\) are extremely large. We can improve the run time by, e.g., a parallel implementation [11].

4.5 End-to-end privacy analysis

Below, we provide end-to-end privacy analysis of our location synthesizer. Let \({\mathcal {S}}_* \subseteq {\mathcal {R}}\) be our synthetic traces. Let \({\mathcal {M}}_*: \{0,1\}^{n \times n} \times {\mathcal {P}}({\mathcal {R}}) \rightarrow {\mathcal {P}}({\mathcal {R}})\) be our location synthesizer that takes the dataset \((\textbf{A}, {\mathcal {S}})\) as input and outputs \({\mathcal {S}}_*\). In addition, let \({\mathcal {M}}_3\) be a training algorithm that takes \({\mathcal {S}}\) as input and outputs a generative model (transition matrix) of the existing synthesizer providing \(\epsilon _3\)-user-level DP [4, 13, 16].

Our synthesizer \({\mathcal {M}}_*\) uses \({\mathcal {M}}_1^{\textrm{ER}}\) or \({\mathcal {M}}_1^{\textrm{BA}}\) to generate a synthetic graph \(G'\) from \(\textbf{A}\). Then, \({\mathcal {M}}_*\) uses \({\mathcal {M}}_2^{ \text {Lap}}\) or \({\mathcal {M}}_2^{ \text {Privelet}}\) to output a co-location count matrix \(\textbf{Q}{'}\) from \({\mathcal {S}}\). Finally, \({\mathcal {M}}_*\) generates synthetic traces using \(G'\), \(\textbf{Q}{'}\), and a transition matrix output by \({\mathcal {M}}_3\).

Then, we have the following privacy guarantees:

Theorem 5

The composition \(({\mathcal {M}}_2^{ \text {Lap}}, {\mathcal {M}}_3)\) or \(({\mathcal {M}}_2^{ \text {Privelet}}, {\mathcal {M}}_3)\) provides \((\epsilon _2 + \epsilon _3)\)-user-level DP.

Proof

By Theorems 2 and 3, both \({\mathcal {M}}_2^{ \text {Lap}}\) and \({\mathcal {M}}_2^{ \text {Privelet}}\) provide \(\epsilon _2\)-user-level DP. In addition, \({\mathcal {M}}_3\) provides \(\epsilon _3\)-user-level DP. Thus, by the composition theorem [10], \(({\mathcal {M}}_2^{ \text {Lap}}, {\mathcal {M}}_3)\) or \(({\mathcal {M}}_2^{ \text {Privelet}}, {\mathcal {M}}_3)\) provides \((\epsilon _2 + \epsilon _3)\)-user-level DP. \(\square \)

Theorem 6

Our location synthesizer \({\mathcal {M}}_*\) provides \((\epsilon _1 + \epsilon _2 + \epsilon _3)\)-total DP.

Proof

As explained above, our synthesizer \({\mathcal {M}}_*\) generates synthetic traces using the outputs of the three mechanisms: (i) \({\mathcal {M}}_1^{\textrm{ER}}\) or \({\mathcal {M}}_1^{\textrm{BA}}\), (ii) \({\mathcal {M}}_2^{ \text {Lap}}\) or \({\mathcal {M}}_2^{ \text {Privelet}}\), and (iii) \({\mathcal {M}}_3\). The first mechanism (i) provides \(\epsilon _1\)-node DP (Theorem 4). The composition of the second and third mechanisms (ii) and (iii) provides \((\epsilon _2 + \epsilon _3)\)-user-level DP (Theorem 5). Then, by Proposition 1, the composition of the three mechanisms (i), (ii), and (iii) provides \((\epsilon _1 + \epsilon _2 + \epsilon _3)\)-total DP.

Our synthesizer \({\mathcal {M}}_*\) generates synthetic traces based on the post-processing on the outputs of (i), (ii), and (iii). Thus, by the immunity to the post-processing [10], \({\mathcal {M}}_*\) also provides \((\epsilon _1 + \epsilon _2 + \epsilon _3)\)-total DP. \(\square \)

Theorem 6 guarantees the overall privacy of our location synthesizer. Note that the existing synthesizer [4, 13, 16] provides \(\epsilon _3\)-user-level DP and hence \(\epsilon _3\)-total DP. As shown in Table 1, our synthesizer uses additional privacy budgets \(\epsilon _1\) and \(\epsilon _2\) to incorporate new information (i.e., co-locations) into synthetic traces. In Sect. 5, we show that \(\epsilon _1\) and \(\epsilon _2\) can be small, e.g., \(\epsilon _1 = 0.2\) and \(\epsilon _2 = 1\).

5 Experimental evaluation

We evaluated our location synthesizer to show its effectiveness. In Sect. 5.1, we explain datasets used in our experiments. In Sect. 5.2, we describe utility metrics. In Sect. 5.3, we explain location synthesizers evaluated in our experiments. In Sect. 5.4, we report experimental results for parameters in our location synthesizer. In Sect. 5.5, we report results of comparison experiments. In Sect. 5.6, we summarize the experimental results.

5.1 Datasets

In our experiments, we used the Foursquare dataset [50] and the Gowalla dataset [25] (denoted by Foursquare and Gowalla, respectively). Both datasets include the users’ friendship data (i.e., training graph) on SNS and categories/sub-categories of POIs [7]. For our experiments, we used the Tokyo check-in data in each dataset. The Foursquare dataset contained 916,136 check-ins, 8357 users, and 83,647 POIs in Tokyo. The Gowalla dataset contained 184,354 check-ins, 2434 users, and 17,866 POIs in Tokyo. We set the length of a time instant to one hour and extracted two temporally-continuous location events from the dataset (\(|{\mathcal {T}}|=24\)).

In both datasets, check-ins are concentrated in a small number of POIs. Thus, the matrix \(\textbf{Q}\) becomes extremely sparse when using all POIs. Hence, we used check-in data for the 100 POIs with the largest check-in counts. In this case, \(|{\mathcal {X}}|=100\), and the number n of users was \(n=8357\) in the Foursquare dataset and \(n=1463\) in the Gowalla dataset.

The categories and sub-categories of POIs in each dataset are shown in Tables 3 and 4. The number m of categories was 4 in both datasets. The total number of co-location events in the traces is 2012 (resp. 51) in Foursquare (resp. Gowalla). Gowalla has far fewer co-location events than Foursquare.

Table 3 POI categories and sub-categories (Foursquare)
Table 4 POI categories and sub-categories (Gowalla)

5.2 Utility metrics

Co-locations  First, we evaluated the utility of our two parameters – the friendship probability \(p'\) and the co-location count matrix \(\textbf{Q}{'}\) – to quantitatively show how our location synthesizer preserves the information about co-locations. For \(p'\), we evaluated the absolute error \(|p - p'|\) between p and \(p'\) as a utility metric. We denote the absolute error of \(p'\) by \(\hbox {AE}_p\).

For \(\textbf{Q}{'}\), co-location counts for each POI category and each time instant (e.g., “travel & transport” from 7AM to 9AM) are particularly important. Thus, we evaluated the utility for each POI category and each time instant. Specifically, let \(\textbf{Q}^* \in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times |{\mathcal {X}}|}\) be a co-location count matrix before adding noise when we do not perform trimming. \(\textbf{Q}^*\) is identical to \(\textbf{Q}\) when \(c=\infty \). We calculated a per-category co-location count matrix \(\overline{\textbf{Q}}^* \in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times m}\) (\(|{\mathcal {T}}|=24\), \(m=4\)), which is composed of counts for each time instant and each POI category, by summing up counts in \(\textbf{Q}^*\) for each POI category. \(\overline{\textbf{Q}}^*\) is obtained by counting co-locations for each POI category and each time instant in the training traces. Similarly, we calculated a per-category co-location count matrix \(\overline{\textbf{Q}}' \in \mathbb {Z}_{\ge 0}^{|{\mathcal {T}}| \times m}\) by summing up counts in \(\textbf{Q}{'}\) for each POI category.

Then we evaluated the mean absolute error (MAE) and the mean square error (MSE) between \(\overline{\textbf{Q}}^*\) and \(\overline{\textbf{Q}}'\). The MAE is given by \(\frac{1}{|{\mathcal {T}}|m} \sum _{i=1}^{|{\mathcal {T}}|}\sum _{j=1}^{m} |\overline{\textbf{Q}}^*_{ij}-\overline{\textbf{Q}}'_{ij} |\). The MSE is given by \(\frac{1}{|{\mathcal {T}}|m} \sum _{i=1}^{|{\mathcal {T}}|}\sum _{j=1}^{m} (\overline{\textbf{Q}}^*_{ij}-\overline{\textbf{Q}}'_{ij})^2\). Note that the difference between \(\overline{\textbf{Q}}^*\) and \(\overline{\textbf{Q}}'\) can be caused by two factors: trimming and DP noise. We denote the MAE and MSE of \(\overline{\textbf{Q}}'\) by \(\hbox {MAE}_\textbf{Q}\) and \(\hbox {MSE}_\textbf{Q}\), respectively.

Note that our location synthesizer normalizes counts in \(\textbf{Q}{'}\) to probabilities and synthesizes co-locations based on the co-location probability matrix \(\textbf{R}'\). Therefore, we also normalized counts in \(\overline{\textbf{Q}}^*\) and \(\overline{\textbf{Q}}'\) to probabilities and evaluated the difference between them. Specifically, let \(\overline{\textbf{R}}^*\) (resp. \(\overline{\textbf{R}}'\)) \(\in [0,1]^{|{\mathcal {T}}| \times m}\) be a per-category co-location probability matrix such that \(\overline{\textbf{R}}^*_{ij} = \frac{\overline{\textbf{Q}}^*_{ij}}{\sum _{i=1}^{|{\mathcal {T}}|}\sum _{j=1}^{m}\overline{\textbf{Q}}^*_{ij}}\) \(\left( \textrm{resp}. \overline{\textbf{R}}'_{ij} = \frac{\overline{\textbf{Q}}'_{ij}}{\sum _{i=1}^{|{\mathcal {T}}|}\sum _{j=1}^{m}\overline{\textbf{Q}}'_{ij}}\right) \). In both \(\overline{\textbf{R}}^*\) and \(\overline{\textbf{R}}'\), the sum of all elements is 1. We evaluated the MAE and MSE between \(\overline{\textbf{R}}^*\) and \(\overline{\textbf{R}}'\).

The KL divergence and the JS divergence are popular measures of the distance between two probability distributions. Because both \(\overline{\textbf{R}}^*\) and \(\overline{\textbf{R}}'\) are probability distributions, we also evaluated the JS divergence between them. We did not evaluate the KL divergence between \(\overline{\textbf{R}}^*\) and \(\overline{\textbf{R}}'\), because the number of co-locations was too small and the KL divergence can be infinite in this case (e.g., when one distribution assigns zero probability to an event that has positive probability in the other).

Formally, for \(d \in \mathbb {N}\), let \(\Delta _d\) be the d-probability simplex. Let \(\textbf{z}, \textbf{z}^{\prime } \in \Delta _d\) be the two probability distributions. Then, the KL divergence is given by \(D_{KL}(\textbf{z}|| \textbf{z}') = \sum _{i=1}^d \textbf{z}_i \log \frac{\textbf{z}_i}{\textbf{z}'_i}\), where \(\textbf{z}_i\) and \(\textbf{z}'_i\) are the i-th elements of \(\textbf{z}\) and \(\textbf{z}'\), respectively. The JS divergence is given by \(D_{JS}(\textbf{z}|| \textbf{z}') = \frac{1}{2} D_{KL}(\textbf{z}|| \textbf{m}) + \frac{1}{2} D_{KL}(\textbf{z}' || \textbf{m})\), where \(\textbf{m}= \frac{1}{2} (\textbf{z}+ \textbf{z}')\). We denote the MAE, MSE, and JS of \(\overline{\textbf{R}}'\) by \(\hbox {MAE}_\textbf{R}\), \(\hbox {MSE}_\textbf{R}\), and \(\hbox {JS}_\textbf{R}\), respectively.
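
As a sketch, the per-category matrices can be normalized and compared with the JS divergence as follows; flattening the matrices into vectors before computing the divergence is our own convention, and the function names are hypothetical.

```python
import numpy as np

def normalize(Q_bar):
    """Normalize a per-category count matrix so that all elements sum to 1."""
    return Q_bar / Q_bar.sum()

def kl_divergence(z, z2):
    """KL divergence D_KL(z || z2), using the convention 0 * log 0 = 0."""
    mask = z > 0
    return float(np.sum(z[mask] * np.log(z[mask] / z2[mask])))

def js_divergence(z, z2):
    """JS divergence; the mixture m has positive mass wherever z or z2 does,
    so both KL terms are always finite."""
    m = 0.5 * (z + z2)
    return 0.5 * kl_divergence(z, m) + 0.5 * kl_divergence(z2, m)

# Example (hypothetical variable names):
# js_divergence(normalize(Q_bar_star).ravel(), normalize(Q_bar_prime).ravel())
```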

Furthermore, we selected top-10 co-location events (i.e., ten pairs of time instants and POI categories) whose counts in the training traces are the largest, i.e., top-10 elements in \(\overline{\textbf{Q}}^*\). Then, we visualized the values (counts or probabilities) of the top-10 events in \(\overline{\textbf{Q}}^*\), \(\overline{\textbf{Q}}'\), \(\overline{\textbf{R}}^*\), and \(\overline{\textbf{R}}'\). Formally, let \(\Omega \subset [|{\mathcal {T}}|] \times [m]\) (\(|\Omega |=10\), \(|{\mathcal {T}}|=24\), \(m=4\)) be the set of the top-10 events. We visualized the values of \(\overline{\textbf{Q}}^*_{ij}\), \(\overline{\textbf{Q}}'_{ij}\), \(\overline{\textbf{R}}^*_{ij}\), and \(\overline{\textbf{R}}'_{ij}\) for \((i,j) \in \Omega \).

Our location synthesizer normalizes \(\textbf{Q}{'}\) to \(\textbf{R}'\) and randomly generates co-locations in synthetic traces based on \(p'\) and \(\textbf{R}'\). Therefore, if the absolute error of \(p'\) is small, our synthetic traces preserve the information about how likely two users will be friends. If the MAE, MSE, and JS of \(\overline{\textbf{R}}'\) are small, our synthetic traces preserve the information about how likely a co-location of friends will happen at a certain POI category for each time instant, e.g., “travel & transport” from 7 to 9AM.

It is shown in [50] that there is a correlation between co-locations and friendships on Twitter. Thus, if the MAE, MSE, and JS of \(\overline{\textbf{R}}'\) are small, then a location-based friend suggestion algorithm developed based on the synthetic location data would also be useful for real location data.

Other statistical features  Next, we evaluated how well our synthetic traces preserve other statistical features of the training traces. Specifically, we calculated two basic statistical features: the population distribution and the transition probability matrix. The population distribution (an \(|{\mathcal {X}}|\)-dimensional probability vector) is a key feature for finding popular POIs [54]. The transition probability matrix (an \(|{\mathcal {X}}| \times |{\mathcal {X}}|\) matrix) is a key feature for modeling user movement patterns [47]. We calculated these statistical features from both the training traces and the synthetic traces.

Formally, let \(\textbf{r}\) (resp. \(\textbf{r}'\)) \(\in \Delta _{|{\mathcal {X}}|}\) be a population distribution calculated from the training trace (resp. synthetic traces). For example, \(\textbf{r}= \left( \frac{5}{24}, \frac{1}{4}, \frac{1}{6}, \frac{1}{4}, \frac{1}{8}\right) \) in Fig. 2. In the transition probability matrix, each row of the matrix represents a probability distribution. Let \(\textbf{M}\) (resp. \(\textbf{M}'\)) \(\in [0,1]^{|{\mathcal {X}}|\times |{\mathcal {X}}|}\) be the transition probability matrix calculated from the training traces (resp. synthetic traces). For \(i \in |{\mathcal {X}}|\), let \(\textbf{M}_i\) (resp. \(\textbf{M}'_i\)) \(\in \Delta _{|{\mathcal {X}}|}\) be the i-th row of \(\textbf{M}\) (resp. \(\textbf{M}'\)). For example, \(\textbf{M}_1 = (\frac{1}{4}, \frac{1}{2}, 0, \frac{1}{4}, 0)\) in Fig. 2.
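
A minimal sketch of how these two statistics can be computed from a set of traces is shown below; representing each trace as a list of location IDs and assigning a uniform row to locations that never appear as a source are our own assumptions.

```python
import numpy as np

def population_distribution(traces, num_locations):
    """Population distribution r: the fraction of visits to each location,
    where traces is a list of location-ID sequences (one per user)."""
    counts = np.zeros(num_locations)
    for trace in traces:
        for x in trace:
            counts[x] += 1
    return counts / counts.sum()

def transition_matrix(traces, num_locations):
    """Transition probability matrix M: M[a, b] is the empirical probability
    of moving from location a to location b."""
    counts = np.zeros((num_locations, num_locations))
    for trace in traces:
        for a, b in zip(trace[:-1], trace[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / num_locations)
    return np.where(row_sums > 0, counts / np.clip(row_sums, 1, None), uniform)
```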

For each statistical feature, we evaluated the distance between the synthetic traces and the training traces. We adopted the MAE, MSE, KL divergence, and JS divergence as distance measures. Here, we evaluated the KL divergence because the number of locations is large (unlike co-locations).

For the transition probability matrix, each row represents a probability distribution. Thus, we evaluated the weighted average of the KL/JS divergence, where we used a stationary distribution calculated from the matrix as a weight vector. We denote the MAE/MSE/KL/JS of \(\textbf{r}'\) (resp. \(\textbf{M}'\)) by \(\hbox {MAE}_\textbf{r}\)/\(\hbox {MSE}_\textbf{r}\)/\(\hbox {KL}_\textbf{r}\)/\(\hbox {JS}_\textbf{r}\) (resp. \(\hbox {MAE}_\textbf{M}\)/\(\hbox {MSE}_\textbf{M}\)/\(\hbox {KL}_\textbf{M}\)/\(\hbox {JS}_\textbf{M}\)).
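
The weighted row-wise divergence can be sketched as follows; approximating the stationary distribution by power iteration and taking the training matrix \(\textbf{M}\) as the source of the weights are our reading of the description above.

```python
import numpy as np

def stationary_distribution(M, num_iters=1000):
    """Approximate the stationary distribution of a row-stochastic matrix M
    by power iteration."""
    pi = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(num_iters):
        pi = pi @ M
    return pi / pi.sum()

def weighted_row_divergence(M, M2, divergence):
    """Weighted average of a row-wise divergence (e.g., js_divergence from the
    earlier sketch), weighted by the stationary distribution of M."""
    pi = stationary_distribution(M)
    return float(sum(pi[i] * divergence(M[i], M2[i]) for i in range(M.shape[0])))
```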

Downstream tasks, such as finding popular POIs [54] and predicting the next POI [47], are based on the population distribution and the transition matrix. For example, popular POIs can be obtained by selecting locations whose values in the population distribution are the largest. Given a specific POI, the next POI can be predicted by selecting a POI whose probability in the transition matrix is the largest. Thus, if the distance measures (MAE, MSE, KL, and JS) of the population distribution (resp.  transition matrix) are small, then the synthetic data would be useful for finding popular POIs (resp. predicting the next POI).
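
These selections reduce to simple argmax operations, as the following illustrative snippet shows (the function names are ours).

```python
import numpy as np

def popular_pois(r, k=10):
    """Indices of the k locations with the largest population probabilities."""
    return np.argsort(r)[::-1][:k]

def predict_next_poi(M, current):
    """Most likely next location from `current` under transition matrix M."""
    return int(np.argmax(M[current]))
```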

Table 5 summarizes the utility metrics and their notations in our experiments.

Table 5 Utility metrics in our experiments

5.3 Location synthesizers

In our experiments, we evaluated three location synthesizers for comparison. The first synthesizer independently and randomly generates a location at each time instant from a uniform distribution. We call this simple method Uniform.

The second synthesizer is the synthetic data generator proposed in [4]. This synthesizer can be applied to any data, including location traces [29]. Following [29], we applied this synthesizer to location traces as follows. First, we trained a transition probability matrix (\(|{\mathcal {X}}| \times |{\mathcal {X}}|\) matrix) common to all users from the training traces and added the Laplace noise \(\text {Lap}(\frac{c}{\epsilon _3})\) to each element, which provides \(\epsilon _3\)-user-level DP. We then randomly generated the first location based on the stationary distribution and generated the remaining locations using the noisy transition matrix. Because this method is designed on the basis of the transition probability matrix, we call it TPM.
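
The following Python sketch summarizes the TPM baseline as we use it; clipping negative entries to zero and re-normalizing each row after adding the Laplace noise, as well as the fallback to a uniform row, are our own post-processing assumptions.

```python
import numpy as np

def tpm_synthesize(M_train, num_users, trace_len, epsilon3, c, seed=0):
    """TPM baseline sketch: perturb the trained transition probability matrix
    M_train with Laplace noise Lap(c / epsilon3), then sample one trace per
    user, starting from the stationary distribution."""
    rng = np.random.default_rng(seed)
    noisy = M_train + rng.laplace(scale=c / epsilon3, size=M_train.shape)
    noisy = np.clip(noisy, 0.0, None)                       # our post-processing
    row_sums = noisy.sum(axis=1, keepdims=True)
    uniform = np.full_like(noisy, 1.0 / noisy.shape[1])
    M = np.where(row_sums > 0, noisy / np.clip(row_sums, 1e-12, None), uniform)
    pi = np.full(M.shape[0], 1.0 / M.shape[0])              # power iteration for
    for _ in range(1000):                                   # the stationary
        pi = pi @ M                                         # distribution
    pi = pi / pi.sum()
    traces = []
    for _ in range(num_users):
        x = rng.choice(len(pi), p=pi)
        trace = [int(x)]
        for _ in range(trace_len - 1):
            x = rng.choice(len(pi), p=M[x])
            trace.append(int(x))
        traces.append(trace)
    return traces
```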

The third synthesizer is our proposed synthesizer. We call it Proposal. In Proposal, we trained \(p^{\prime }\) from a training graph by adding the Laplace noise, and \(\textbf{Q}^{\prime }\) from training traces using the Laplace mechanism or Privelet. Then we generated \(\theta \) co-locations using \(p^{\prime }\) and \(\textbf{Q}^{\prime }\). Finally, we generated the remaining locations using TPM, which provides \(\epsilon _3\)-user-level DP as explained above.
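
To make the co-location step concrete, the sketch below generates \(\theta \) co-location events from \(p'\) and the normalized matrix \(\textbf{R}'\) under our own simplifications: the synthetic graph is an ER graph, and each event is assigned to a friend pair chosen uniformly at random, which may differ in detail from the assignment used in our implementation.

```python
import numpy as np
import networkx as nx

def generate_co_locations(p_prime, Q_prime, n, theta, seed=0):
    """Sketch of the co-location step: build a synthetic ER friendship graph
    with edge probability p', normalize Q' into R', and draw theta events
    of the form (friend pair, time instant, location)."""
    rng = np.random.default_rng(seed)
    G = nx.erdos_renyi_graph(n, p_prime, seed=seed)
    R = np.clip(Q_prime, 0.0, None)
    R = R / R.sum()                                  # co-location probability matrix R'
    edges = list(G.edges())
    flat = R.ravel()
    events = []
    for _ in range(theta):
        u, v = edges[rng.integers(len(edges))]       # uniformly chosen friend pair
        cell = rng.choice(flat.size, p=flat)         # (time, location) cell of R'
        t, x = np.unravel_index(cell, R.shape)
        events.append((u, v, int(t), int(x)))
    return events
```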

Note that besides TPM [4], there are other existing location synthesizers such as [13, 16]. We evaluated TPM for two reasons. First, TPM is easy to implement. Second, none of the existing synthesizers [4, 13, 16] introduces the concept of “friends,” and therefore they all lack the utility of co-locations; in other words, they have the same values for the utility of co-locations. In our experiments, we quantitatively show that TPM lacks the utility of co-locations, which implies that the synthesizers in [13, 16] also lack it. We leave synthesizing locations other than co-locations in Proposal using the synthesizers in [13, 16] for future work.

In each of Uniform, TPM, and Proposal, the length of a time instant was set to one hour, and a trace of one day was generated for each of the n users. For each synthesizer, synthetic traces were generated five times, and the utility metrics were averaged over the five runs to stabilize the results.

5.4 Experimental results for parameters in our location synthesizer

First, we evaluated how well parameters of our synthesizer (Proposal) preserve the information about co-locations.

Friendship probability \(p^{\prime }\)   In Figs. 8 and 9, we show the absolute error of \(p^{\prime }\) when we changed the privacy budget \(\epsilon _1\) from 0.01 to 5 in Foursquare and Gowalla.

Fig. 8  Absolute error of \(p'\) versus \(\epsilon _1\) in Foursquare

Fig. 9  Absolute error of \(p'\) versus \(\epsilon _1\) in Gowalla

It is seen from Fig. 8 that the absolute error rapidly decreases as \(\epsilon _1\) increases from 0.01 to 0.5. Note that the absolute error depends only on the Laplace noise, as \(p' = p + \text {Lap}(\frac{2}{n\epsilon _1})\); this explains why the absolute error decreases as \(\epsilon _1\) increases. It is also seen from Fig. 8 that the absolute error is extremely small and almost equal to 0 for \(\epsilon _1 \ge 0.2\). This result demonstrates that we can accurately estimate the friendship probability \(p'\) with a small privacy budget of \(\epsilon _1=0.2\) under node DP for friendship data. We observe from Fig. 9 that the result for Gowalla is similar to that for Foursquare.
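
For reference, this perturbation of the friendship probability is a single draw of Laplace noise, matching the formula above (any clipping of \(p'\) into [0, 1] is omitted in this sketch):

```python
import numpy as np

def perturb_friendship_probability(p, n, epsilon1, seed=0):
    """p' = p + Lap(2 / (n * epsilon1))."""
    rng = np.random.default_rng(seed)
    return p + rng.laplace(scale=2.0 / (n * epsilon1))
```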

Fig. 10  MAE of \(\overline{\textbf{Q}}'\) versus c (\(\epsilon _2 = 1\)) in Foursquare

Fig. 11  MSE of \(\overline{\textbf{Q}}'\) versus c (\(\epsilon _2 = 1\)) in Foursquare

Fig. 12  MAE of \(\overline{\textbf{Q}}'\) versus \(\epsilon _2\) (\(c=5\)) in Foursquare

Fig. 13  MSE of \(\overline{\textbf{Q}}'\) versus \(\epsilon _2\) (\(c=5\)) in Foursquare

Per-category co-location count matrix \(\overline{\textbf{Q}}'\)  The MAE and the MSE of \(\overline{\textbf{Q}}'\) in Foursquare are shown in Figs. 10 and 11. Here, we set the privacy budget \(\epsilon _2\) in user-level DP for training traces to \(\epsilon _2=1\), and we set the upper limit c on the number of co-locations per user in trimming to \(c=1\), 5, 10, 15, or 20.
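
The construction of \(\textbf{Q}{'}\) with trimming can be sketched as follows; representing each co-location as a (user, time, location) record and leaving the Laplace noise scale as an explicit parameter (to be set from the sensitivity analysis for \(\epsilon _2\)-user-level DP) are our simplifications.

```python
import numpy as np

def trim_and_perturb(co_locations, num_time_instants, num_locations, c,
                     noise_scale, seed=0):
    """Keep at most c co-location records per user (trimming), count the kept
    records per (time, location) cell, and add Laplace noise with the given
    scale to every cell."""
    rng = np.random.default_rng(seed)
    per_user = {}
    Q = np.zeros((num_time_instants, num_locations))
    for user, t, x in co_locations:
        if per_user.get(user, 0) < c:
            per_user[user] = per_user.get(user, 0) + 1
            Q[t, x] += 1
    return Q + rng.laplace(scale=noise_scale, size=Q.shape)
```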

It is seen from Figs. 10 and 11 that Privelet has much smaller MAE and MSE than the Laplace mechanism. This means that Privelet significantly reduces the amount of noise for each POI category and each time instant, e.g., subway in the morning. It is also seen that Privelet attains the smallest MAE and MSE at \(c=5\) and \(c=10\), respectively. These results show that there is a trade-off between the effects of trimming and the Laplace noise; i.e., the effect of trimming is large when c is small, whereas the amount of the Laplace noise is large when c is large.

In Figs. 12 and 13, we also show the relationship between \(\epsilon _2\) and the MAE/MSE in Foursquare when we set c to \(c=5\). It is seen from these figures that the MAE and the MSE rapidly decrease as \(\epsilon _2\) increases from 0.1 to 1 and they remain almost unchanged after \(\epsilon _2=1\).

It is seen from Fig. 13 that when \(\epsilon _2\) is 2.5 or more, the MSE of Privelet is larger than that of Laplace. One reason for this is that Privelet adds noise to each node of a tree structure and the number of nodes in Privelet is larger than that of elements in \(\textbf{Q}\).

The MAE and the MSE of \(\overline{\textbf{Q}}'\) in Gowalla are shown in Figs. 14, 15, 16, and 17. We set the privacy budget \(\epsilon _2\) and the upper limit c to the same values as in Foursquare.

It is seen from Figs. 14 and 15 that Privelet has the smallest MAE and MSE when \(c=1\), unlike the results in Foursquare. This is because the number of co-locations in Gowalla was much smaller than that of Foursquare and was not affected by trimming even when \(c=1\).

It is also seen from Figs. 16 and 17 that the MAE and the MSE decrease significantly when we increase \(\epsilon _2\) from 0.1 to 1. In addition, when \(\epsilon _2\) is larger than 1, the MAE of Privelet is almost unchanged. Therefore, we can obtain an accurate co-location count matrix with a reasonable privacy budget of \(\epsilon _2=1\).

Fig. 14  MAE of \(\overline{\textbf{Q}}'\) versus c (\(\epsilon _2=1\)) in Gowalla

Fig. 15  MSE of \(\overline{\textbf{Q}}'\) versus c (\(\epsilon _2=1\)) in Gowalla

Fig. 16  MAE of \(\overline{\textbf{Q}}'\) versus \(\epsilon _2\) (\(c=5\)) in Gowalla

Fig. 17  MSE of \(\overline{\textbf{Q}}'\) versus \(\epsilon _2\) (\(c=5\)) in Gowalla

Table 6 Top-10 co-location events (time instants and POI categories) in the training datasets

Top-10 co-location events  Table 6 shows the top-10 co-location events whose counts in the training traces are the largest, i.e., the top-10 elements in \(\overline{\textbf{Q}}^*\). For example, co-location events are likely to occur in “travel & transport” in the morning in Foursquare and in “travel” in the morning or at night in Gowalla. Figs. 18 and 19 show the counts and probabilities (i.e., the values of \(\overline{\textbf{Q}}^*_{ij}\), \(\overline{\textbf{Q}}'_{ij}\), \(\overline{\textbf{R}}^*_{ij}\), and \(\overline{\textbf{R}}'_{ij}\)) of the top-10 co-location events. It is seen from these figures that the count values in the Laplace mechanism are much larger than those in the training datasets. This is because we normalize each row of \(\textbf{Q}{'}\) so that all elements are non-negative and each row sums to 1; in other words, the normalization introduces a large positive bias. Privelet provides much smaller errors in counts, especially in Gowalla. This explains why the MAE/MSE of Privelet is much smaller than that of the Laplace mechanism in Figs. 14, 15, 16, and 17.

Fig. 18b shows that Privelet preserves the probability information well in Foursquare. However, Fig. 19b shows that this is not the case with Gowalla. This is because Gowalla has a very small number of co-location events (only 51 events, as described in Sect. 5.1). In this case, almost all elements (83 out of 96 elements) of \(\overline{\textbf{Q}}^*\) are 0, and Privelet also assigns 0 to almost all elements in \(\overline{\textbf{Q}}'\). In other words, Gowalla is much more challenging than Foursquare. Based on Figs. 18 and 19, we use Privelet in Foursquare and the Laplace mechanism in Gowalla in the following comparison experiments.

We emphasize that the existing synthesizers do not introduce the concept of “friends” in their models and therefore do not preserve any information about co-locations (as shown in Table 1). Our synthesizer adds this new information by training a friendship probability and a co-location count matrix. The additional privacy budgets \(\epsilon _1\) and \(\epsilon _2\) are reasonably small, e.g., \(\epsilon _1 = 0.2\) and \(\epsilon _2 = 1\).

5.5 Results of comparison experiments

Three location synthesizers  Next, we compared Proposal with two baselines: TPM and Uniform. We evaluated the utility of co-locations, population distributions, and transition matrices for each synthesizer. For the utility of co-locations, we evaluated the MAE, MSE, and JS divergence of \(\overline{\textbf{R}}'\), as described in Sect. 5.2. In TPM and Uniform, we used a uniform distribution as \(\overline{\textbf{R}}'\) because they have no concept of “friends.” For the utility of population distributions and transition matrices, we evaluated the MAE, MSE, KL divergence, and JS divergence. We used the ER or BA model for Proposal. We set \(\epsilon _1 = \epsilon _2 = \epsilon _3 = 1\) and \(c=5\). Although we set all three privacy budgets to 1 for simplicity, we can also assign a smaller value to \(\epsilon _1\) without affecting the utility, e.g., \(\epsilon _1=0.2\), as shown in Figs. 8 and 9. For the number \(\theta \) of generated co-location events in Proposal, we set \(\theta =100\). We also report the relationship between \(\theta \) and the utility in Appendix 1.

Fig. 18  Counts and probabilities of top-10 co-location events in Foursquare (\(\epsilon _2=1\), \(c=5\)). For counts, we show the values in \(\overline{\textbf{Q}}^*\) (Training) and \(\overline{\textbf{Q}}'\) (Laplace/Privelet). For probabilities, we show the values in \(\overline{\textbf{R}}^*\) (Training) and \(\overline{\textbf{R}}'\) (Laplace/Privelet)

Fig. 19  Counts and probabilities of top-10 co-location events in Gowalla (\(\epsilon _2=1\), \(c=5\)). For counts, we show the values in \(\overline{\textbf{Q}}^*\) (Training) and \(\overline{\textbf{Q}}'\) (Laplace/Privelet). For probabilities, we show the values in \(\overline{\textbf{R}}^*\) (Training) and \(\overline{\textbf{R}}'\) (Laplace/Privelet)

Fig. 20  Comparison results in Foursquare (\(\epsilon _1=\epsilon _2=\epsilon _3=1\), \(c=5\), \(\theta =100\)). a MAE/MSE/JS of \(\overline{\textbf{R}}'\), b MAE/MSE/KL/JS of \(\textbf{r}'\), c MAE/MSE/KL/JS of \(\textbf{M}'\). Smaller is better in all utility metrics. Proposal uses Privelet

Fig. 21  Comparison results in Gowalla (\(\epsilon _1=\epsilon _2=\epsilon _3=1\), \(c=5\), \(\theta =100\)). a MAE/MSE/JS of \(\overline{\textbf{R}}'\), b MAE/MSE/KL/JS of \(\textbf{r}'\), c MAE/MSE/KL/JS of \(\textbf{M}'\). Smaller is better in all utility metrics. Proposal uses the Laplace mechanism

Figs. 20 and 21 show the results. Figs. 20a and 21a show that Proposal significantly outperforms TPM and Uniform in terms of the utility of co-locations, which demonstrates the effectiveness of Proposal. In addition, Figs. 20b, c and 21b, c show that Proposal significantly outperforms Uniform and has almost the same utility as TPM in terms of the utility of the population distribution and the transition matrix.

One exception is that both Proposal and TPM have larger MSE than Uniform in Gowalla (see Fig. 21b, c). One reason for this is that the check-in data in Gowalla includes many outliers who have unique transition patterns. Because the MSE squares the error, it is significantly affected by such outliers.

However, we emphasize that Proposal has smaller MAE, KL divergence, and JS divergence than Uniform, as shown in Fig. 21b, c. This result indicates that Proposal preserves the transition matrix well.

Fig. 22  Degree distribution of the training graph

Fig. 23  Degree distribution of the ER graph (\(\epsilon _1 = 1\))

Fig. 24  Degree distribution of the BA graph (\(\epsilon _1 = 1\))

ER and BA graph models  Finally, we compared the ER model with the BA model in Proposal. We first evaluated the average degree (i.e., the average number of friends) in the training graph, the ER graph, and the BA graph. Table 7 shows the results. It is seen from this table that both the ER and BA graphs preserve the average degree well.

Table 7 Average degree

Next, we evaluated the degree distribution of each graph. Figures 22, 23, and 24 show the results. It is seen from these figures that the training graph has a power-law degree distribution. The ER graph does not preserve this property. In contrast, the BA graph has a power-law degree distribution and is very similar to the training graph. Therefore, the BA graph reproduces the friendship property of the original data more accurately than the ER graph.
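
As an illustration, the two models can be instantiated and their degree distributions extracted with networkx as follows; choosing the BA attachment parameter so that the expected average degree matches \(p'(n-1)\) is our own assumption about the parameterization.

```python
import networkx as nx

def synthetic_graphs(n, p_prime, seed=0):
    """ER graph with edge probability p' and a BA graph whose attachment
    parameter is chosen to roughly match the expected average degree
    p' * (n - 1) of the ER graph."""
    er = nx.erdos_renyi_graph(n, p_prime, seed=seed)
    ba = nx.barabasi_albert_graph(n, max(1, round(p_prime * (n - 1) / 2)), seed=seed)
    return er, ba

def degree_fractions(G):
    """Fraction of nodes with each degree, e.g., for log-log plots of the
    degree distribution."""
    hist = nx.degree_histogram(G)
    return [h / G.number_of_nodes() for h in hist]
```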

5.6 Summary

In summary, through comprehensive evaluation using two real datasets and various utility metrics, we showed the following results.

  • Proposal preserves the information about co-locations of friends, whereas existing location synthesizers such as TPM do not. In particular, Proposal with the BA graph has a power-law degree distribution and preserves the friendship property of the training graph.

  • Proposal also preserves other statistical features such as the population distribution and the transition matrix.

  • Proposal synthesizes such realistic traces while providing strong privacy guarantees, e.g., 0.2-node DP (\(\epsilon _1=0.2\)) for the training graph and 2-user-level DP (\(\epsilon _2 = \epsilon _3 = 1\)) for the training traces.

We need additional privacy budgets \(\epsilon _1\) and \(\epsilon _2\) to preserve the information about co-locations, which has not been considered in the existing synthesizers (as shown in Table 1). The additional privacy budgets are reasonably small, e.g., \(\epsilon _1=0.2\) and \(\epsilon _2=1\). Thus, we conclude that synthetic trace generation that preserves co-locations of friends is now possible under strong privacy notions such as node DP and user-level DP.

6 Conclusion

We proposed a location synthesizer for generating synthetic traces that include co-locations of friends. Our location synthesizer generates such traces, while providing node DP for a training graph and user-level DP for training traces. Through comprehensive experiments using two real datasets, we showed that our synthesizer generates synthetic traces that preserve information about co-locations, such as the friendship probability, the co-location count matrix, and the degree distribution. Our synthetic traces also preserve other statistical features, such as the population distribution and transition matrix. The proposed synthesizer generates such traces while providing node DP and user-level DP with reasonable privacy budgets, e.g., 0.2-node DP (\(\epsilon _1 = 0.2\)) for the training graph and 2-user-level DP (\(\epsilon _2= \epsilon _3 = 1\)) for the training traces. For example, our synthetic traces are useful for studying the effectiveness of friend suggestion on SNS based on co-locations [50].

In this work, we regarded the number \(\theta \) of generated co-locations as a tuning parameter. For future work, we would like to automatically determine an appropriate value of \(\theta \) while providing DP for the training traces. Another interesting future work would be to incorporate the real-valued friendship level (rather than 0/1 considered in this work) between users into our algorithm. We would also like to use graph generation models with parameters other than a single friendship probability (e.g., exponential random graph model [43], stochastic block model [17]) under DP.