Location histogram privacy by sensitive location hiding and target histogram avoidance/resemblance (extended version)

A location histogram is comprised of the number of times a user has visited locations as they move in an area of interest, and it is often obtained from the user in applications such as recommendation and advertising. However, a location histogram that leaves a user's computer or device may threaten privacy when it contains visits to locations that the user does not want to disclose (sensitive locations), or when it can be used to profile the user in a way that leads to price discrimination and unsolicited advertising. Our work introduces two privacy notions to protect a location histogram from these threats: sensitive location hiding, which aims at concealing all visits to sensitive locations, and target avoidance/resemblance, which aims at concealing the similarity/dissimilarity of the user's histogram to a target histogram that corresponds to an undesired/desired profile. We formulate an optimization problem around each notion: Sensitive Location Hiding (SLH), which seeks to construct a histogram that is as similar as possible to the user's histogram but associates all visits with nonsensitive locations, and Target Avoidance/Resemblance (TA/TR), which seeks to construct a histogram that is as dissimilar/similar as possible to a given target histogram but remains useful for getting a good response from the application that analyzes the histogram. We develop an optimal algorithm for each notion and also develop a greedy heuristic for the TA/TR problem. Our experiments demonstrate that all algorithms are effective at preserving the distribution of locations in a histogram and the quality of location recommendation. They also demonstrate that the heuristic produces near-optimal solutions while being orders of magnitude faster than the optimal algorithm for TA/TR.


Introduction
A location histogram is a statistical summary of a user's whereabouts, comprised of the number of times a user has visited each location in an area of interest. Location histograms are often obtained from users, in the context of applications including recommendation [37,38,80], advertising [16,25], and location pattern discovery [79]. For example, a recommender application typically employs a set of location histograms each corresponding to a different user (i.e., a user-location matrix) as a training set, and it aims at recommending locations that a user may be interested in visiting based on the user's histogram [80]. Location histograms are also often visualized or analyzed directly [84].
However, a location histogram that leaves a user's computer or device may pose a threat to the user's privacy. This happens when the histogram contains visits to sensitive locations that the user does not want to disclose, because they are associated with confidential information (e.g. a temple is associated with a religion, and the headquarters of a political organization with certain political beliefs), or when the histogram can be used to profile the user (e.g. as "wealthy" or "minority member") leading to price discrimination [46,47] and unsolicited advertising [6]. For example, if the histogram reveals that a user frequently visits expensive restaurants, a targeted-advertisement application may display to the user advertisements about products and services that are priced higher than normal [46,47].
In this work, we introduce two novel notions of histogram privacy, sensitive location hiding and target avoidance/resemblance, for protecting against the disclosure of sensitive locations and user profiling, respectively. Sensitive location hiding aims at concealing all visits to user-specified sensitive locations, by producing a sanitized histogram, in which the frequencies associated with the sensitive locations are equal to zero. This protects a user from an adversary who receives the sanitized histogram, knows the set of locations considered to be sensitive, and tries to infer which of these sensitive locations were visited by the user. By enforcing the notion of sensitive location hiding, users are able to disseminate their location histogram in order to benefit from location-based services, such as location recommendation, while being protected from the inference of their sensitive locations and the aforementioned consequences such inference may have.
Target avoidance aims at concealing the fact that the user's histogram is similar to an undesirable histogram that, if disseminated, would lead to undesired user profiling. For example, a user may wish to make their histogram dissimilar to a target histogram of a typical wealthy person, containing frequent visits to expensive restaurants, to avoid price discrimination [46]. As another example, a user's location histogram may allow the inference of the user's political affiliation, religious beliefs, and sexual orientation, which may lead to emotional distress, harassment or even persecution. Thus, a user would wish to avoid disseminating a histogram that is similar to histograms that can lead to such undesirable inferences. This protects from adversaries who use the sanitized histogram and the target histogram of a person with an undesirable profile, to infer that the user's histogram resembles the latter histogram.
Target resemblance is a variant of target avoidance, in which the user expressly wishes to make their histogram similar to the target histogram representing a desirable profile. For example, the desirable target histogram for a tourist could be that of a local resident in order to avoid discriminatory practices towards tourists (e.g., price discrimination). As another example, consider a company that engages in secret discriminatory hiring practices by preferentially hiring members of a particular demographic group. There are cases where companies have been shown to discriminate based on sexual orientation when hiring [70]. In these cases, a person who wishes to be hired will want to make their histogram resemble that of a heterosexual person, so as to avoid discriminatory treatment. The target histogram may be specified by the users themselves, or selected with the help of domain experts (see Section 3.3). Enforcing target resemblance protects from adversaries who use the sanitized histogram and the target histogram of a person with a desirable profile, to infer that the user's histogram does not resemble the latter histogram.
Comparing target avoidance and target resemblance, we see that in both cases the adversary aims to infer whether or not the sanitized histogram resembles a given target histogram. The difference is that, in target avoidance, the user wants the adversary to conclude that there is no resemblance, whereas in target resemblance the user wants the opposite.
Our privacy notions can be achieved by histogram sanitization, i.e., by changing the frequencies of location visits in the histogram. However, sanitization incurs a quality (utility) loss, which must be controlled to ensure that the user obtains a good response from the application which uses their sanitized histogram. To achieve this balance between privacy and quality, we define an optimization problem around each privacy notion: the Sensitive Location Hiding (SLH) problem, which seeks to construct a sanitized histogram with minimum quality loss, and the Target Avoidance/Resemblance (TA/TR) problem, which seeks to avoid/resemble the target to a level at least equal to a user-provided privacy parameter, while ensuring that the quality loss does not exceed a user-provided quality parameter. If it is impossible to satisfy both the privacy and the quality requirements, then the problem has no solution.
Neither notion can be achieved by existing methods for histogram sanitization. The aim of existing methods is to either (I) prevent the inference of the exact frequencies of the histogram (i.e., the number of visits to one or more locations) [2,18,31,34,56,76,84], or (II) make a user's histogram indistinguishable from a set of histograms belonging to other users [20,24,75]. Their aim is neither to hide sensitive locations, nor to avoid/resemble a target histogram. The privacy notions we introduce in the paper are important to achieve in real applications, as we discuss in Examples 2.1 and 2.2 in Section 2.
Therefore, we develop new methods for achieving the SLH and the TA/TR notions: (I) An optimal algorithm for SLH, called LHO (Location Hiding Optimal). (II) An optimal algorithm for TR, called RO (Resemblance Optimal). (III) A greedy heuristic for TR, called RH (Resemblance Heuristic). Because TA and TR are similar, we focus on TR and discuss TA briefly.
Our methods are both effective and efficient, as demonstrated by experiments using two real datasets derived from the Foursquare location-based social network [77], which together contain approximately 3400 histograms. In terms of effectiveness, all algorithms achieve the corresponding notions, or announce that it is impossible to achieve them, and they are additionally able to preserve: (I) the distribution of locations in a histogram, which is useful in applications such as aggregate query answering and classification [41,84], and (II) the quality of location recommendation based on Collaborative Filtering [44]. In addition, the heuristic produces near-optimal solutions (up to 1.5% worse than the optimal), with respect to preserving distribution similarity. In terms of efficiency, all algorithms scale well with the histogram parameters, requiring from less than 1 second (the LHO algorithm) to 5 minutes (the RO algorithm). In addition, the RH heuristic is more efficient than the optimal algorithm by at least two orders of magnitude.
We note that our notions are framed in the context of location histograms but can be applied to any histogram. For example, they could be applied to a histogram comprised of webpage visits. The resultant sanitized histogram would then conceal visits to webpages that a user does not want to disclose, or it would resemble/avoid a target histogram for protecting the user from targeted advertising based on their webpage visits.

Organization. We provide an overview of and motivation for our approach in Section 2; we introduce formal notation, and we formalize the privacy notions, the adversary models, and the optimization problems we solve in Section 3; we describe our algorithms and our heuristics in Section 4; we evaluate our approach in Section 5; we discuss related work in Section 6; we conclude the paper in Section 7.

Overview and motivation of our approach
This section provides examples to motivate the need for sensitive location hiding and target resemblance and also provides a high-level overview of the optimization problems and methods for solving them.

Sensitive Location Hiding
Given a set of sensitive locations, a histogram satisfies the notion of sensitive location hiding when the frequency of each of its sensitive locations is zero. Clearly, one simple strategy to achieve this notion is by setting the frequency of each sensitive location of a given histogram to zero. However, this strategy may have a substantial negative impact on the quality (utility) of the histogram in location histogram applications. This is because it reduces the size (sum of frequencies) of the histogram. A size reduction should be avoided because some important statistics depend on the size of the histogram. An example of such statistics is the fraction of all users' visits to a particular location in a city (i.e., the ratio between the sum of the frequency of the location over all users' histograms and the sum of the sizes of these histograms), which is a simple indicator of the popularity of the location. Another example is the average number of visits to a location (i.e., the ratio between the size of the user's histogram and the number of locations in the histogram), which is used in location recommendation [8,44].
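To make concrete why size must be preserved, the two size-dependent statistics mentioned above can be computed directly from histograms. The sketch below uses hypothetical dictionary-based histograms for illustration; note that zeroing out sensitive bins would shrink the denominators and distort both statistics.

```python
def location_popularity(histograms, loc):
    # Fraction of all users' visits that go to `loc`: the sum of the
    # location's frequency over all histograms, divided by the sum of
    # the histogram sizes.
    visits_to_loc = sum(h.get(loc, 0) for h in histograms)
    total_visits = sum(sum(h.values()) for h in histograms)
    return visits_to_loc / total_visits

def average_visits(histogram):
    # Average number of visits per location in one user's histogram:
    # the histogram's size divided by its number of locations.
    return sum(histogram.values()) / len(histogram)
```

For instance, with two users' histograms {"cafe": 2, "park": 2} and {"cafe": 1, "park": 3}, the popularity of "cafe" is 3/8.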
A different strategy that achieves the sensitive location hiding notion, while preserving the size of the histogram, is to redistribute the frequency counts of the sensitive locations to nonsensitive ones. However, the redistribution needs to be performed in a way that preserves the quality (utility) of the histogram in location histogram applications. The impact of each possible redistribution on quality must be quantified, and the selected redistribution strategy must be the one with the lowest impact. We quantify the impact of a redistribution strategy with a quality distance function, similarly to most works on histogram sanitization [2,34,76,84]. This function offers generality, because different functions can be chosen for different applications.
The above discussion motivates the formulation of the Sensitive Location Hiding (SLH) optimization problem: Given a histogram H, a set of sensitive locations, and a quality distance function, produce a sanitized histogram H' such that the frequency of each sensitive location in H' is 0, H' is as similar as possible to H, and H' has the same size as H. Similarity is measured with the quality distance function.
In Section 3.2 we give a formal definition of the SLH optimization problem, discuss the adversary model it provides protection against, and show that the problem is weakly NP-hard [52]. In addition, we discuss a variation of the problem which relaxes the size requirement and can be easily dealt with by our algorithms.
To illustrate the SLH notion and the SLH problem, we now provide Example 2.1, which is inspired by approaches to privacy-preserving recommendation [59,71]. However, the SLH notion and problem are not tied to recommendation and cannot be handled with existing approaches.

Example 2.1 (Illustration of the SLH notion and the SLH problem) An application provides location recommendations to users by analyzing their location profiles. To obtain a recommended location, a user must send 50 location visits to the application in the form of a location histogram. To compute the recommended location, the application uses common mining tasks, such as discovering frequent location patterns in the user's histogram and finding histograms similar to it [83]. The location histogram H of a user Alice is shown in Figure 1a. The histogram contains the number of times Alice visited each of the locations a to h. Alice is not willing to provide H to the application, because the last two locations in H, g and h, are sensitive, but she still wishes to receive a "good" recommended location from the application. Therefore, Alice solves the SLH problem and obtains the sanitized histogram H' shown in Figure 1b. The sanitized histogram preserves privacy, because it does not contain the sensitive locations. It can be sent to the application to receive a fairly accurate recommendation, because it contains 50 visits to nonsensitive locations (the visits to sensitive locations are zero and not shown) and is as "similar" as possible to H, to the extent permitted by the privacy requirement.
To optimally solve the SLH problem, the LHO algorithm finds the exact number of sensitive location visits that need to be redistributed into each nonsensitive bin (bin corresponding to a nonsensitive location), so that all sensitive location visits are redistributed and quality is optimally preserved, with respect to the quality distance function. That is, the algorithm determines the frequency of each nonsensitive location of the sanitized histogram H', so that H' has the same size as the given histogram H and is as similar as possible to it, with respect to the quality distance function. However, it is computationally prohibitive to directly compute the quality of each possible redistribution of the sensitive location visits into the nonsensitive bins and then select the optimal solution. This follows from the fact that there are O(C(K+m−1, m−1)) ways to redistribute K sensitive location visits into m nonsensitive bins (each way corresponds to a weak composition of K [7]). Therefore, LHO solves the problem by modeling it as a shortest path problem between two specific nodes, s and t, of a directed acyclic graph (DAG) (see Figure 2). The node s is labeled (0, 0), and each other node is labeled (i, j), where i ∈ [1, m] corresponds to a nonsensitive location L_i and j ∈ [0, K] corresponds to the number of sensitive location visits that will be redistributed into the nonsensitive bins 1, ..., i of the sanitized histogram H'. For example, the label (m, K) of the node t denotes the redistribution of all K sensitive location visits to all m nonsensitive bins of H'. We may refer to a node using its label. The graph contains an edge from each node (i, j) to each node (i+1, j+k) with k ∈ [0, K−j], where k denotes the number of sensitive location visits that are redistributed into the nonsensitive bin i+1. For example, the edge ((i, j), (i+1, j+k)) = ((1, 1), (2, 1)) denotes that k = 0 visits are redistributed into the nonsensitive bin i+1 = 2. Each edge ((i, j), (i+1, j+k)) has a weight that quantifies the impact on quality caused by the redistribution of k sensitive location visits into the nonsensitive bin i+1. Every path from s to t corresponds to a feasible solution of the SLH problem. This is because the nodes in the path uniquely determine how all K sensitive location visits will be redistributed into all m nonsensitive bins of H' (see property (I) in Section 4.1). In addition, the length (sum of edge weights) of the path is equal to the quality distance between the corresponding solution H' and H (see property (II) in Section 4.1). Thus, the shortest path from s to t corresponds to a histogram H' that is as similar as possible to H, and therefore it is the optimal solution of the SLH optimization problem. For example, applying the LHO algorithm to the histogram of Figure 1a, when the locations g and h are sensitive and the quality distance function is Jensen-Shannon divergence (see Section 3.1.1), produces the sanitized histogram in Figure 1b. Note that the visits to g and h are redistributed into all nonsensitive bins, so that the sanitized histogram is as similar as possible to the histogram of Figure 1a. A formal description and analysis of the LHO algorithm are provided in Section 4.1.
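As an illustration of the idea behind LHO (a sketch, not the authors' implementation), the shortest-path computation over the DAG is equivalent to a dynamic program in which a state (i, j) records the cheapest way to place j of the K sensitive visits into the first i nonsensitive bins. The sketch below assumes a per-bin quality cost `bin_cost(old, new)`, standing in for the decomposable quality distance function.

```python
def lho(hist, sensitive, bin_cost):
    """Optimally redistribute sensitive counts into nonsensitive bins.

    hist      : dict mapping location -> frequency
    sensitive : set of sensitive locations
    bin_cost  : per-bin quality cost q(old_freq, new_freq); the total
                quality distance is assumed to decompose as a sum over bins.
    """
    K = sum(hist[loc] for loc in sensitive)
    nonsensitive = [loc for loc in hist if loc not in sensitive]

    # best[j] = (cost, per-bin extra counts) for placing j of the K
    # sensitive visits into the bins processed so far (node (i, j)).
    best = {0: (0.0, [])}
    for loc in nonsensitive:
        nxt = {}
        for j, (cost, alloc) in best.items():
            for k in range(K - j + 1):   # edge: k visits go into this bin
                c = cost + bin_cost(hist[loc], hist[loc] + k)
                if j + k not in nxt or c < nxt[j + k][0]:
                    nxt[j + k] = (c, alloc + [k])
        best = nxt

    cost, alloc = best[K]                # node t = (m, K): all K visits placed
    out = dict(hist)
    for loc, extra in zip(nonsensitive, alloc):
        out[loc] += extra
    for loc in sensitive:
        out[loc] = 0
    return out, cost
```

The dynamic program visits O(m·K) states with O(K) transitions each, avoiding the enumeration of all O(C(K+m−1, m−1)) redistributions.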

Target Resemblance
Given a target histogram, a histogram satisfies the notion of target resemblance when it is similar enough to the target. A privacy distance function quantifies similarity, and a privacy parameter quantifies the threshold for determining whether the two histograms are similar enough.
Clearly, any histogram can be easily modified to be arbitrarily similar to a given target histogram, by simply redistributing all its frequency counts so that they are exactly equal to the counts in the target histogram. However, as in the case of SLH, a simplistic redistribution can deteriorate quality unacceptably. The modification to the histogram must balance between resemblance to the target histogram and similarity to the original histogram. A quality distance function quantifies the quality loss caused by the modification, and a quality parameter quantifies the threshold for determining whether the loss is acceptable or not.
The above discussion motivates the formulation of the Target Resemblance (TR) optimization problem: Given a histogram H, a target histogram H_T, a quality distance function and a quality parameter ε, and a privacy distance function and a privacy parameter c, produce a sanitized histogram H' such that its quality distance from H is at most ε, its privacy distance from H_T is minimized, and its size is the same as that of H. If the resulting privacy distance of H' from H_T is larger than c, then there is no solution.
In Section 3.3 we give a formal definition of the TR problem, discuss the adversary model it provides protection against, and show that it is weakly NP-hard. In addition, we discuss a variation that relaxes the size requirement and can be easily dealt with by our algorithms. To illustrate the TR privacy notion and optimization problem, we provide Example 2.2.

Example 2.2 (Illustration of the TR notion and problem, continuing from Example 2.1) Figure 1a shows the location histogram H of a user, Bob, who wants to use the location recommendation application. Bob is not willing to provide H to the application, because he is concerned about price discrimination as a result of frequent visits to locations f ("airport") and g ("5-star hotel"). To achieve his purpose, Bob can solve the Target Resemblance (TR) problem to generate a histogram that resembles the target histogram H_T in Figure 1c. H_T reflects a budget-conscious person, because in H_T the frequencies of locations a ("train station"), b ("2-star hotel"), and c ("3-star hotel") are relatively high, whereas the frequencies of f and g are relatively low. Hence, H_T is likely to attract lower-priced recommendations than H would, and it is more likely to prevent price discrimination [46,47]. The resemblance to H_T is achieved by generating a sanitized histogram H_RO (RO for "Resemblance Optimal") that minimizes a privacy distance function between the sanitized histogram and H_T. At the same time, Bob still wishes to receive a "good" recommended location from the application. This quality requirement is satisfied by limiting the dissimilarity between H and the sanitized histogram H_RO to a maximum of ε = 0.05, as measured by a quality distance function, so that the sanitization preserves the similarity between H and other users' histograms, which helps compute a "good" recommended location [44].
After solving the TR problem, Bob obtains the sanitized histogram H_RO in Figure 1d, which is almost identical to the target H_T.
To optimally solve the TR problem, the Resemblance Optimal (RO) algorithm finds the exact number of location visits that need to be added into, or removed from, each bin of a histogram H, so that the resultant sanitized histogram H' is as similar as possible to the target histogram H_T, and no more dissimilar from H than what is allowed by the quality threshold ε. Again, the large number of potential solutions, given by O(C(N+n−1, n−1)), where N is the size of H and n is its length, prohibits directly computing the quality of each possible solution and selecting the optimal one. Therefore, RO solves the problem by modeling it as a constrained shortest path problem in a DAG (see Figure 3). The graph contains a path (u_0^0, u_1^{N_1}, ..., u_n^{N_n}) for each allocation of N = N_n counts to the n bins of the histogram (i.e., each allocation corresponds to a possible solution to the Target Resemblance problem, ignoring the quality constraint), where a node u_i^{N_i} corresponds to allocating N_i counts to bins 1 up to and including i. Each edge carries a pair of weights: the privacy effect and the quality-loss effect of allocating the corresponding number of counts to a bin. The length of a path is equal to the dissimilarity of the corresponding allocation to the target histogram H_T, whereas the cost of the path is equal to the quality loss as compared to the user's histogram H. The algorithm finds the shortest path among those whose cost does not exceed the quality threshold ε. As the graph is a DAG, to find the optimal solution it suffices to explore it in Breadth-First Search order. First, we compute constrained shortest paths to all nodes that correspond to bin 1: u_1^{N_1}, N_1 = 0, ..., N; then, we extend these paths to all nodes that correspond to bin 2: u_2^{N_2}, N_2 = 0, ..., N, and we prune them if they violate the quality constraint; we continue all the way to u_{n−1}^{N_{n−1}}, N_{n−1} = 0, ..., N, and finally to the node u_n^{N_n}, N_n = N. The shortest path to that final node corresponds to the optimal valid allocation of N counts to bins 1, ..., n.
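The bin-by-bin exploration with pruning can be sketched as a label-propagation dynamic program (a simplified stand-in for RO, not the paper's exact algorithm): each label carries the (privacy, quality) sums of a partial path, and labels that violate ε or are dominated on both coordinates are pruned. The per-bin costs `priv_cost` and `qual_cost` are assumed decomposable, as in the text.

```python
def ro_sketch(hist, target, eps, priv_cost, qual_cost):
    """Constrained shortest path over allocations of N counts to n bins."""
    N, n = sum(hist), len(hist)
    labels = {0: [(0.0, 0.0, [])]}   # counts used so far -> (privacy, quality, alloc)
    for i in range(n):
        nxt = {}
        for used, labs in labels.items():
            # The last bin must absorb all remaining counts (size is preserved).
            ks = [N - used] if i == n - 1 else range(N - used + 1)
            for p, q, alloc in labs:
                for k in ks:
                    q2 = q + qual_cost(hist[i], k)
                    if q2 > eps:             # prune: quality constraint violated
                        continue
                    p2 = p + priv_cost(target[i], k)
                    bucket = nxt.setdefault(used + k, [])
                    if any(op <= p2 and oq <= q2 for op, oq, _ in bucket):
                        continue             # dominated by an existing label
                    bucket[:] = [l for l in bucket
                                 if not (p2 <= l[0] and q2 <= l[1])]
                    bucket.append((p2, q2, alloc + [k]))
        labels = nxt
    final = labels.get(N, [])
    if not final:
        return None                          # requirements are unsatisfiable
    return min(final)[2]                     # allocation with least privacy distance
```

For example, with hist = [3, 1], target = [1, 3], absolute-difference per-bin costs, and ε = 2, the sketch returns the allocation [2, 2], the feasible histogram closest to the target.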
When solution optimality is not necessary, the TR problem can be solved more efficiently by the RH heuristic. RH differs from the RO algorithm in that it restricts the set of bins in the histogram H whose number of location visits can increase or decrease. Specifically, it works in a greedy fashion, iteratively "moving" frequency counts from source bins to destination bins. The source bins have higher frequency in H than in the target histogram H_T, whereas the destination bins have lower frequency in H than in H_T. Thus, moving counts from source to destination bins makes the sanitized histogram more and more similar to the target histogram, but it incurs a quality loss due to changes in frequency counts. Therefore, to control the loss of quality, moves are performed for as long as the quality distance of the resultant sanitized histogram from H does not exceed the quality threshold ε. Example 2.3 below illustrates the RO algorithm and the RH heuristic.

Example 2.3 (Illustration of the RO algorithm and the RH heuristic, continuing from Example 2.2) Bob applies the RO algorithm, using JS-divergence to measure dissimilarity from H_T and from H. The algorithm produces the sanitized histogram H_RO in Figure 1d, which is as similar to H_T as allowed by the specified threshold. Similarly, Bob applies RH and obtains the sanitized histogram H_RH in Figure 1e. Comparing H_RO and H_RH to H_T, we observe that H_RO is very similar to H_T, while H_RH is slightly less similar (e.g., the frequencies of f and g are equal in H_T and H_RO, while they are not equal in H_T and H_RH). However, H_RH is still useful for getting a good recommendation from the application, because the quality loss (dissimilarity to H) does not exceed ε.
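The greedy move loop just described can be rendered as follows (an illustrative sketch of RH's strategy, not the authors' code); here `priv_dist` and `qual_dist` are whole-histogram distance functions, e.g., L1 distance in this toy setting.

```python
def rh_sketch(hist, target, eps, priv_dist, qual_dist):
    """Greedily move single visits from source bins to destination bins."""
    cur = list(hist)
    while True:
        best = None
        for s in range(len(cur)):
            if cur[s] <= target[s]:
                continue                   # not a source bin
            for d in range(len(cur)):
                if d == s or cur[d] >= target[d]:
                    continue               # not a destination bin
                cand = list(cur)
                cand[s] -= 1               # move one visit from s to d
                cand[d] += 1
                if qual_dist(hist, cand) > eps:
                    continue               # would violate the quality threshold
                p = priv_dist(cand, target)
                if best is None or p < best[0]:
                    best = (p, cand)
        if best is None or best[0] >= priv_dist(cur, target):
            return cur                     # no improving move remains
        cur = best[1]
```

With hist = [3, 1], target = [1, 3], L1 distances, and ε = 4, the heuristic moves two visits from bin 0 to bin 1 and returns [1, 3].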

Target Avoidance
Given a target histogram, a histogram satisfies the notion of target avoidance when it is dissimilar enough from the target. A privacy distance function quantifies similarity, and a privacy parameter quantifies the threshold for determining whether the two histograms are dissimilar enough.
Similarly to SLH and TR, any modification to the histogram must balance between dissimilarity to the target histogram, so as to achieve target avoidance, and similarity to the original histogram, so as to preserve quality.
For illustration and motivation, we refer again to Example 2.2. In that example, Bob can alternatively solve the Target Avoidance problem to directly avoid price discrimination. In so doing, the aim would be to avoid, not resemble, the target histogram; thus, the target H_T would need to have as many visits to each of the locations f and g as H does. For example, H_T could be identical to H.
The Target Avoidance optimization problem is formally defined in Section 3.4, in which we also describe precisely the adversary model.
The optimal algorithm AO and heuristic AH for solving the TA optimization problem are very similar to the ones described for TR above. The only essential difference is that the TA algorithms compute a longest path in the graph instead of a shortest path as in TR.

Background, problem definitions, and adversary models
In this section, we define some preliminary concepts and then we formally define the SLH, TA, and TR optimization problems. A summary of the most important notation we introduce is in Table 1.

Preliminaries
We consider an area of interest, modeled as a finite set of semantic locations L = {L_1, ..., L_|L|} of cardinality |L|, where a location L_i, i ∈ [1, |L|], is, e.g., "Italian Restaurant," "Cinema," or "Museum." We also consider a user who moves in this area. The user's histogram is a vector of integer frequencies H = (f(L_1), ..., f(L_n)), where n ≤ |L| is the length of the histogram. Each location L_i, i ∈ [1, n], has a frequency f(L_i) > 0 when L_i was visited by the user, or f(L_i) = 0 otherwise. We may refer to frequencies as counts.
We use H[i] to refer to the i-th element, or bin, of H, and N, or size, to refer to the L_1-norm |H|_1 = Σ_{i∈[1,n]} H[i] of H. We use H_{n,N} to denote the set of all histograms of length n and size N.
Having compiled H, the user wishes to submit it to a location-based application. Before submitting it, the user transforms it into a sanitized histogram H' (in a way to be made concrete in Problems 3.1 and 3.2 below) and then submits H' to the application. Next, the application returns a response to the user. Depending on the sanitization required, H' may contain zero frequency counts for some locations, or it may contain nonzero frequency counts for locations that the user never visited. If the user wishes, we can easily guarantee that H' will not contain nonzero frequency counts for locations that the user never visited, by assigning an infinite cost dq(H[i], H'[i]) to each such location L_i.

Quality loss
Since the user submits H', which is in general different from H, there will be a negative impact on the quality of the application response. The resulting loss in quality is measured by a quality distance function dq(H, H'). For every pair H, H', we require that dq(H, H') ≥ 0, and that H = H' implies dq(H, H') = 0. In addition, we require dq to decompose as a sum over bins, i.e., there must be a function q such that dq(H, H') = Σ_{i∈[1,n]} q(H[i], H'[i]). Most distances used in data mining applications in which distances between histograms/vectors must be preserved (e.g., Jensen-Shannon divergence (JS-divergence) [39], Jeffrey's divergence [56], L_2-distance (Euclidean distance) and Squared Euclidean distance [81], Variational distance [39], Pearson χ² distance [39], and Neyman χ² distance [39]) decompose as a sum over bins.
We use JS-divergence as the objective function dq in our experiments (see Section 5). JS-divergence is a standard measure for quantifying distances between probability distributions, which is often used in histogram/vector classification [45] and clustering [51]. Given two histograms H_1, H_2, the JS-divergence between them is defined as JS(H_1, H_2) = (1/2) Σ_{i∈[1,n]} P_1[i] log_2(P_1[i]/M[i]) + (1/2) Σ_{i∈[1,n]} P_2[i] log_2(P_2[i]/M[i]), where P_1 and P_2 are the distributions obtained by normalizing H_1 and H_2 by their sizes, M = (P_1 + P_2)/2, and the convention 0 · log_2(0) = 0 is used. JS-divergence is bounded in [0, 1] [39], and JS(H_1, H_2) = 0 implies no quality loss. As explained in [36], JS-divergence can also be easily extended to capture semantic similarity requirements (e.g., an Italian Restaurant is more similar to a French Restaurant than to an American Cinema), when this is needed in applications. The extended measure, called smoothed JS-divergence, requires preprocessing the histogram by kernel smoothing and then applying JS-divergence to the preprocessed histogram. Incorporating smoothed JS-divergence into our methods is straightforward and left for future work.
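For concreteness, the JS-divergence between two histograms can be computed as follows (a straightforward sketch of the standard definition, normalizing each histogram by its size):

```python
import math

def js_divergence(h1, h2):
    # Normalize each histogram by its size to obtain distributions P1, P2.
    p1 = [f / sum(h1) for f in h1]
    p2 = [f / sum(h2) for f in h2]
    m = [(a + b) / 2 for a, b in zip(p1, p2)]   # midpoint distribution M

    def kl(p, q):
        # Kullback-Leibler divergence, with the convention 0 * log2(0) = 0.
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # JS is the mean KL-divergence of P1 and P2 from their midpoint M;
    # with log base 2 it is bounded in [0, 1].
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)
```

Identical histograms yield 0, and histograms with disjoint support yield the maximum value 1.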

The Sensitive Location Hiding problem: adversary model and formal definition
As discussed in the Introduction and in Section 2, the Sensitive Location Hiding (SLH) privacy notion aims to conceal all visits to sensitive locations. We formulate the adversary model and the desired privacy property for the SLH notion as follows.
The adversary knows: (I) the sanitized histogram H' that the user submits, (II) the set of all possible sensitive locations L', and (III) the fact that, if H' is fake, then it must have been produced by the LHO algorithm in our paper. The adversary has no other background knowledge. The adversary succeeds if, based on their knowledge, they manage to determine whether or not the user visited one or more of the sensitive locations in L'.
The desired privacy property is the negation of the adversary's success criterion. That is, the adversary must not be able to infer, from the sanitized histogram, that the user has visited any of the sensitive locations.
We formally define the corresponding optimization problem as follows:

Problem 3.1 (Sensitive Location Hiding (SLH)) Given a histogram H ∈ H_{n,N}, a subset L′ ⊆ L of sensitive locations, and a quality distance function dq(), construct a sanitized histogram H′ ∈ H_{n,N} that minimizes dq(H, H′), subject to the constraint that H′ has a count of zero for every sensitive location in L′.

Intuitively, the SLH problem requires constructing a sanitized histogram by redistributing the counts of the sensitive locations of H into bins that correspond to nonsensitive locations, in the best possible way according to dq. The sensitive locations are specified by the user based on their preferences.
In the SLH problem formulation, we follow the user-centric (or personalized) approach to privacy that is employed in [1,3,15,61]. This approach requires the users to specify their own privacy preferences, so that these preferences are best reflected in the produced solutions. However, not all users may possess knowledge allowing them to identify certain locations in their histograms as sensitive. Yet, such users often know that a class of locations is sensitive, or they do not want to be associated with a class of locations [40,72]. For instance, several users may not want to be associated with visits to any type of clinic or adult entertainment location. In this case, users may employ a taxonomy¹ to identify classes of sensitive locations, which requires less detailed knowledge. This method is inspired by [40,72] and simply requires a user to select one or more nodes in the taxonomy. If a node u that is not a leaf is selected, then all locations corresponding to leaves in the subtree rooted at u will be considered as sensitive. If the selected node u is a leaf, then its corresponding location will be considered as sensitive. Such taxonomies already exist for location-based data, and they can also be automatically constructed based on machine learning techniques [67]. For example, in the Foursquare taxonomy (see Section 5), there is an aggregate category (internal node) "Medical center" which contains more specific categories (leaves) "Hospital," "Rehab center," etc.

Clearly, the SLH problem seeks to produce a sanitized histogram H′ with the same size as H. As discussed in the Introduction, this allows preserving statistics that depend on the size of the histogram, which are important in location-based applications, such as location recommendation.
However, it is also possible to require the sanitized histogram H′ to have a given size instead (e.g., when an application requires a histogram to have a certain number of location counts, or in pathological cases where redistribution leads to undesirable/implausible histograms). This leads to a variation of the SLH problem, referred to as SLHr, which requires redistributing r ≥ 0 counts of sensitive locations into the bins corresponding to nonsensitive locations. Note the following choices for r in SLHr: (I) For r = 0, the SLHr problem requires constructing a sanitized histogram where each sensitive location has count 0 and each nonsensitive location has a count equal to its count in H. Such a histogram is trivial to produce, by simply replacing the count of each sensitive location with 0. (II) For r = Σ_{L_i∈L′} f(L_i) (i.e., equal to the total count of sensitive locations), SLHr becomes equivalent to the SLH problem. (III) For r > Σ_{L_i∈L′} f(L_i), the SLHr problem requires constructing a sanitized histogram with a larger size than H. As we will explain in Section 4.1, it is straightforward to optimally solve SLHr based on our LHO algorithm.

Solutions to the SLH optimization problem satisfy the desired privacy property
The adversary cannot distinguish between a user A who has only visited nonsensitive locations and thus submits a non-sanitized histogram H_A, and a user B who has visited some sensitive locations and for whom the algorithm has produced a sanitized histogram H′_B that is identical to H_A. This is because every possible sanitized histogram that the LHO algorithm can output is a valid histogram that could have legitimately been produced by a user. Note that, if there are histograms that cannot be produced by a legitimate user, LHO can be trivially adapted to never output such histograms. This adaptation is easy because all histograms are encoded as paths in a graph, so illegitimate histograms are also paths in the graph, referred to as illegitimate paths, and these histograms can be avoided by simply changing the shortest-path-finding algorithm to an algorithm that finds a shortest path which is not contained in a given subset of illegitimate paths [55].

The Target Resemblance problem: adversary model and formal definition
As discussed in the Introduction and in Section 2, for the Target Resemblance (TR) privacy notion the user specifies a target histogram H_T to resemble, a quality parameter ε, and a privacy parameter c. The objective of the TR optimization problem is to create a sanitized histogram H′ that is as similar as possible to H_T, subject to the quality constraint dq(H, H′) ≤ ε. The privacy distance function that quantifies the notion of similarity is denoted dp(H′, H_T). If dp(H′, H_T) > c, then H′ is not acceptable, because it is not similar enough to the target.
The function dp(H′, H_T) is nonnegative and it must decompose as a sum over bins, i.e., there must be a function p such that dp(H′, H_T) = Σ_{i∈[1,n]} p(H′[i], H_T[i]), where H′ and H_T are expanded to be defined over the same set of locations, using zeros to fill in missing location counts. In TR, privacy is maximum when H′ = H_T (dp(H′, H_T) = 0), because there is no better resemblance than being identical. Any function with these properties would be suitable as dp (e.g., JS-divergence, or L2-distance). We use JS-divergence as dp in our experiments (see Section 5).
We can formulate the adversary model and the desired privacy property for this problem as follows: The adversary knows (I) the histogram H′ that the user submits, (II) a target histogram H_T, (III) a privacy distance function dp(), and (IV) a privacy parameter c.
Upon receiving H′, the adversary compares it to the target H_T in order to profile the user. For example, if an adversary wants to determine whether the user is a member of a particular ethnic/religious/social group, the target histogram is the histogram of a typical member of that group. Formally, the adversary makes this determination by comparing dp(H′, H_T) to c, i.e., by comparing the privacy distance between the user's submitted histogram H′ and H_T to the privacy parameter c. If dp(H′, H_T) ≤ c, the adversary concludes that the user is a member of the group; otherwise, they conclude that the user is not a member of the group. The adversary has no other background knowledge. In particular, the adversary does not know whether the user submitted their true histogram or a modified histogram aiming to resemble a particular target histogram. The adversary succeeds if they conclude that the user is not a member of the group, i.e., dp(H′, H_T) > c.
The desired privacy property is the negation of the adversary's success criterion. In TR, the desired privacy property is dp(H′, H_T) ≤ c.
We formally define the corresponding optimization problem as follows:

Problem 3.2 (Target Resemblance (TR)) Given two histograms H, H_T ∈ H_{n,N}, a privacy distance function dp(), a privacy parameter c, a quality distance function dq(), and a maximum quality loss threshold ε ≥ 0, construct a sanitized histogram H′ ∈ H_{n,N} that minimizes dp(H′, H_T), subject to dq(H, H′) ≤ ε.

If the resulting H′ is such that dp(H′, H_T) > c, then it is impossible to achieve both the desired privacy property and the desired quality constraint.
Intuitively, the TR problem requires constructing a sanitized histogram H′ of the same length and size as H and H_T that offers the best possible privacy by being as similar as possible to the target histogram H_T according to dp, while incurring a quality loss of at most ε according to dq.
The function dq is selected by the location-based application provider (the recipient of the sanitized histogram) and is provided to the user together with an intuitive explanation of what different values of dq() mean for quality. For example, dq() ≥ 0.8 means "very low quality", 0.6 ≤ dq() ≤ 0.8 means "low quality", etc., where "quality" refers to the quality of the application response (e.g., recommendation) that the user receives. Then, in the spirit of user-centric (or personalized) privacy [22,62], the user uses the above explanation by the provider to choose a value of ε that corresponds to their tolerance for quality loss.

The problem requires the user to specify the target histogram H_T. However, some users may not possess sufficient knowledge to perform this task, even though they want to resemble a person with certain characteristics (e.g., a wealthy person). In these cases, H_T can be constructed as follows. The user chooses a target probability distribution h_T from a repository of probability distributions that are constructed by domain experts and labeled accordingly (e.g., a distribution corresponding to a "wealthy" profile, a "tourist" profile, a "healthy person" profile [28,68]), in the same way that experts compile, e.g., adblock filters (lists of URLs to block) or lists of virus signatures for antivirus software. To choose one of these profiles, the user looks for a label that they want to resemble. This setup is very similar to other papers in the literature [1,15]. Note that the resulting target histogram, obtained by scaling h_T by the size N of the user's histogram, does not necessarily have integer counts. Strictly speaking, this violates the requirement of histograms to have integer counts, but that is not a problem for our methods, because the privacy distance functions do not need integer arguments. However, we do require the histogram H′ that the algorithms output to have integer counts.
Clearly, the TR problem requires constructing a sanitized histogram H′ with the same size as H and H_T. That is, it assumes that the desirable target histogram H_T has the same number of counts as H, but these counts are distributed differently from those of H. However, it is also possible to relax this assumption. This leads to a variation of the TR problem, referred to as TR_{|H_T|_1}, which instead requires the sanitized histogram H′ to only have the same size as H_T, while it can be different from the size of H. It is straightforward to optimally (resp., heuristically) solve TR_{|H_T|_1} based on our RO algorithm (resp., our RH heuristic) (see Section 4.2).

Solutions to the TR optimization problem satisfy the desired privacy property
The TR problem tries to minimize dp(H′, H_T), while satisfying the quality constraint dq(H, H′) ≤ ε. Of course, a particular choice of ε affects privacy. If ε is low, an algorithm for the TR problem may output an H′ that is the same as or very similar to H, because all histograms that satisfy the specified quality constraint are close to H. Then, the user has to decide whether this H′ is safe to release.
Given the privacy parameter c, it is not safe to release H′ when dp(H′, H_T) > c. In this case, the user will decide not to release any histogram at all. Alternatively, the user may want to re-run the algorithm with a larger ε, i.e., to sacrifice more quality in order to achieve the privacy requirement.
The user's decision may depend on the intuitive meaning of the function used for dp. For example, if dp is Pearson χ² and the target H_T models a "wealthy" user, then dp(H′, H_T) quantifies how much more likely it is that H′ has been produced by a user who follows the "wealthy" profile compared to any other profile². Thus, if this likelihood ratio exceeds c, then the user may not want to release that H′.
It is also trivial to exclude solutions with dp(H′, H_T) > c by modifying our methods to disregard such solutions and terminate if no solution exists. In conclusion, the user either submits a histogram that satisfies the privacy property, or nothing at all.

The Target Avoidance problem
As mentioned above, Target Avoidance (TA) is a variant of the Target Resemblance (TR) problem, which we briefly discuss below.
The TA optimization problem is the mirror image of TR: given H, H_T ∈ H_{n,N}, dp(), c, dq(), and ε ≥ 0, it constructs a sanitized histogram H′ ∈ H_{n,N} that maximizes dp(H′, H_T), subject to dq(H, H′) ≤ ε. If the resulting H′ is such that dp(H′, H_T) < c, then it is impossible to achieve both the desired privacy property and the desired quality constraint.
Intuitively, the TA problem requires constructing a sanitized histogram H′ of the same length and size as H and H_T. The sanitized histogram must offer the best possible privacy by being as dissimilar as possible to the target histogram H_T according to dp, while incurring a quality loss of at most ε according to dq. The threshold ε and target histogram H_T are specified by the user based on their preferences. For example, the user can set H_T to H, in order to avoid H itself, or to a part of H that contains the locations that characterize an undesirable profile (e.g., frequent visits to airports) or are frequented by a certain ethnic minority (which may help infer that an individual belongs to the minority). The user could also choose H_T with the help of domain experts, as in the TR problem.
In terms of an adversary model, the adversary has the same knowledge as in TR and they succeed if dp(H′, H_T) < c. If the algorithm does not produce an H′ such that dp(H′, H_T) ≥ c, then the user can either not submit any histogram at all, or re-run the algorithm with a larger ε. The proof that the TA problem leads to a solution satisfying the desired privacy property is similar to that for TR (omitted).
The TA problem is very similar to the TR problem. This is established through a reduction from TA to TR that is given in the Appendix (Section A.4). There is also a variation of TA, referred to as TA_{|H_T|_1}, which requires the sanitized histogram H′ to have the same size as H_T, but not necessarily as H. Again, our methods can easily deal with this variation.
Since the SLH, TR, and TA problems are weakly NP-hard, it is possible to design pseudopolynomial algorithms³ to optimally solve them. We present optimal algorithms based on shortest/longest-path problems for the SLH, TA, and TR problems. In addition, we present heuristic algorithms for the TR and TA problems. The heuristics find solutions of comparable quality to those of the optimal algorithms but are more efficient by two orders of magnitude. Furthermore, we explain how our methods can deal with the variations SLHr and TA/TR_{|H_T|_1} of the SLH and TA/TR problems, respectively.

LHO: An optimal algorithm for SLH
This section presents Location Hiding Optimal (LHO), which optimally solves the SLH problem. Before presenting LHO, as motivation, we consider a simple algorithm that distributes the counts of the sensitive location(s) to the nonsensitive bin(s) proportionally to the counts of the nonsensitive bins. That is, it aims to construct an H′ by initializing it to H and then increasing the count of each nonsensitive bin i by x[i] = f(L_i) · K/(N − K), where K is the total count of sensitive locations, while assigning 0 to each sensitive bin. While intuitive, this algorithm fails to construct an H′, for a given histogram H and distance function dq, when x[i] is not an integer, and it may also lead to solutions with large dq(H, H′) (i.e., low data utility), because it does not take into account the input distance function dq.
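The proportional baseline can be sketched as follows; the function name is illustrative, and the sketch returns None in exactly the failure case discussed above (a non-integer share x[i]):

```python
def proportional_redistribution(H, sensitive):
    """Naive baseline for SLH: move the K sensitive counts into the
    nonsensitive bins proportionally to their current counts.

    Returns None when some share x[i] = H[i] * K / (N - K) is not an
    integer, which is the case in which this simple scheme fails."""
    N = sum(H)
    K = sum(H[i] for i in sensitive)
    out = []
    for i, c in enumerate(H):
        if i in sensitive:
            out.append(0)          # sensitive bins are zeroed
        else:
            x = c * K / (N - K)    # proportional share of the K counts
            if x != int(x):
                return None        # non-integer share: baseline fails
            out.append(c + int(x))
    return out
```

Even when it succeeds, the allocation ignores dq, which is why LHO searches over all redistributions instead.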
We now discuss the LHO algorithm. Without loss of generality, we assume that the nonsensitive locations correspond to the first n − |L′| bins of the original histogram H = (f(L_1), ..., f(L_n)), while the remaining |L′| bins correspond to the sensitive locations. The total count of sensitive locations in H is K = Σ_{L_i∈L′} f(L_i). LHO must move (redistribute) these counts into the nonsensitive bins, while minimizing the quality distance dq().
The LHO algorithm is founded on the following observation: The optimal way of redistributing counts to each nonsensitive bin of H corresponds to a shortest path between two specific nodes of a search space graph G_LHO(V, E), where V and E are the set of nodes and the set of edges of G_LHO, respectively. In the following, we discuss the construction of G_LHO and the correspondence between this shortest path and the solution to the SLH problem. Then, we discuss the LHO algorithm.
Fig. 4: Search space graph G_LHO. Each edge connects nodes of consecutive layers and has a weight equal to the error E_{i+1,k}, where i + 1 is the layer of the end node of the edge and k is the number of redistributed counts. E_{i+1,k} represents the impact of redistributing (i.e., adding) k counts into bin i + 1 of the sanitized histogram H′, which is initialized to the original histogram H. The missing nodes and edges are denoted with "...".

G_LHO is a multipartite directed acyclic graph (DAG) (see Figure 4) such that:
- It contains n − |L′| + 1 layers of nodes. Layer 0 comprises a single node, and layers 1, ..., n − |L′| comprise K + 1 nodes each. Each layer 1, ..., n − |L′| corresponds to a nonsensitive bin.
- The single node in layer 0 is labeled (0, 0), and each node in layer i ∈ [1, n − |L′|] is labeled (i, j), where j ∈ [0, K] denotes the redistribution (i.e., addition) of j counts to bins 1 up to and including i of the sanitized histogram. We may refer to nodes of G_LHO using their labels.
- There is an edge ((i, j), (i + 1, j + k)) from node (i, j) to node (i + 1, j + k), for each k ∈ [0, K − j]. That is, each node labeled (i, j) is connected to every node in the following layer i + 1 that corresponds to a count of at least j.
- Each edge ((i, j), (i + 1, j + k)) is associated with a weight equal to the error E_{i+1,k} = q(H[i + 1], H[i + 1] + k), where q is the per-bin term of the decomposable distance dq. The error E_{i+1,k} quantifies the impact on quality that is incurred by redistributing (i.e., adding) k counts into bin i + 1.
Let P be a path comprised of nodes (0, 0), (1, k_1), ..., (n − |L′|, k_{n−|L′|}) of G_LHO. The properties below easily follow from the construction of G_LHO: (I) The path P corresponds to an addition of k_i − k_{i−1} counts to the i-th bin of the histogram, for each i ∈ [1, n − |L′|], where k_0 = 0. (II) The length of P is equal to the total weight E_{1,k_1} + ... + E_{n−|L′|,k_{n−|L′|}} of the edges in P. This total weight is the total quality loss incurred by the allocation corresponding to P.
Thus, the path P corresponds to a sanitized histogram H′ whose first n − |L′| bins have counts H′[i] = H[i] + (k_i − k_{i−1}) and whose remaining |L′| bins have count 0. Conversely, each possible allocation of the K sensitive counts into nonsensitive bins corresponds to a path between the nodes (0, 0) and (n − |L′|, K) of G_LHO, which represents a feasible solution to the SLH problem. Therefore, the shortest path between the nodes (0, 0) and (n − |L′|, K) of G_LHO (i.e., the path with the minimum length E_{1,k_1} + ... + E_{n−|L′|,K}; ties are broken arbitrarily) represents a sanitized histogram H′ which is the optimal solution to SLH. This is because H′ has minimum dq(H, H′), the same size as H, and a zero count for each sensitive location.
We now present the pseudocode of the LHO algorithm. In step 1, the algorithm constructs the search space graph G_LHO. In step 2, the algorithm finds a shortest path between the nodes (0, 0) and (n − |L′|, K) of G_LHO. In step 3, the sanitized histogram H′ corresponding to the shortest path (i.e., the optimal solution to the SLH problem) is created and, last, in step 4, H′ is returned.

As an example, consider the histogram H in Figure 1a. The set of sensitive locations L′ contains the locations g and h with counts 8 and 3, respectively (hence K = 11), and the quality distance function dq is JS-divergence. In step 1, the algorithm constructs the search space graph in Figure 5. The graph has n − |L′| + 1 = 7 layers of nodes, where n = 8 is the length of H and |L′| = 2 is the number of sensitive locations.
The time complexity of the LHO algorithm is O((n − |L′|) · K² + S), where (n − |L′|) · K² is the cost of constructing G_LHO (step 1) and S is the cost of finding the shortest path (step 2). Constructing G_LHO takes O((n − |L′|) · K²) time, because G_LHO contains O((n − |L′|) · K) nodes and O(K + 1 + (n − |L′| − 1) · K²) = O((n − |L′|) · K²) edges, and the computation of each edge weight E_{i+1,k} takes O(1) time, because it is computed by accessing a single pair of bins from H and H′. The cost S is determined by the shortest-path algorithm. For example, it is O((n − |L′|) · K² · log((n − |L′|) · K²)) for Dijkstra's algorithm with a binary heap [58].
Last, we note how the variation SLHr of the SLH problem (see Section 3.2) can be optimally solved with LHO. This is possible by simply setting the parameter K in LHO to r (i.e., constructing a search space graph whose layers 1, ..., n − |L′| comprise r + 1 nodes each, and then finding a shortest path from (0, 0) to (n − |L′|, r) and the histogram H′ corresponding to the path).
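Because G_LHO is a layered DAG, the shortest path can be computed layer by layer. The following sketch implements that dynamic program directly (it is not the paper's pseudocode): `q(old, new)` stands in for the per-bin term of any decomposable dq, and the usage example below uses a squared per-bin difference for simplicity:

```python
def lho(H, sensitive, q):
    """Redistribute the K sensitive counts over the nonsensitive bins,
    minimizing sum of q(H[i], H'[i]) over bins; equivalent to a shortest
    path in the layered DAG G_LHO described in the text."""
    nonsensitive = [i for i in range(len(H)) if i not in sensitive]
    K = sum(H[i] for i in sensitive)
    INF = float("inf")
    best = [0.0] + [INF] * K   # best[j]: min cost of placing j counts so far
    choice = []                # choice[layer][j]: counts added at this layer
    for i in nonsensitive:
        new_best = [INF] * (K + 1)
        ch = [0] * (K + 1)
        for j in range(K + 1):
            for k in range(j + 1):        # add k counts to bin i
                cand = best[j - k] + q(H[i], H[i] + k)
                if cand < new_best[j]:
                    new_best[j], ch[j] = cand, k
        best = new_best
        choice.append(ch)
    out = list(H)
    for i in sensitive:
        out[i] = 0                        # sensitive bins get zero counts
    j = K                                 # backtrack the optimal allocation
    for i, ch in zip(reversed(nonsensitive), reversed(choice)):
        out[i] += ch[j]
        j -= ch[j]
    return out
```

For example, `lho([1, 1, 8], {2}, lambda a, b: (b - a) ** 2)` splits the 8 sensitive counts evenly, returning a histogram of the same size with the sensitive bin zeroed.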

Optimal algorithms for Target Resemblance and Target Avoidance
Although the two problems of Target Resemblance (TR) and Target Avoidance (TA) are different in their privacy motivation, they are mathematically very similar (see Sections 3.3 and 3.4). In this section, we model and solve the TR problem as a constrained shortest-path problem on a specially constructed search space graph G_TR. It follows immediately that the TA problem can be seen as a longest-path problem on the same graph. Because the graph is a directed acyclic graph (DAG), computing longest and shortest paths has the same complexity [58]: by visiting the graph nodes in breadth-first-search order, we can simply keep track of the shortest (or longest) path to each node. We can even solve the two problems in one pass. To keep the presentation simple, we focus on the TR problem, which we solve optimally by the Resemblance Optimal (RO) algorithm.
In the following, we discuss the construction of G_TR and then provide the pseudocode of RO.

Fig. 6: Search space graph G_TR for the Target Resemblance problem. Layer 0 is an auxiliary layer that just contains the node (0, 0). Layer i = 1, ..., n corresponds to bin i of the sanitized histogram, and node (i, j) corresponds to allocating j ∈ [0, N] counts to bins 1 up to and including i. A path from (0, 0) to (n, N) completely defines an allocation of N counts to n bins. The weight of the edge from (i, j) to (i + 1, j + k) is the pair of privacy and quality errors of allocating exactly k counts to bin i + 1. As these errors are additive, the admissible paths are those whose total q-length is at most the threshold ε. Among them, the p-shortest path from (0, 0) to (n, N) corresponds to the optimal solution to TR, because it also has q-length at most ε.

From the histogram H and the distance functions dp and dq, we construct a multipartite DAG G_TR = (V, E), as follows (see also Figure 6):
- There are n · (N + 1) + 1 nodes in V, where n and N are the length and the size of the histogram H, respectively.
- The nodes are arranged in layers 0, 1, ..., n, with layer 0 having a single node and layers 1, ..., n having N + 1 nodes each. Layer i ∈ [1, n] corresponds to bin i (location L_i) in the histogram. Node j ∈ [0, N] in layer i corresponds to the allocation of a total of j frequency counts to histogram bins 1 up to and including i.
- The single node in layer 0 is labeled (0, 0), and each node j in every other layer i is labeled (i, j). We may refer to nodes of G_TR using their labels.
- The edges in E go from each node (i, j) to each node (i + 1, j + k), k ≥ 0, j + k ≤ N, i.e., to each node in the following layer that has a frequency count at least equal to j.
- The weight of an edge from (i, j) to (i + 1, j + k) is the pair (perr, qerr)_{i+1,k} of the privacy and quality errors of allocating exactly k counts to bin i + 1 of the sanitized histogram: perr = p(k, H_T[i + 1]) and qerr = q(H[i + 1], k). The p-length of a path is the sum of its perr weights. We will refer to the path with the minimum p-length as the p-shortest path. Similarly, the q-length of a path is the sum of its qerr weights.
At this point, note two important differences between the edge weights of G_TR and G_LHO (Section 4.1): First, and most obvious, the edge weights in G_TR are pairs of (privacy error, quality error), whereas in G_LHO the weights are quality errors only. Second, in G_TR the weight of the edge from (i, j) to (i + 1, j + k) corresponds to setting H′[i + 1] exactly equal to k, whereas in G_LHO that edge weight would correspond to setting H′[i + 1] equal to H[i + 1] + k.
From the construction of G_TR, it follows that there is a one-to-one correspondence between a sanitized histogram H′ ∈ H_{n,N} and a path from (0, 0) to (n, N) in G_TR. Therefore, to solve the TR problem, we need to find the path from (0, 0) to (n, N) with minimum p-length among the paths whose q-length is at most ε. Then, it is straightforward to construct the histogram from the path.
We now provide the pseudocode of the RO algorithm. We assume that the preprocessing needed to construct H_T from h_T is done before the actual algorithm runs, and also that H and H_T have been expanded to be defined on the same set of locations, if needed (see Section 3.3 for details on h_T). Also, for the moment, we assume that dq takes nonnegative integer values. In step 10 of the pseudocode, the shortest path from node (0, 0) to node (n, N) in G_TR is computed; its p-length is equal to the minimum element of V_v for node v = (n, N).
In step 1, RO constructs the graph G_TR. In steps 2 to 6, the algorithm iterates over each node v of the graph and associates with it a vector V_v, indexed by all possible values of q-length. The elements of V_v are initialized to 0 for node (0, 0), and to ∞ for any other node of G_TR. Next, in steps 7 to 9, RO iterates over the nodes of G_TR in increasing lexicographic order, starting from node (1, 0), and for each node v it updates all the elements of V_v. Each element V_v[k] is updated using the following dynamic programming equation: V_v[k] = min over all incoming edges (u, v) of (V_u[k − qerr_{(u,v)}] + perr_{(u,v)}), where (perr_{(u,v)}, qerr_{(u,v)}) is the weight of edge (u, v). The element V_v[k] is equal to the p-length of the p-shortest path from (0, 0) to node v with q-length exactly equal to k. Thus, as explained above, each such path with k ≤ ε is a feasible solution to the TR problem, and so the optimal solution to TR is the p-shortest path from (0, 0) to (n, N) among them (i.e., the path corresponding to the minimum element of the vector V_{(n,N)} over indices at most ε). The nodes of this path are found in step 10 and its corresponding histogram H′ is constructed in step 11.

We now consider the general case in which the values of q-length of a path from (0, 0) to (n, N) are not necessarily integer. We first show that the q-length of this path is polynomial in N, in Theorem 4.1 below. Then, we show that the number of distinct values of q-length over all paths is polynomial in N, which implies that these values are not too many to store in the vectors V_u.

As an example, consider the histogram H in Figure 1a, the target histogram in Figure 1c, JS-divergence as the quality distance function dq and the privacy distance function dp, and ε = 0.05. In step 1, the algorithm constructs the search space graph in Figure 7. The graph has n + 1 = 9 layers of nodes, where n = 8 is the length of H. Layer 0 contains the node (0, 0) and each other layer contains N + 1 = 51 nodes, where N = 50 is the size of H. Each node in layers 1, ..., 8 is labeled (i, j); i ∈ [1, 8] denotes the layer of the node and corresponds to bin i, while j ∈ [0, 50] denotes the counts allocated to bins 1, ..., i.
For example, the node (8, 50) denotes that all 50 counts of H are allocated to bins 1, ..., 8. In addition, there is an edge from each node (i, j) to every node (i + 1, j + k), for each k ∈ [0, 50 − j]. The edge weight is a pair (perr, qerr), where the privacy error perr (respectively, quality error qerr) quantifies the error with respect to JS-divergence that is incurred by allocating k counts to bin i + 1 of the sanitized histogram (see Figure 7). For example, the node (0, 0) is connected to the nodes (1, 0), ..., (1, 50), and the edge ((0, 0), (1, 10)) has (perr, qerr)_{1,10} = (0, 3.8 · 10⁻³), incurred by allocating 10 counts to the first bin. In steps 2 to 9, RO computes the vector V_u for each node u. In step 10, the algorithm finds the shortest path from node (0, 0) to node (8, 50) with q-length at most ε (see Figure 7), and in step 11 it constructs the sanitized histogram H′ = (10, 6, 5, 2, 14, 5, 5, 3) that corresponds to the shortest path (see Figure 1d). Note that j in the label (i, j) of each node in the shortest path corresponds to the counts that are allocated to bins 1, ..., i in H′. Last, in step 12, H′ is returned.

The time complexity of the RO algorithm is O((n · N)² · C(N + n − 1, n − 1)), where C(·, ·) denotes the binomial coefficient. The total cost is the sum of the cost of constructing G_TR and of finding the constrained shortest path from (0, 0) to (n, N).
The construction of G_TR takes O(n · N²) time. This is because the algorithm constructs O(n · (N + 1) + 1) = O(n · N) nodes, each of which has O(N) outgoing edges, for a total of O(n · N²) edges. Note also that the computation of each edge weight takes O(1) time. The cost of computing the shortest path is O((n · N)² · C(N + n − 1, n − 1)). This is because it requires (I) constructing a vector V_v with O(n · C(N + n − 1, n − 1)) entries, for each of the O(n · N) nodes of G_TR, which takes O(n² · N · C(N + n − 1, n − 1)) time, and (II) updating each entry of V_v once, which takes O(N) time per entry since there are O(N) incoming edges to each node (see Eq. 4.1), for a total of O((n · N)² · C(N + n − 1, n − 1)) across all nodes.

Last, we note how the variation TR_{|H_T|_1} of the TR problem (see Section 3.3) can be optimally solved with RO. This is possible by simply using RO to allocate |H_T|_1 counts instead (i.e., construct a search space graph whose layers 1, ..., n comprise |H_T|_1 + 1 nodes each, and then find the shortest path from (0, 0) to (n, |H_T|_1) and the sanitized histogram H′ corresponding to the path).
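The constrained shortest-path computation can be sketched as a layered dynamic program. The sketch below assumes integer-valued per-bin quality costs (the paper's general case instead enumerates the polynomially many attainable real-valued q-lengths); `p`, `q`, and `eps` play the roles of the per-bin privacy/quality terms and the threshold ε, and the state tables mirror the vectors V_v:

```python
def ro(H, HT, p, q, eps):
    """Target Resemblance via DP: find H' with the same length and size
    as H minimizing the summed privacy cost p(k, HT[i]), subject to the
    summed quality cost q(H[i], k) being at most eps (integer costs)."""
    n, N = len(H), sum(H)
    INF = float("inf")
    # V[j][e]: min privacy cost of allocating j counts to the bins
    # processed so far, with total quality cost exactly e
    V = [[INF] * (eps + 1) for _ in range(N + 1)]
    V[0][0] = 0.0
    parent = {}
    for i in range(n):
        NV = [[INF] * (eps + 1) for _ in range(N + 1)]
        for j in range(N + 1):
            for e in range(eps + 1):
                if V[j][e] == INF:
                    continue
                for k in range(N - j + 1):     # put k counts in bin i
                    qe = e + q(H[i], k)
                    if qe > eps:
                        continue               # quality budget exceeded
                    cand = V[j][e] + p(k, HT[i])
                    if cand < NV[j + k][qe]:
                        NV[j + k][qe] = cand
                        parent[(i, j + k, qe)] = (j, e, k)
        V = NV
    # pick the feasible end state with minimum privacy cost
    best_e = min(range(eps + 1), key=lambda e: V[N][e])
    if V[N][best_e] == INF:
        return None    # no histogram satisfies the quality constraint
    out = [0] * n      # backtrack the allocation
    j, e = N, best_e
    for i in range(n - 1, -1, -1):
        j, e, out[i] = parent[(i, j, e)][0], parent[(i, j, e)][1], parent[(i, j, e)][2]
    return out
```

With absolute differences as both per-bin costs, a loose budget lets the output match the target exactly, while a budget of 0 forces the output to equal H.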

Heuristics for Target Resemblance and Target Avoidance
Our heuristics, RH for the Target Resemblance problem and AH for the Target Avoidance problem, work in a greedy fashion to avoid the cost of constructing and searching the search space graph. We first discuss the RH heuristic. The main idea in RH is to greedily reduce the differences in the counts of corresponding bins between H′ and H_T. As can be seen in the pseudocode (steps 1 and 2), RH identifies source bins, i.e., bins with more counts in H than in H_T, and destination bins, i.e., bins with fewer counts in H than in H_T. Bins with equal counts in H and H_T are ignored. Then, in steps 3 and 4, the sanitized histogram H′ is initialized to the original histogram H and the remaining quality budget rem to the quality threshold ε. In steps 5 and 6, RH moves some counts from a source bin to a destination bin using a function BestMove.
As can be seen in the pseudocode of BestMove (steps 4 to 6), the function performs an exhaustive search of all possible ways ("moves") to move k counts from a source bin i to a destination bin j. For each move, BestMove computes the privacy effect Δdp and the quality effect Δdq (steps 10 and 11), and it selects the move that maximizes the ratio Δdp/Δdq, subject to the constraint that Δdq cannot exceed the remaining quality budget rem (steps 12 to 15)⁴. The rationale is to prioritize moves with a large improvement in privacy Δdp and a small reduction in quality Δdq.
Next, in step 7, RH checks whether the remaining budget is exhausted. If it is, no more moves are performed (step 8). Otherwise, in steps 9 to 11, RH reduces the quality budget by OptΔdq (i.e., by the quality effect of the best move), and updates the sets of source and destination bins by no longer considering as source or destination bins any bins whose count has become equal to that of the corresponding bin in H_T. Moves continue until the budget is exhausted or there are no more source/destination bins. Since moves cannot increase the count of a source bin nor increase the remaining quality budget, RH always terminates.
The main idea in AH is very similar. The difference is in the definition of source and destination bins: AH aims to make H′ dissimilar to H_T, and so it tries to increase the differences in counts between bins in H′ and corresponding bins in H_T. Hence, source bins (from which counts will be taken) are bins that have a shortage of counts (fewer counts in H than in H_T), whereas destination bins are the ones with a surplus of counts. Unlike in RH, bins with equal counts in H and H_T are source bins as well as destination bins (see steps 1 and 2 in the pseudocode of AH, in which source and destination bins are initialized, and steps 10 and 11 in which they are updated). AH then proceeds as RH, only making sure never to use the same bin as both source and destination.
The time complexity of RH and AH is O(n^3 · N). This is because the loop in step 5 runs O(n) times (once per source bin), and each iteration incurs a cost of O(n^2 · N) for BestMove: there are O(n^2) source/destination bin pairs, and for each pair O(N) temporary moves are performed. This analysis refers to the worst case; in practice, a histogram can often be sanitized with a smaller number of moves (i.e., executions of BestMove). Last, we note that the variation TR_{|H'|_1} of the TR problem (see Section 3.3) can be directly handled by the RH heuristic. This is because RH poses no restriction on the size of the sanitized histogram H', so H' can have a different size than that of H. Similarly, the variation TA_{|H'|_1} of the TA problem (see Section 3.4) can be directly handled by the AH heuristic.

Evaluation
In this section, we evaluate our approach in terms of effectiveness and efficiency. We do not compare against existing histogram sanitization methods, because they cannot be used to solve the Sensitive Location Hiding or the Target Avoidance/Resemblance problem (see Related Work in Section 6.2).

Setup and datasets
To calculate the loss in quality (utility) incurred by replacing the original histogram H with the sanitized histogram H', we compute the distance dq(H, H'), where dq is either the Jensen-Shannon (JS) divergence (Section 3.1.1) or the L2 distance. Our algorithms can optimize either of these measures. However, we present results for optimizing JS divergence because, as we show, the two measures lead to qualitatively similar results.
In addition, we measure how well sanitization preserves the quality of two applications: (I) (histogram) clustering, and (II) location recommendation. Clustering partitions the elements (frequencies) of a histogram into clusters, so that each cluster contains similar frequencies. Clustering is typically applied to visualize (long) histograms, or to segment locations into pre-defined categories based on frequency of visits. For example, clustering a histogram H = (1, 2, 3, 98, 99, 100) may result in two clusters, (1, 2, 3) and (98, 99, 100), corresponding to "rarely-visited" and "frequently-visited" locations, respectively. Clustering quality is measured using Normalized Conditional Entropy (NCE), a normalized version of the well-known Conditional Entropy cluster-quality measure [81]. We first provide the definition of Conditional Entropy and then that of NCE. Let C (respectively, C') be a partition of the elements of H (respectively, H') into k mutually disjoint and nonempty clusters. The partitions C and C' will also be referred to as clusterings. The Conditional Entropy H(C'|C) = Σ_{c'∈C', c∈C} Pr(c, c') · ln(Pr(c) / Pr(c, c')) is the entropy of C' conditioned on C. Pr(c) = |c|/N is the probability that a randomly selected element of H is contained in the cluster c ∈ C, Pr(c') = |c'|/N is the probability that a randomly selected element of H' is contained in the cluster c' ∈ C', and Pr(c', c) = |c' ∩ c|/N is the probability that a randomly selected element is contained in c' ∩ c. Intuitively, H(C'|C) quantifies the amount of information needed to describe C' when C is known. The Normalized Conditional Entropy [81], NCE(C'|C), is bounded in [0, 1], with 0 implying that the two clusterings are the same and 1 that they are independent. We apply the optimal dynamic-programming algorithm of Wang and Song [69] to produce C and C' with k = 3. Results with different k are similar (omitted).
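Both entropy quantities can be computed directly from cluster labels. In the sketch below, the normalization used for NCE (dividing H(C'|C) by H(C')) is our assumption; it matches the stated bounds, yielding 0 for identical clusterings and 1 for independent ones, but [81] should be consulted for the exact definition.

```python
import math
from collections import Counter

def conditional_entropy(C, Cp):
    """H(C'|C) for two clusterings given as lists of cluster labels,
    one label per histogram element:
    sum over (c, c') of Pr(c, c') * ln(Pr(c) / Pr(c, c'))."""
    N = len(C)
    joint = Counter(zip(C, Cp))  # co-occurrence counts |c ∩ c'|
    pc = Counter(C)              # cluster sizes |c|
    return sum((n / N) * math.log((pc[c] / N) / (n / N))
               for (c, cp), n in joint.items())

def nce(C, Cp):
    """Normalized Conditional Entropy (assumed normalization: divide by
    the entropy H(C'), so identical clusterings give 0, independent give 1)."""
    N = len(Cp)
    h_cp = -sum((n / N) * math.log(n / N) for n in Counter(Cp).values())
    return 0.0 if h_cp == 0 else conditional_entropy(C, Cp) / h_cp
```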
Given a histogram H and an integer k, the algorithm produces a clustering C that minimizes Σ_{c∈C} Σ_{H[i]∈c} (L2(H[i], c̄))^2, where c is a cluster, c̄ is the mean of the elements in the cluster, and L2 is the L2 distance.
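This objective can be minimized exactly with dynamic programming over the sorted elements, in the spirit of Wang and Song's algorithm. The O(k·n^2) sketch below is a simplified illustration (the original work is more refined); the function and variable names are ours.

```python
def optimal_1d_clustering(values, k):
    """Optimal 1-D clustering under the within-cluster sum-of-squares
    objective, by dynamic programming over the sorted values."""
    xs = sorted(values)
    n = len(xs)
    # prefix sums give O(1) within-cluster cost of any run xs[a..b]
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def cost(a, b):  # sum of squared deviations of xs[a..b] (inclusive)
        s = ps[b + 1] - ps[a]
        s2 = ps2[b + 1] - ps2[a]
        return s2 - s * s / (b - a + 1)

    INF = float('inf')
    D = [[INF] * (k + 1) for _ in range(n + 1)]   # D[i][j]: first i elems, j clusters
    back = [[0] * (k + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            for t in range(j - 1, i):             # last cluster is xs[t..i-1]
                c = D[t][j - 1] + cost(t, i - 1)
                if c < D[i][j]:
                    D[i][j] = c
                    back[i][j] = t
    # recover the clusters by walking back through the split points
    clusters, i, j = [], n, k
    while j > 0:
        t = back[i][j]
        clusters.append(xs[t:i])
        i, j = t, j - 1
    return clusters[::-1], D[n][k]
```

On the example histogram above, (1, 2, 3, 98, 99, 100) with k = 2, the DP recovers exactly the two clusters (1, 2, 3) and (98, 99, 100).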
Location recommendation suggests to a user, referred to as the active user and denoted with α, a location that might interest them. A popular location recommendation approach is user-based Collaborative Filtering (CF), which works as follows (see [44] for details): (I) It finds a set Uα,k = {u_1, ..., u_k} of the k users who are the most similar to the active user α with respect to their histograms. The similarity between a user u and the active user α is measured by the Pearson correlation coefficient, which is defined as PCC(u, α) = Σ_{Li∈Lu,α} (fu(Li) − µu) · (fα(Li) − µα) / (√(Σ_{Li∈Lu,α} (fu(Li) − µu)^2) · √(Σ_{Li∈Lu,α} (fα(Li) − µα)^2)), where Lu,α is the set of locations in the histograms of both u and α, fu(Li) (respectively, fα(Li)) is the number of times the user u (respectively, α) visited location Li (i.e., the count of the location Li in the histogram of a user), and µu (respectively, µα) is the average number of times u (respectively, α) visited a location. (II) For the active user α and each location Li, it computes a recommendation score (predicted frequency of visiting Li), defined as rα(Li) = µα + Σ_{u∈Uα,k} (fu(Li) − µu) · PCC(u, α) / Σ_{u∈Uα,k} PCC(u, α).
(III) It recommends to the active user the location with the maximum recommendation score. In our experiments, we use the aforementioned user-based CF method with k = 25. The recommendation error [44] for an active user α and a location Lα_test is defined as the difference between fα(Lα_test), the user's true frequency of visits to Lα_test, and rα(Lα_test), the frequency of visits as predicted by the recommendation algorithm based on the given dataset. Below we use both the absolute error |fα(Lα_test) − rα(Lα_test)| and the square error (fα(Lα_test) − rα(Lα_test))^2. To capture the impact of the sanitization algorithm on recommendation quality, we compute the above-defined recommendation error in two ways: first based on the dataset of original user histograms, and then based on the dataset of sanitized histograms. Clearly, the impact of sanitization is small when the average recommendation error for the dataset of original user histograms is similar to that for the dataset of sanitized histograms. In particular, we perform the following steps on each of the two datasets (we use the absolute difference as the recommendation error, but the same steps apply for the squared difference): (I) We randomly partition the dataset into two subsets, a training set D_train with 90% of the histograms and a test set D_test with 10% of the histograms. (II) For an active user α in the test set D_test, we randomly select a location Lα_test in the histogram of α. We compute the similarities PCC(u, α) between α and all users u in the training set D_train (see step (I) of the Collaborative Filtering algorithm). In the computation of PCC, we exclude Lα_test from the set of locations Lu,α that are in the histograms of both u and α. Then, we compute the recommendation score rα(Lα_test), and finally we compute the error |fα(Lα_test) − rα(Lα_test)|.
(III) We compute the absolute recommendation error considering each user in the test dataset D_test as the active user α, and then we average the errors to obtain the Mean Absolute Error (MAE). For the square error, the average we compute is the Root Mean Square Error (RMSE).

All algorithms are implemented in Python and applied to the New York City (NYC) and Tokyo (TKY) datasets. The datasets were downloaded from [77] and include long-term check-in data in New York City and Tokyo, collected from Foursquare from 12 April 2012 to 16 February 2013. The datasets have been used in several prior works [49,78,79]. Each record in the datasets contains a location that was visited by a user at a certain time and corresponds to a leaf in the Foursquare taxonomy (available at https://developer.foursquare.com/docs/resources/categories). There are in total 713 locations in the taxonomy, and on average each user visits fewer than 41 locations. For each dataset, we produce the input histograms for our algorithms by constructing one histogram H per user. The histogram H contains a count f(Li) > 0 for every location Li visited by the user. That is, H is constructed based on the user's values (location visits), which is in line with histogram sanitization methods [2,18,31,34,56,76,84]. Table 2 shows the characteristics of NYC and TKY (see also Table 3).

We also construct synthetic histograms containing some bins whose frequency is equal to zero. These bins correspond to locations that are not visited by a user but should be considered in sanitization, e.g., to allow the sanitization algorithm to redistribute frequency counts to locations that are not visited by the user. The synthetic histograms are constructed by appending zeros to a histogram of length n = 78 and size N = 192 in NYC and to a histogram of n = 99 and N = 642 in TKY, and their length is up to 400, including the zero-frequency bins. We use the synthetic histograms to test the impact of length on the runtime performance of our methods.
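The CF scheme of steps (I)-(III) above can be sketched as follows. The dict-based histogram representation and the helper names (`pcc`, `recommend`) are our assumptions; in particular, we compute each user's mean µ over all locations in their own histogram, which is one possible reading of the definition above.

```python
import math

def pcc(hu, ha):
    """Pearson correlation between two users' histograms
    (dicts: location -> count), computed over their common locations."""
    common = set(hu) & set(ha)
    if not common:
        return 0.0
    mu_u = sum(hu.values()) / len(hu)  # assumed: mean over the user's own histogram
    mu_a = sum(ha.values()) / len(ha)
    num = sum((hu[l] - mu_u) * (ha[l] - mu_a) for l in common)
    du = math.sqrt(sum((hu[l] - mu_u) ** 2 for l in common))
    da = math.sqrt(sum((ha[l] - mu_a) ** 2 for l in common))
    return 0.0 if du == 0 or da == 0 else num / (du * da)

def recommend(ha, others, k=25):
    """User-based CF: find the k most similar users, score each unvisited
    location by the PCC-weighted deviations, return the top-scored location."""
    sims = sorted(((pcc(hu, ha), hu) for hu in others),
                  key=lambda t: -t[0])[:k]
    mu_a = sum(ha.values()) / len(ha)
    candidates = {l for _, hu in sims for l in hu} - set(ha)

    def score(l):
        num = sum(s * (hu[l] - sum(hu.values()) / len(hu))
                  for s, hu in sims if l in hu)
        den = sum(s for s, hu in sims if l in hu)
        return mu_a + (num / den if den else 0.0)

    return max(candidates, key=score) if candidates else None
```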
In total, we test the algorithms on approximately 3400 different histograms. For all experiments we use an Intel Xeon at 2.60GHz with 256GB of RAM.

Evaluation of the LHO algorithm
We evaluate the quality and runtime performance of the LHO algorithm as a function of (I) n, the length of the original histogram, (II) K, the total frequency of sensitive locations, and (III) |L_S|, the number of sensitive locations. We consider JS-divergence, L2 distance, and NCE as measures of quality loss dq(). The results for the L2 distance are in Appendix A.5, because they are similar to those for JS-divergence. Unless otherwise stated, the set of sensitive locations L_S is constructed by selecting 5 sensitive locations uniformly at random.

Impact of length n We observe that JS-divergence decreases with n. This is because there are more bins whose counts may increase: the space considered by LHO is larger and the change can be "smoothed" over more bins. In addition, the JS-divergence scores are relatively low (recall that JS-divergence takes values in [0, 1]). This suggests that sanitization preserves the distribution of nonsensitive locations fairly well. We also show that NCE decreases with n, in Figure 9. This is because there are more bins whose frequency does not change, or changes so slightly that they are not moved into a different cluster. In addition, the NCE scores are relatively low (recall that NCE takes values in [0, 1]). This suggests that the clustering quality is preserved fairly well after sanitization.

Impact of total frequency of sensitive locations K We show that JS-divergence increases with K in Figure 10 (the y-axes are in logarithmic scale). This is because there are more counts that need to be redistributed into the bins of nonsensitive locations, and this incurs a larger amount of distortion. In addition, the JS-divergence scores are relatively low. This suggests that the distribution of nonsensitive locations is preserved fairly well after sanitization. We also show that NCE increases with K, in Figure 11. This is because a larger K incurs larger changes to the frequency of the nonsensitive locations, which negatively impact the quality of clustering.
The NCE scores are relatively low (on average 0.15 and 0.25 for NYC and TKY, respectively). This suggests that the clustering quality is preserved fairly well. The high scores for K ≥ 60 in the case of TKY are obtained because the histograms have small total frequency (i.e., K corresponds to 46% of the total frequency of all locations on average). In this case, the impact of sanitization on the histogram is inevitably large.

Impact of number of sensitive locations |L_S| We also show that NCE increases with |L_S|, in Figures 12c and 12d. This is because a larger |L_S| causes larger changes to the frequency of the nonsensitive locations, which negatively impact the quality of clustering. However, the NCE scores are relatively low, which suggests that the clustering quality is preserved fairly well. Specifically, sanitization does not affect the clustering quality at all (i.e., NCE = 0) for approximately 70% of histograms when |L_S| = 1, and for approximately 20% when |L_S| = 10. The median NCE is 0, 0, 0.14, and 0.3 for |L_S| equal to 1, 2, 5, and 10, respectively.
Recommendation quality We investigate the impact of sanitization on recommendation quality by using test datasets of original histograms vs. test datasets of histograms that are sanitized with different values of |L_S|. Figures 13a and 13b show that MAE and RMSE are not substantially affected by sanitization, for all tested |L_S| values. The change in MAE and RMSE is on average 0.1% and 2.4%, respectively. This suggests that recommendation quality is preserved fairly well.

Runtime performance for the LHO algorithm
We evaluate the runtime performance of LHO as a function of (I) n, the histogram length, (II) K, the total frequency of sensitive locations, and (III) |L_S|, the number of sensitive locations. To isolate the effect of each parameter, we vary just one and keep the other two fixed. We then examine the joint impact of all three parameters, which is given by the time complexity formula O((n − |L_S|) · K^2 · log((n − |L_S|) · K^2)), because we used Dijkstra's algorithm with a binary heap to find shortest paths (see Section 4.1). For brevity, we use λ to denote (n − |L_S|) · K^2 · log((n − |L_S|) · K^2). Thus, we expect the runtime to be linear in λ.

Impact of length n We show that runtime increases with n, in Figures 14a and 14b. This is because, when n is larger, there are more bins into which the counts may be redistributed. More bins means that the multipartite graph G_TR, created by the LHO algorithm, has more layers (and consequently more nodes and edges). Note also that runtime increases linearly with n (i.e., the linear regression models in Figures 14a and 14b are a good fit), as expected by the time complexity analysis (see Section 4.1), and that the algorithm took less than 3 seconds. We also show that runtime increases linearly with n when the algorithm is applied to the synthetic histograms, which are more demanding to sanitize (see Figures 14c and 14d).
Impact of total frequency of sensitive locations K We show that runtime increases with K, in Figures 15a and 15b. This is because more counts are redistributed into the bins of nonsensitive locations when K is larger; that is, the graph G_TR contains more edges and nodes. Note also that the runtime increases approximately quadratically with K (i.e., the quadratic regression models in Figures 15a and 15b are a good fit), as expected by the time complexity analysis (see Section 4.1), and that the algorithm took less than 100 seconds.

Impact of number of sensitive locations |L_S| We show that runtime increases with |L_S|, in Figures 16a and 16b, which report results for each histogram in NYC and TKY, respectively. This is because there are (I) more counts that need to be redistributed into the bins of the nonsensitive locations, and (II) fewer bins to which the counts may be redistributed, and, as demonstrated above, the impact of more counts on runtime is larger than that of fewer bins (quadratic increase vs. linear decrease). For example, 95% of the histograms in the NYC dataset take less than 1 second to sanitize when |L_S| = 1, but the corresponding percentage is 25% when |L_S| = 10. However, the algorithm remains relatively efficient even for |L_S| = 10, with 99% of the histograms in NYC requiring less than 5 minutes to be sanitized.
Joint impact of n, K, |L_S| In Figures 16c and 16d, we report results for all histograms in NYC and TKY, respectively. Note that runtime increases linearly with λ = (n − |L_S|) · K^2 · log((n − |L_S|) · K^2) (i.e., the linear regression models are a good fit). This is in line with the time complexity analysis (see Section 4.1).

Target Resemblance
We evaluate the quality and runtime of RO and RH as a function of (I) n, the length of the original histogram, (II) N, the size (i.e., total frequency of locations) of the histogram, and (III) ε, the quality threshold. We additionally examine the impact of the target histogram. To measure quality, we use JS-divergence, L2 distance, and NCE. The results for the L2 distance are in Appendix A.6, because they are similar to those for JS-divergence. Unless stated otherwise, the target histogram for a histogram H is a "uniform" histogram H_T, such that H_T has the same size N and length n as H, and each count of H_T is approximately equal to N/n. Aiming to resemble a uniform histogram indicates a user with strong privacy requirements, since the uniform distribution has the maximum entropy (i.e., it provides the least information about the frequencies in H to an attacker with no knowledge except N and n). Moreover, uniform target histograms are difficult to resemble, because the original histograms typically follow skewed distributions.
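Such a uniform target can be built as follows; how the remainder N mod n is spread across bins is an assumption (the text only requires each count to be approximately N/n).

```python
def uniform_target(H):
    """Build a 'uniform' target histogram with the same length n and
    size N as H: each bin gets floor(N/n) counts, and the remainder is
    spread one count at a time so the bins sum to exactly N."""
    n, N = len(H), sum(H)
    base, r = divmod(N, n)
    return [base + (1 if i < r else 0) for i in range(n)]
```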

Quality and Privacy for the RO algorithm and the RH heuristic
Impact of length n To illustrate the impact of n on quality and privacy, we present results obtained for randomly selected histograms of varying n and N = 100. We do not report the median of all histograms of certain n, because the results followed skewed distributions (e.g., the runtime for histograms with n = 26 and N = 100 varied from 2.5 to 45 seconds).
We show that the privacy measure dp (JS-divergence) decreases with n, in Figures 17a and 17b. This is because the larger number of bins gives more choices to the algorithm to reduce dp without substantially increasing dq. In addition, Figures 17a and 17b show that RO and RH achieve very similar results: the dp values for RH were no more than 1% and 2.1% higher for NYC and TKY, respectively. This suggests that RH is an effective heuristic. We also show that the quality measure dq (JS-divergence) is not affected by n and, as expected, does not exceed the threshold ε, in Figures 17c and 17d. RO finds solutions with larger dq than RH. This is because RH works in a greedy fashion: the initial bins are sanitized heavily, which increases dq and does not leave much room for sanitizing the subsequent bins without exceeding ε.
In addition, we show the impact of n on NCE, in Figures 18a and 18b. Both algorithms achieve similar scores, because they aim to optimize dp under the constraint dq ≤ ε and achieve similar results with respect to dq (see Figures 17c and 17d). The scores were zero (no quality loss) or low, which suggests that both RO and RH are able to preserve clustering quality.

Impact of size N To illustrate the impact of N on quality and privacy, we present results obtained for randomly selected histograms of varying N with n = 25 for NYC. The results for TKY are qualitatively similar (omitted). We do not report the median of all histograms of a given N, because the results follow skewed distributions.
We show that the privacy measure dp (JS-divergence) increases with N, in Figure 18c. This is because there are more counts that need to change (increase or decrease) to minimize dp subject to dq ≤ ε. The results for RH are very close to those for RO; the dp scores for RH are no more than 1.8% larger. This suggests that RH is an effective heuristic.
We also show that the quality measure dq (JS-divergence) is not affected by N and that it does not exceed the threshold ε, in Figure 18d. Again, RO finds solutions with larger dq than RH. This is because, due to its greedy nature, RH sanitizes heavily the first bins, which increases dq and prevents the sanitization of subsequent bins without exceeding ε. In addition, we show the impact of N on NCE, in Figure 19. Both algorithms achieve similar scores, because they aim to optimize dp under the constraint dq ≤ ε and achieve similar results with respect to dq (see Figure 18d). Their scores were low or zero. Thus, they are able to preserve clustering quality.
Impact of threshold ε To illustrate the impact of ε on quality and privacy, we present results obtained for a histogram with n = 40 and N = 100 in NYC. The results for TKY are similar (omitted).
We show that the privacy measure dp (JS-divergence) decreases with ε, in Figure 20a. This is because both RO and RH consider a larger space of possible solutions when ε is larger, and thus they are able to find a better solution with respect to dp. In addition, the results for RH and RO are very similar; the dp for RH is at most 2.4% (on average 0.5%) higher than that for RO. We also show that the quality measure dq (JS-divergence) for both RO and RH is close to ε, in Figure 20b. Again, the dq scores of RO are slightly larger than those of RH, because RH works in a greedy fashion (i.e., the first bins are sanitized heavily, which increases dq and does not leave much room for sanitizing the subsequent bins without exceeding ε), as explained above.
In addition, we show that NCE increases with ε, in Figure 20c. This is because the algorithms trade off quality for privacy when ε is larger (i.e., dq can be as high as ε). The NCE scores for RO and RH are similar, with those for RH being lower (better) by 1.5% on average.
Recommendation quality We investigate the impact of sanitization on recommendation quality by using test datasets of original histograms vs. test datasets of histograms that were sanitized by applying RO or RH with different values of ε. Figures 21a and 21b show that MAE and RMSE are not substantially affected by sanitization, for all tested ε values. This suggests that recommendation quality is preserved fairly well. In some cases, the MAE and RMSE scores for the sanitized histograms were lower (better) than those for the original histograms. This is because the recommendation scores for these histograms approach their corresponding true location counts after sanitization.

Runtime performance for the RO algorithm and the RH heuristic
Impact of length n We show that the runtime of both RO and RH increases with n, in Figure 22a. This is because RO runs on a multipartite graph, G_TR, with more layers, and RH needs to consider more bins. RH is at least two orders of magnitude more efficient than RO. For example, RH requires 5.2 milliseconds to sanitize a histogram with n = 26, while RO requires 2.3 seconds. Note that RH scales close to linearly with n, which shows that the efficiency of RH is better than what is predicted by the worst-case time complexity analysis in Section 4.3.
To further investigate the impact of length on runtime, we apply RO and RH to each histogram with length larger than 75 in NYC and TKY (see Figures 22b and 22c). There are 17 and 7 such histograms in NYC and TKY, respectively. These histograms are generally demanding to sanitize, because they also have large size (up to 2061). Again, we observe that RH is more efficient than RO by at least two orders of magnitude; RH needs on average 8.2 milliseconds, while RO needs on average 3.6 seconds. In these experiments, we use ε = 10^-5. For larger ε values, the difference between the two algorithms increases, because RH scales better than RO with respect to ε, as explained below. Repeating the same experiment using the synthetic histograms (see Figures 22d and 23a), we find that both RO and RH scale well with n, and RH scales close to linearly with n.

Impact of size N We show that the runtime of both RO and RH increases with N, in Figure 23b. This is because the multipartite graph G_TR built by RO has more nodes (O(N · n)) and thus more paths, and RH needs to consider more "moves" from source to destination bins. Again, RH is at least two orders of magnitude more efficient than RO. For example, RH required at most 6.7 milliseconds to sanitize a histogram with N = 109, while RO required 5.6 seconds.

Impact of threshold ε We show that the runtime of both RO and RH increases with ε, in Figure 23c. This is because, when ε is larger, the multipartite graph built by RO has more edges and thus more paths, and RH considers more "moves" from source to destination bins. RH is at least two orders of magnitude more efficient than RO, and it scales better with ε, i.e., linearly versus quartically (proportionally to ε^4). This suggests that RH is a practical heuristic for large ε values, given that it produces solutions similar to those of RO.
Impact of target histogram H_T We show that the runtime of RH increases with the distance JS(H, H_T), for different target histograms H_T, in Figure 23d. This is because RH has more choices (i.e., there are more ways to transfer the counts of a source bin to a destination bin in H when H is further from H_T in terms of JS-divergence). In this experiment, we use ε = 0.5, because the runtimes with the default ε value are too small (a few milliseconds) to obtain a meaningful result. We do not report results for RO, because its runtime is not affected by the target histogram: RO builds the same multipartite graph for each target histogram H_T, since H_T has the same length n and size N as the original histogram.

Related Work
This paper is at the intersection of location privacy and histogram privacy, which are discussed in Sections 6.1 and 6.2, respectively. We also discuss privacy-preserving recommendation in Section 6.3, as a potential application of our methods.

Location privacy
Research on location privacy focuses on (I) location-based services (LBS), or (II) location data publishing.
Research on LBS is mostly inspired by applications running on GPS-enabled mobile devices such as smartphones and tablets, but also cars. Consequently, it addresses privacy for users who need to send data on the fly (as they move about) to a server that will provide them with some useful service (e.g., the location of the nearest restaurant). Privacy mechanisms in such scenarios need to make protection decisions on the fly, without knowing the future locations that the user will visit [9,20,61,62]. For example, [61] proposes a method for preventing the inference of locations that have been or will be visited by a user, based on what the user shares at any moment with a location-based service. Other recent research protects sensitive spatiotemporal location sequences [1]. As another example, [20] proposes a method that prevents an LBS server from aggregating the locations sent by a user into a histogram and then associating this histogram with the user. The method perturbs the user's locations one by one, before they are sent to the LBS server, by adding noise to them in order to enforce the privacy notion of geo-indistinguishability [11].
Research on location data publishing is inspired by the publication of large datasets, possibly as a database. Consequently, it addresses more static scenarios, in which the whole dataset to be protected is given to the protection algorithm as input [4, 12-14, 17, 54, 65, 66]. There are works showing the feasibility of attacks on pseudonymized data (i.e., data in which a user's identifying information is represented by a random id) [4], or on completely anonymized data (i.e., a sequence (e_1, ..., e_n), where the event e_i = (l, t), i ∈ [1, n], represents a visit to location l at time t and is not associated with a specific user) [66]. For example, reference [66] shows how an attacker can use completely anonymized data to associate a user with their event subsequence (path). There are also works [13,14,17,54,65] which propose methods for anonymizing user-specific location data (i.e., a dataset where each record corresponds to a different user and contains a sequence of locations visited by the user and/or the time that these visits occurred). For example, reference [65] proposes algorithms for preventing the inference of a user's sensitive locations by an attacker knowing a subsequence of the user's locations. The algorithms of [65] use suppression (deletion) of locations and splitting of user sequences into carefully selected subsequences.
Yet, no research in location privacy has aimed to protect histograms of locations. The object to be protected has been either a single location (in the LBS setting) or a (sub)sequence of locations (in the location data publishing setting). However, protecting single locations separately provides no guarantee about the effect on the histogram as a whole. It could happen that, e.g., each individual location is replaced with another location, so no single location is disclosed, but the histogram as a whole is very similar or even identical to the original one. Similarly, protecting location data could lead to the same problem: individual locations in a user's sequence may be modified, while the histogram remains unprotected. Thus, works on LBS or location data publishing cannot be used as alternatives to our approach.

Histogram privacy
Research on histogram privacy is inspired by applications where a histogram is published as a statistical summary (approximation) of the distribution of an attribute in a (relational) dataset. For example, consider a dataset where each record contains the zip-code of a different individual. The distribution of the zip-code attribute in the dataset can be represented with a histogram, where each bin is associated with a different zip-code value and the bin frequency (count) is the number of individuals in the dataset who live in that zip-code. Publishing such histograms is useful for performing count query answering and data mining tasks (e.g., clustering), but it may lead to the disclosure of sensitive information about individuals [2,18,31,34,56,76,84]. For instance, consider an adversary who knows the names of all three individuals, i_1, i_2, and i_3, in a (non-released) dataset but the zip-codes of only i_1 and i_2. When the published histogram contains the count of each zip-code in the dataset, the adversary can infer the zip-code of i_3 from the histogram. To prevent this type of disclosure, the frequencies in the histogram are perturbed, typically by noise addition, in order to satisfy differential privacy [19]. Informally, differential privacy ensures that the inferences that can be made by an adversary about an individual will be approximately independent of whether the individual's record is included in the dataset or not.
Several works have applied differential privacy to sanitize histograms [2,18,31,34,56,76,84]. A straightforward way to achieve this is by adding noise to the frequency of each bin of the histogram, according to the Laplace mechanism [19]. However, this procedure results in excessive utility loss [2]. Therefore, existing works [2,18,31,34,56,76,84] employ clustering to reduce the loss of utility, in three steps: (I) They cluster bins with similar frequencies together. (II) They apply the Laplace mechanism to the average (mean or median) of the frequencies in each cluster, to obtain a "noisy center" of the cluster. (III) They publish a histogram where each frequency bin in each cluster is replaced by the noisy center of its corresponding cluster. While clustering incurs some utility loss, it reduces the noise that is added by the Laplace mechanism, leading to better overall utility. Specifically, the works of [2,76] require each cluster to be formed of adjacent bins, while the work of [31] requires each cluster to have the same number of bins. Subsequent works [18,34,56,84] lift these restrictions to further improve utility. For example, reference [84] proposes a clustering framework, which can be instantiated by optimal or heuristic algorithms that trade off the utility loss incurred by clustering against the utility loss incurred by the Laplace mechanism.
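A minimal sketch of steps (I)-(III) follows, assuming the clusters are given as fixed boundaries of adjacent bins and the count sensitivity is 1; the cited methods instead choose the clustering so as to balance approximation error against noise, and the function names here are ours.

```python
import math
import random

def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def cluster_then_perturb(hist, boundaries, epsilon, sensitivity=1.0):
    """Steps I-III: group adjacent bins into the given clusters, replace
    each bin by its cluster mean, and add Laplace noise to each cluster
    center (illustrative sketch only)."""
    out = []
    for a, b in boundaries:  # each cluster covers hist[a:b]
        center = sum(hist[a:b]) / (b - a)
        noisy = center + laplace_noise(sensitivity / epsilon)
        out.extend([noisy] * (b - a))
    return out
```

With a very large epsilon the noise becomes negligible, so each published bin approaches its cluster mean, which makes the two sources of error (clustering vs. noise) easy to see.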
At a high level, our work is similar to the works in [2,18,31,34,56,76,84], in that it aims to protect a histogram (or it can be applied to each histogram in a dataset of histograms). However, it differs from the works in [2,18,31,34,56,76,84] along two dimensions: (I) It considers a histogram that represents the locations associated to a single user, instead of a histogram representing the values of many different individuals in an attribute of an underlying dataset. (II) It sanitizes a histogram by redistributing counts between bins, as specified by Problems 3.1, 3.2, and 3.3, instead of adding noise into the counts. Thus, the methods in [2,18,31,34,56,76,84] cannot be used to deal with the problems we consider. In fact, applying any of the methods in [2,18,31,34,56,76,84] to a histogram that represents the locations of a single user would simply prevent the inference of the exact frequencies (counts) of locations in the user's histogram. It would not protect against the disclosure of visits to sensitive locations (i.e., it cannot solve the SLH problem), nor against the disclosure of the fact that the histogram is similar/dissimilar to a target histogram (i.e., it cannot solve the T A/T R problem).
A different, less related class of works can be used to protect a histogram by making it indistinguishable within a published set of histograms [24,75]. These works differ from ours in their setting, in their privacy notion, or both. They differ in setting because they consider a set of histograms (or, more generally, vectors of frequencies [24,75]) rather than a single histogram containing the location information of a single user. They differ in privacy notion because they aim to prevent the disclosure of the identity of individuals from the published set of histograms (i.e., the association of a histogram with identity information that is known to an attacker), rather than the inference of location information from a single histogram.

Privacy-preserving recommendation
There are several privacy-preserving recommendation methods. Most of them (e.g., [42,57]) assume there is a trusted server that applies privacy protection (e.g., anonymization) jointly to the data of many users. Unlike these methods, we assume a different setting, in which the user protects their histogram by themselves. Our setting is conceptually similar to the untrusted-server setting [53,59,60], in which a user protects their data prior to disseminating them. Specifically, [59,60] propose methods in which a user applies differential privacy, while [53] proposes a method in which the user applies randomized perturbation. The privacy objective of these methods is to prevent the inference of exact user values. In contrast, we do not directly aim to prevent the inference of exact user values: our privacy notions are formalized by the SLH and TA/TR problems. Also, we do not require that the protected histograms be used for recommendation, although we show experimentally that the protected histograms produced by our approach preserve the accuracy of recommendation fairly well.
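To make the untrusted-server setting concrete, here is a minimal sketch of client-side randomized perturbation applied to a histogram before it leaves the user's device. The function name and the uniform noise distribution are assumptions for illustration, not the specific scheme of [53]:

```python
import random

def perturb_counts(histogram, noise_range=2):
    """Client-side randomized perturbation: each count is perturbed
    independently before the histogram is disseminated, so the server
    never sees the exact values.

    `noise_range` is an illustrative parameter, not taken from any of
    the cited methods; counts are clipped at zero to stay valid.
    """
    return [max(0, c + random.randint(-noise_range, noise_range))
            for c in histogram]
```

Note the contrast with our approach: such perturbation hides exact counts, whereas SLH and TA/TR redistribute counts to hide sensitive visits or profile similarity.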

Conclusion
In this paper, we propose two new notions of histogram privacy, sensitive location hiding and target avoidance/resemblance, which lead to the following optimization problems: the Sensitive Location Hiding (SLH) problem, which seeks to enforce the notion of sensitive location hiding with optimal quality, and the Target Avoidance/Resemblance (TA/TR) problem, which seeks to enforce target avoidance/resemblance with bounded quality loss. We also propose an optimal algorithm for each problem, as well as an efficient heuristic for the TA/TR problem. Our experiments demonstrate that our methods are effective at preserving the distribution of locations in a histogram, as well as the quality of recommendations based on these locations, while being fairly efficient.

A.1 Proof of weak NP-hardness for the SLH problem

We reduce the weakly NP-hard Multiple Choice Knapsack (MCK) problem [32,63] to a special case of the SLH problem. The MCK problem is defined as follows:

minimize $\sum_{i\in[1,m]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ (Eq. A.1)

subject to: (I) $\sum_{i\in[1,m]}\sum_{j\in C_i} w_{ij}\cdot x_{ij} = b$, (II) $\sum_{j\in C_i} x_{ij} = 1$, $i = 1,\ldots,m$, and (III) $x_{ij}\in\{0,1\}$, $i = 1,\ldots,m$, $j\in C_i$.

In MCK, we are given a set of elements subdivided into m mutually exclusive classes, $C_1,\ldots,C_m$, and a knapsack. Each class $C_i$ has $|C_i|$ elements. Each element $j\in C_i$ has a cost $c_{ij}\ge 0$ and a weight $w_{ij}$. The goal is to minimize the total cost (Eq. A.1) by filling the knapsack with one element from each class (constraint II), such that the weights of the elements in the knapsack satisfy constraint I, where $b\ge 0$ is a constant. The variable $x_{ij}$ takes the value 1 if element j is chosen from class $C_i$, and 0 otherwise (constraint III).
We map a given instance $I_{MCK}$ to an instance $I_{SLH}$ of the special case of SLH in polynomial time, as follows: (I) Each class $C_i$, $i\in[1,m]$, is mapped to a location $L_i\notin L'$ whose count $f(L_i)$ in H is arbitrary. (II) A sensitive location $L_{m+1}\in L'$ (without loss of generality) is considered. The count of $L_{m+1}$ in H is set to $f(L_{m+1}) = b$. Thus, $H = (f(L_1),\ldots,f(L_m),b)$. (III) Each element $j\in C_i$ with weight $w_{ij}$ and cost $c_{ij}$ is mapped to an operation on H, which decreases $f(L_{m+1})$ by $w_{ij}$ and increases $f(L_i)$ by $w_{ij}$ (i.e., transfers $w_{ij}$ visits from $L_{m+1}$ to $L_i$) and incurs $q(H,H'[i]) = c_{ij}$. If there are multiple operations such that $q(H,H'[i]) = c_{ij}$ (e.g., when q is the $L_1$ distance), we select one arbitrarily. When $x_{ij} = 1$, its corresponding operation is applied to H. The result of applying all operations to H is referred to as the sanitized histogram $H'$.
We prove the correspondence between a solution S to $I_{MCK}$ and a solution $H'$ to $I_{SLH}$, as follows. We first prove that, if S is a solution to $I_{MCK}$, then $H'$ is a solution to $I_{SLH}$. Since $\sum_{i\in[1,m]}\sum_{j\in C_i} w_{ij}\cdot x_{ij} = b$, $f(L_{m+1})$ is decreased by b. Thus, $H'[m+1] = 0$ (i.e., all visits to $L_{m+1}$ were transferred to nonsensitive locations) and, since $\sum_{i\in[1,m]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ is minimum, the quality loss of $H'$ is minimum. Thus, $H'$ is a solution to $I_{SLH}$. Conversely, if $H'$ is a solution to $I_{SLH}$, then all visits to $L_{m+1}$ are transferred to nonsensitive locations, so $\sum_{i\in[1,m]}\sum_{j\in C_i} w_{ij}\cdot x_{ij} = b$ (satisfying constraint I), and the quality loss of $H'$ is minimum, so $\sum_{i\in[1,m]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ is minimum. Thus, S is a solution to $I_{MCK}$.
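The weak NP-hardness above goes hand in hand with a pseudo-polynomial dynamic program for MCK. The following is a generic sketch for MCK with the equality constraint, where each class is a list of (weight, cost) pairs; it is an illustration of the MCK structure used in the reduction, not the paper's SLH algorithm:

```python
def solve_mck_equal(classes, b):
    """Multiple-Choice Knapsack with an equality constraint, solved by
    dynamic programming (pseudo-polynomial in b, matching weak NP-hardness).

    `classes` is a list of classes; each class is a list of (weight, cost)
    pairs. Exactly one element is picked per class so that the weights sum
    to exactly b and the total cost is minimized. Returns the minimum cost,
    or None if no feasible selection exists.
    """
    INF = float("inf")
    dp = [INF] * (b + 1)  # dp[w] = min cost over processed classes at weight w
    dp[0] = 0.0
    for cls in classes:
        nxt = [INF] * (b + 1)
        for w in range(b + 1):
            if dp[w] == INF:
                continue
            for weight, cost in cls:
                if w + weight <= b:
                    nxt[w + weight] = min(nxt[w + weight], dp[w] + cost)
        dp = nxt
    return None if dp[b] == INF else dp[b]
```

In the reduction's terms, each class gathers the candidate operations for one nonsensitive location, a weight is the number of visits transferred from the sensitive location, and a cost is the incurred quality loss.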
Therefore, the special case of the SLH problem with $|L'| = 1$ is weakly NP-hard and, clearly, the SLH problem with $|L'|\ge 1$ is also weakly NP-hard.
A.2 Proof of weak NP-hardness for the TR problem

We reduce the weakly NP-hard Multiple Choice Knapsack ($MCK_{\ge}$) problem [32,63] to the TR problem. The $MCK_{\ge}$ problem is defined as follows:

minimize $\sum_{i\in[1,n]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ (Eq. A.2)

subject to: (I) $\sum_{i\in[1,n]}\sum_{j\in C_i} w_{ij}\cdot x_{ij} \ge b$, (II) $\sum_{j\in C_i} x_{ij} = 1$, $i = 1,\ldots,n$, and (III) $x_{ij}\in\{0,1\}$, $i = 1,\ldots,n$, $j\in C_i$.

In $MCK_{\ge}$, we are given a set of elements subdivided into n mutually exclusive classes, $C_1,\ldots,C_n$, and a knapsack. Each class $C_i$ has $|C_i|$ elements. Each element $j\in C_i$ has a cost $c_{ij}\ge 0$ and a weight $w_{ij}$. The goal is to minimize the total cost (Eq. A.2) by filling the knapsack with one element from each class (constraint II), such that the weights of the elements in the knapsack satisfy constraint I, where $b\ge 0$ is a constant. The variable $x_{ij}$ takes the value 1 if element j is chosen from class $C_i$, and 0 otherwise (constraint III).
We map a given instance $I_{MCK_{\ge}}$ to an instance $I_{TR}$ of TR in polynomial time, as follows: Each class $C_i$, $i\in[1,n]$, is mapped to a location $L_i$, which has an arbitrary count in H and a possibly different, arbitrary count in $H''$, and each element $j\in C_i$ corresponds to an operation that changes the count $H[i]$ by $k_{ij}$.

We prove the correspondence between a solution S to $I_{MCK_{\ge}}$ and a solution $H'$ to $I_{TR}$, as follows. We first prove that, if S is a solution to $I_{MCK_{\ge}}$, then $H'$ is a solution to $I_{TR}$. Since $\sum_{i\in[1,n]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ is minimum, $\sum_{i\in[1,n]}(\max_{i\in[1,n],j\in C_i} c_{ij})\cdot p(H'[i],H'')$ is minimum. Thus, $d_p(H',H'') = \sum_{i\in[1,n]} p(H'[i],H'')$ is minimum. Since $\sum_{i\in[1,n]}\sum_{j\in C_i} w_{ij}\cdot x_{ij}\ge b$ and $b = \max_{i\in[1,n],j\in C_i} w_{ij}\cdot(n-\epsilon)$, it holds that $\sum_{i\in[1,n]}(1-q(H,H[i]+k_{ij}))\ge n-\epsilon$. This implies $n-d_q(H,H')\ge n-\epsilon$ and hence $d_q(H,H')\le\epsilon$. Therefore, $H'$ is a solution to $I_{TR}$. We now prove that, if $H'$ is a solution to $I_{TR}$, then S is a solution to $I_{MCK_{\ge}}$. Since $d_p(H',H'') = \sum_{i\in[1,n]} p(H'[i],H'')$ is minimum, $\sum_{i\in[1,n]}\sum_{j\in C_i} c_{ij}\cdot x_{ij}$ is minimum, and since $d_q(H,H')\le\epsilon$, it holds that $\sum_{i\in[1,n]}\sum_{j\in C_i} w_{ij}\cdot x_{ij}\ge b$. Thus, S is a solution to $I_{MCK_{\ge}}$. Therefore, TR is weakly NP-hard.

A.3 Proof of weak NP-hardness for the TA problem
We reduce the weakly NP-hard TR problem (see Appendix A.2) to the TA problem as follows.
We map a given instance $I_{TR}$ of the TR problem into an instance $I_{TA}$ of the TA problem in polynomial time, by mapping H and $H''$ to histograms $H_{TA} = H$ and $H''_{TA} = H''$, respectively, defining $d_p(H'_{TA},H''_{TA}) = \frac{2}{d_p(H',H'')+1}$ and $d_q(H_{TA},H'_{TA}) = d_q(H,H')$, and setting $\epsilon_{TA} = \epsilon$. Note that $d_p(H',H'')\ge 0$ by definition, so $d_p(H'_{TA},H''_{TA})\in(0,2]$.
We prove the correspondence between a solution $H'$ to $I_{TR}$ and a solution $H'_{TA}$ to $I_{TA}$, as follows. We first prove that, if $H'$ is a solution to $I_{TR}$, then $H'_{TA}$ is a solution to $I_{TA}$. Since $H'$ is a solution to $I_{TR}$, $d_p(H',H'')$ is minimum. Thus, $d_p(H'_{TA},H''_{TA}) = \frac{2}{d_p(H',H'')+1}$ is maximum. In addition, $d_q(H,H')\le\epsilon$. Thus, $d_q(H_{TA},H'_{TA}) = d_q(H,H')\le\epsilon_{TA}$. Therefore, $H'_{TA}$ is a solution to $I_{TA}$.
We now prove that, if $H'_{TA}$ is a solution to $I_{TA}$, then $H'$ is a solution to $I_{TR}$. Since $H'_{TA}$ is a solution to $I_{TA}$, $d_p(H'_{TA},H''_{TA})$ is maximum. Thus, $d_p(H',H'') = \frac{2}{d_p(H'_{TA},H''_{TA})}-1$ is minimum. In addition, $d_q(H,H') = d_q(H_{TA},H'_{TA})\le\epsilon_{TA} = \epsilon$. Therefore, $H'$ is a solution to $I_{TR}$. Therefore, TA is weakly NP-hard.
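The reduction hinges only on the transformed distance being strictly decreasing in $d_p$. A quick numeric check, assuming the transform $d\mapsto 2/(d+1)$ (which matches the stated range (0,2]), confirms that minimizing the TR distance and maximizing the transformed TA distance select the same candidate:

```python
def transform(d):
    """Monotone decreasing map from [0, inf) onto (0, 2].

    Assuming the reduction's transformed distance is d -> 2 / (d + 1),
    minimizing the original distance d is equivalent to maximizing the
    transformed distance, which is how the TR objective (minimization)
    is recast as the TA objective (maximization).
    """
    return 2.0 / (d + 1.0)

# A candidate minimizing d also maximizes transform(d), so an optimal
# solution of one problem corresponds to an optimal solution of the other.
candidates = [0.0, 0.5, 1.0, 4.0]
assert min(candidates) == max(candidates, key=transform)
```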

A.4 Reduction from the TA to the TR problem
The TA problem can be reduced to the TR problem in polynomial time, as follows. Given an instance $I_{TA}$ of the TA problem, we can construct an instance $I_{TR}$ of the TR problem in polynomial time, by mapping H and $H''$ to histograms $H_{TR} = H$ and $H''_{TR} = H''$, defining $d_p(H'_{TR},H''_{TR}) = \frac{2}{d_p(H',H'')+1}$ and $d_q(H_{TR},H'_{TR}) = d_q(H,H')$, and setting $\epsilon_{TR} = \epsilon_{TA}$. The correspondence between solutions follows symmetrically to the proof of Appendix A.3.

A.5 Additional results for the $L_2$ distance

This appendix provides results for the $L_2$ distance. The results in Figures 27a, 27b, 27c, and 27d are qualitatively similar to those in Figures 17a, 17b, 18c, and 20a, respectively.