Abstract
With the advances of graph analytics, preserving privacy in publishing graph data becomes an important task. However, graph data is highly sensitive to structural changes. Perturbing graph data to achieve differential privacy inevitably requires injecting a large amount of noise, and the utility of anonymized graphs is severely limited. In this paper, we propose a microaggregation-based framework for graph anonymization which meets the following requirements: (1) The topological structures of an original graph can be preserved at different levels of granularity; (2) \(\varepsilon \)-differential privacy is guaranteed for an original graph through adding controlled perturbation to its edges (i.e., edge privacy); (3) The utility of graph data is enhanced by reducing the magnitude of noise needed to achieve \(\varepsilon \)-differential privacy. Within the proposed framework, we further develop a simple yet effective microaggregation algorithm under a distance constraint. We have empirically verified the noise reduction and privacy guarantee of our proposed algorithm on three real-world graph datasets. The experiments show that our proposed framework can significantly reduce the noise added to achieve \(\varepsilon \)-differential privacy over graph data, and thus enhance the utility of anonymized graphs.
Keywords
 Privacy-preserving graph data publishing
 Differential privacy
 Graph data utility
 dK-graphs
 Graph anonymization
1 Introduction
Graph data analysis has been widely performed in real-life applications. For instance, online social networks are explored to analyze human social relationships, election networks are studied to discover different opinions in a community, and co-author networks are used to understand collaboration relationships among researchers [22]. However, such networks often contain sensitive or personally identifiable information, such as social contacts, personal opinions and private communication records. Publishing graph data can thus pose a privacy threat. To preserve graph data privacy, various anonymization techniques for graph data publishing have been proposed in the literature [1, 11, 14, 24]. Nonetheless, even when a graph is anonymized without publishing any identity information, an individual may still be revealed based on structural information of a graph [11].
In recent years, differential privacy [5] has emerged as a widely recognized mathematical framework for privacy. A number of studies [10, 18] have investigated the problem of publishing anonymized graphs under the guarantee of differential privacy. However, graph data is highly sensitive to structural changes. Directly perturbing graph data often requires injecting a large amount of random noise, and the utility of anonymized graphs is severely impacted. To deal with this issue, several works [19,20,21,22] have explored techniques of indirectly perturbing graph data through a graph abstraction model, such as the dK-graph model [16] and hierarchical random graph (HRG) model [2], or spectral graph methods. The central ideas behind these works are to first project a graph into a statistical representation (e.g., degree distribution and dendrogram), or a spectral representation (e.g., adjacency matrix), and then add random noise to perturb such representations. Although these techniques are promising, they can only achieve \(\varepsilon \)-differential privacy over a graph by injecting random noise of magnitude proportional to the sensitivity of queries, which is fixed to the global sensitivity. Due to the high sensitivity of graph data to structural changes, the utility of anonymized graphs published by these works is still limited.
To alleviate this limitation, we aim to develop a general framework for anonymizing graphs which meets the following requirements: (1) The topological structures of an original graph can be preserved at different levels of granularity; (2) \(\varepsilon \)-differential privacy is guaranteed for an original graph through adding controlled perturbation to its edges (i.e., edge privacy [13]); (3) The utility of graph data is enhanced by reducing the magnitude of noise needed to achieve \(\varepsilon \)-differential privacy. We observe that the dK-graph model [15, 16] for analyzing network topologies can serve as a good basis for generating structure-preserving anonymized graphs. Essentially, the dK-graph model generates dK-graphs by retaining a series of network topology properties extracted from d-sized subgraphs in an original graph. In order to reduce the amount of random noise without compromising \(\varepsilon \)-differential privacy, we incorporate microaggregation techniques [4] into the dK-graph model to reduce the sensitivity of queries. This enables perturbation to be applied to network topology properties at a flexible level of granularity, depending on the degree of microaggregation.
Figure 1 provides a high-level overview of our proposed framework. Given two neighboring graphs \(G \sim G^{\prime }\), network topology properties such as dK-distributions [16] are first extracted from each graph. Then a dK-distribution goes through a microaggregation procedure, which consists of partition and aggregation. After that, the microaggregated dK-distribution is perturbed, yielding an \(\varepsilon \)-differentially private dK-distribution. Finally, based on the perturbed dK-distribution, \(\varepsilon \)-differentially private dK-graphs are generated. That is, for two neighboring graphs \(G \sim G^{\prime }\), their corresponding anonymized graphs generated by this framework are \(\varepsilon \)-indistinguishable.
Contributions. To summarize, our work makes the following contributions: (1) We present a novel framework, called dK-microaggregation, that can leverage a series of network topology properties to generate \(\varepsilon \)-differentially private anonymized graphs. (2) We propose a distance-constrained algorithm for approximating dK-distributions of a graph via microaggregation within the proposed framework, which enables us to reduce the amount of noise added into \(\varepsilon \)-differentially private anonymized graphs. (3) We have empirically verified the noise reduction of our proposed framework on three real-world networks. It shows that our algorithm can effectively enhance the utility of the generated anonymized graphs by providing better within-cluster homogeneity and reducing the amount of noise, in comparison with the state-of-the-art methods.
2 Problem Formulation
Let \(G=(V,E)\) be a simple undirected graph, where V is the set of nodes and E the set of edges in G. We use deg(v) to denote the degree of a node v, and deg(G) to denote the maximum degree of G.
Definition 1
(Neighboring graphs). Two graphs \(G = (V,E)\) and \(G^{\prime }= (V^{\prime }, E^{\prime })\) are said to be neighboring graphs, denoted as \(G \sim G^{\prime }\), iff \(V = V^{\prime }\), \(E \subset E^{\prime }\) and \(|E| + 1 = |E^{\prime }|\).
The dK-graph model [16] provides a systematic way of extracting subgraph degree distributions, i.e. dK-distributions, from a given graph.
Definition 2
(dK-distribution). A dK-distribution dK(G) over a graph G is the probability distribution on the connected subgraphs of size d in G.
Specifically, the 1K-distribution captures a degree distribution, the 2K-distribution captures a joint degree distribution, i.e. the number of edges between nodes of different degrees, and the 3K-distribution captures a clustering coefficient distribution, i.e. the number of triangles and wedges connecting nodes of different degrees. When \(d = |V|\), the dK-distribution specifies the entire graph. For larger values of d, dK-distributions capture more complex properties of a graph at the expense of higher computational overhead [16]. To describe how a dK-distribution is extracted from a graph, we define the notion of dK function.
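To make the \(d = 2\) case concrete, a joint degree distribution can be extracted with a single pass over the edges. The sketch below is ours (names such as `two_k_distribution` are not from the paper) and assumes a simple undirected graph given as an edge list:

```python
from collections import Counter

def two_k_distribution(edges):
    """Extract the 2K-distribution (joint degree distribution) of a
    simple undirected graph given as a list of edges: a mapping from
    each unordered degree pair (g, g') to the number of edges whose
    endpoints have degrees g and g'."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    dist = Counter()
    for u, v in edges:
        g, gp = sorted((deg[u], deg[v]))
        dist[(g, gp)] += 1
    return dist

# A path 1-2-3-4 plus the chord 2-4:
edges = [(1, 2), (2, 3), (3, 4), (2, 4)]
print(two_k_distribution(edges))  # Counter({(2, 3): 2, (1, 3): 1, (2, 2): 1})
```

Each edge contributes one unit of frequency to exactly one degree pair, so the counts sum to the number of edges.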
Definition 3
(dK function). Let \(\mathbb {G}=\{(V,E') \mid E'\subseteq V\times V\}\) be the set of all graphs with the set V of nodes. A dK function \(\gamma ^{dK}: \mathbb {G}\rightarrow \mathbb {D}\) maps a graph in \(\mathbb {G}\) to its dK-distribution in \(\mathbb {D}\) s.t. \(\gamma ^{dK}(G)=dK(G)\).
Following the previous work [16], we define a dK-graph as a graph that can be constructed by reproducing the corresponding dK-distribution.
Definition 4
(dK-graph). A dK-graph over dK(G) is a graph in which the connected subgraphs of size d satisfy the probability distribution in dK(G).
Conceptually, a dK-graph can be considered as an anonymized version of an original graph G that retains certain topological properties of G at a chosen level of granularity. In this paper, we aim to generate dK-graphs with an \(\varepsilon \)-differential privacy guarantee for preserving the privacy of structural information between nodes of a graph (edge privacy). We formally define differentially private dK-graphs below.
Definition 5
(Differentially private dK-graphs). A randomized mechanism \(\mathcal {K}\) provides \(\varepsilon \)-differentially private dK-graphs, if for each pair of neighboring graphs \(G \sim G^{\prime }\) and all possible outputs \(\mathcal {G} \subseteq range(\mathcal {K})\), the following holds: \(Pr[\mathcal {K}(G)\in \mathcal {G}] \le e^{\varepsilon }\times Pr[\mathcal {K}(G^{\prime })\in \mathcal {G}]\), where \(\mathcal {G}\) is a family of dK-graphs, and \(\varepsilon > 0\) is the differential privacy parameter. Smaller values of \(\varepsilon \) provide stronger privacy guarantees [5].
3 dKMicroaggregation Framework
In this section, we present a novel framework, dK-Microaggregation, for generating \(\varepsilon \)-differentially private dK-graphs. Without loss of generality, we will use the 2K-distribution to illustrate our proposed framework. This is due to two reasons: (1) As previously discussed in [15, 16], the \(d = 2\) case is sufficient for most practical purposes; (2) dK-generators for \(d=2\) have been well studied [9, 15], whereas dK-generators for \(d \ge 3\) have not yet been discovered [9]. Given a graph \(G=(V,E)\), we have \(2K(G)=\{(g,g',m) \mid m=|E_{(g,g')}|\}\), where \((g,g')\) is a degree pair and \(E_{(g,g')}=\{(v,v')\in E \mid g=deg(v)\wedge g'=deg(v')\}\) is the set of edges with the degree pair \((g,g')\).
Previous studies [19, 20] have shown that changing a single edge in a graph may result in one or more changes to tuples in its corresponding dK-distribution. The following lemma states the maximum number of changes between the 2K-distributions of two neighboring graphs.
Lemma 1
Let \(G \sim G^{\prime }\) be two neighboring graphs. Then \(\gamma ^{dK}(G)\) and \(\gamma ^{dK}(G^{\prime })\) differ in at most \(4\times g +1\) tuples, where \(d=2\) and \(g=\max (deg(G), deg(G'))\).
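Lemma 1 can be checked empirically on small graphs. The sketch below (ours, not from the paper) compares the 2K-distributions of a graph and its neighbor obtained by adding one edge, and verifies the \(4\times g + 1\) bound:

```python
from collections import Counter

def two_k(edges):
    """2K-distribution: unordered degree pair -> edge count."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    dist = Counter()
    for u, v in edges:
        dist[tuple(sorted((deg[u], deg[v])))] += 1
    return dist

def max_degree(edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return max(deg.values())

def differing_tuples(d1, d2):
    """Number of degree pairs whose frequency differs between d1 and d2."""
    return sum(1 for t in set(d1) | set(d2) if d1[t] != d2[t])

# Neighboring graphs: a 4-cycle G and G' = G plus the chord (1, 3).
G = [(1, 2), (2, 3), (3, 4), (4, 1)]
Gp = G + [(1, 3)]
g = max(max_degree(G), max_degree(Gp))
changed = differing_tuples(two_k(G), two_k(Gp))
print(changed, "<=", 4 * g + 1)  # 3 <= 13
```

Adding the chord raises the degrees of its endpoints, so every tuple involving an edge incident to those endpoints can change, which is what the \(4\times g + 1\) bound accounts for.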
In our work, for each dK-distribution D, we want to generate \(D_{\varepsilon }\), an anonymized version of D satisfying \(\varepsilon \)-differential privacy. Thus, we view the response to a dK function \(\gamma ^{dK}\) for \(d=2\) as a collection of responses to degree queries, one for each tuple in a 2K-distribution.
Definition 6
(Degree query). A degree query \(q_t: \mathbb {G}\rightarrow \mathbb {R}\) maps a degree pair \(t= (g_1, g_2)\) in a graph \(G\in \mathbb {G}\) to a frequency value in \(\mathbb {R}\) s.t. \((g_1, g_2, q_t(G))\in \gamma ^{dK}(G)\).
To guarantee \(\varepsilon \)differential privacy for each \(q_t\), we can add random noise into the real response \(q_t(G)\), yielding a randomized response \(q_t(G) + Lap(\varDelta (q_t)/\varepsilon )\), where \(\varDelta (q_t)\) denotes the sensitivity of \(q_t\) and \(Lap(\varDelta (q_t)/\varepsilon )\) denotes random noise drawn from a Laplace distribution.
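A minimal sketch of this Laplace mechanism (ours; in practice `numpy.random.laplace` serves the same purpose) draws the noise by inverse-CDF sampling:

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Lap(scale) via inverse-CDF sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_degree_query(true_count, sensitivity, epsilon):
    """Randomized response q_t(G) + Lap(Delta(q_t)/epsilon)."""
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(42)
print(private_degree_query(true_count=17, sensitivity=1.0, epsilon=1.0))
```

Smaller \(\varepsilon \) or larger sensitivity widens the noise scale \(\varDelta (q_t)/\varepsilon \), trading utility for privacy.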
If we query D with a set of degree queries \(\{q_t\}_{t\in D}\) and the response to each \(q_t\) satisfies \(\varepsilon \)-differential privacy, by the parallel composition property of differential privacy [17], we can generate \(D_{\varepsilon }\) that satisfies \(\varepsilon \)-differential privacy. However, the total amount of random noise being added into the responses can be very high, particularly when a graph is large. To control the amount of random noise and thus increase the utility of \(D_{\varepsilon }\), we microaggregate similar tuples in D before adding noise. Thus, the dK function \(\gamma ^{dK}\) is replaced by \(\gamma ^{dK} \circ \mathcal {M}\), i.e., we run \(\gamma ^{dK}\) on the microaggregated dK-distribution \(\overline{D}\) resulting from running a microaggregation algorithm \(\mathcal {M}\) over D. The response to \(\gamma ^{dK} \circ \mathcal {M}\) is a collection of responses to microaggregate degree queries, one for each cluster in \(\overline{D}\).
Definition 7
(Microaggregate degree query). A microaggregate degree query \(q^{*}_T: \mathbb {G}\rightarrow \mathbb {R}\) maps a set of degree pairs T in a graph \(G\in \mathbb {G}\) to a frequency value in \(\mathbb {R}\) s.t. \(q^{*}_T(G)= sum(\{q_t(G) \mid t=(g_1,g_2), t\in T, (g_1,g_2, q_{t}(G))\in \gamma ^{dK}(G)\})\).
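In code, a microaggregate degree query is just a sum over the cluster's tuples. The sketch below (ours) also shows that \(q_t\) is the special case \(T=\{t\}\):

```python
from collections import Counter

def microaggregate_degree_query(dist, cluster):
    """q*_T: total frequency of all degree pairs in cluster T, looked up
    in a 2K-distribution given as a mapping (g, g') -> count."""
    return sum(dist.get(t, 0) for t in cluster)

dist = Counter({(1, 3): 1, (2, 2): 1, (2, 3): 2})
print(microaggregate_degree_query(dist, [(2, 2), (2, 3)]))  # 3
print(microaggregate_degree_query(dist, [(1, 3)]))          # 1 (= q_t)
```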
Indeed, we can see that \(q_t\) is a special case of \(q^*_T\) since \(q_t(G)=q^*_T(G)\) holds for \(T=\{t\}\). By Lemma 1, we have the following lemma about \(q_t\) and \(q^*_T\).
Lemma 2
The sensitivity of both \(q_t\) and \(q^*_T\) on a graph G is upper bounded by \((4\times deg(G) +1)\).
For each cluster in \(\overline{D}\) that results from running \(\mathcal {M}\), only an aggregated frequency value for the cluster of tuples is returned by a microaggregate degree query. Thus, \(\gamma ^{dK} \circ \mathcal {M}\) is less “sensitive” when the number of clusters in \(\overline{D}\) is smaller. By Lemma 2 and the fact that changing one edge in a graph may lead to changes on multiple clusters in \(\overline{D}\), we have the following lemma about the sensitivity of \(\gamma ^{dK} \circ \mathcal {M}\).
Lemma 3
Let \(C_1, \dots C_n\) be the clusters in \(\overline{D}\) resulting from running \(\mathcal {M}\) over \(\gamma ^{dK}(G)\). Then the sensitivity of \(\gamma ^{dK} \circ \mathcal {M}\) is upper bounded by \((4\times g +1)\times n\).
Generally, dK-microaggregation works in the following steps. First, it extracts a dK-distribution from a graph. Then, it microaggregates the dK-distribution and perturbs the microaggregated dK-distribution to generate an \(\varepsilon \)-differentially private dK-distribution. Finally, a dK-graph is generated.
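Assuming a clustering of the tuples is already given, the middle steps can be sketched as follows (ours; the final dK-graph generation step would use a 2K generator such as Orbis and is omitted):

```python
import math
import random

def laplace_noise(scale):
    """One draw from Lap(scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_microaggregated_2k(dist, clusters, sensitivity, epsilon):
    """Aggregate each cluster's frequencies, then perturb each aggregate
    with Laplace noise calibrated to the query sensitivity and epsilon."""
    noisy = []
    for cluster in clusters:
        total = sum(dist.get(t, 0) for t in cluster)
        noisy.append((tuple(cluster), total + laplace_noise(sensitivity / epsilon)))
    return noisy

random.seed(7)
dist = {(1, 2): 3, (2, 2): 5, (4, 5): 1}
clusters = [[(1, 2), (2, 2)], [(4, 5)]]
print(private_microaggregated_2k(dist, clusters, sensitivity=9.0, epsilon=1.0))
```

With fewer clusters, fewer noisy responses are released, which is the source of the noise reduction described above.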
4 Proposed Algorithm
In this section, we discuss algorithms for microaggregating dK-distributions. Generally, a microaggregation algorithm for dK-distributions \(\mathcal {M}=(\mathcal {C}, \mathcal {A})\) consists of two phases: (a) Partition, in which similar tuples in a dK-distribution are partitioned into the same cluster; (b) Aggregation, in which the frequency values of tuples in the same cluster are aggregated. As illustrated in Fig. 2, a 2K-distribution D is partitioned into multiple clusters by a clustering function \(\mathcal {C}\), i.e. \(\mathcal {C}(D) = D^{\prime }\). Then, the frequency values of tuples in each cluster are aggregated by an aggregate function \(\mathcal {A}\), i.e. \(\mathcal {A}(D^{\prime }) = \overline{D}\).
MDAV-dK Algorithm. Given a dK-distribution \(D=\gamma ^{dK}(G)\) over a graph G, a simple way of microaggregating D is to partition D such that each cluster contains at least k tuples. For this, we use a simple microaggregation heuristic, called Maximum Distance to Average Vector (MDAV) [4], which generates clusters of the same size k, except one cluster of size between k and \(2k - 1\). However, unlike the standard version of MDAV, which aggregates each cluster by replacing each record in the cluster with a representative record, we perform aggregation by combining the frequency values of tuples in each cluster into an aggregated frequency value. To avoid ambiguity, we call our MDAV-based algorithm for microaggregating dK-distributions the MDAV-dK algorithm.
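A simplified sketch of the MDAV partition step on degree pairs (ours, following the generic MDAV heuristic [4]; it assumes at least k points and uses Euclidean distance):

```python
import math

def mdav_partition(points, k):
    """Partition 2-D points (degree pairs) into clusters of size k;
    the leftover points (fewer than 2k) form one final cluster."""
    def centroid(pts):
        return (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def take_nearest(pts, anchor):
        pts.sort(key=lambda p: dist(p, anchor))
        return pts[:k], pts[k:]

    remaining = list(points)
    clusters = []
    while len(remaining) >= 3 * k:
        c = centroid(remaining)
        xr = max(remaining, key=lambda p: dist(p, c))   # farthest from centroid
        xs = max(remaining, key=lambda p: dist(p, xr))  # farthest from xr
        cluster, remaining = take_nearest(remaining, xr)
        clusters.append(cluster)
        cluster, remaining = take_nearest(remaining, xs)
        clusters.append(cluster)
    if len(remaining) >= 2 * k:
        xr = max(remaining, key=lambda p: dist(p, centroid(remaining)))
        cluster, remaining = take_nearest(remaining, xr)
        clusters.append(cluster)
    if remaining:
        clusters.append(remaining)
    return clusters

print([len(c) for c in mdav_partition([(i, i) for i in range(10)], 3)])  # [3, 3, 4]
```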
It is well known that, for many real-world networks such as Twitter, degree distributions are often highly skewed. This often leads to highly skewed dK-distributions for such networks. However, due to inherent limitations of MDAV, e.g. the fixed-size constraint, MDAV-dK would suffer significant information loss when evenly partitioning a highly skewed dK-distribution into clusters of the same size. To address this issue, we propose an algorithm called Maximum Pairwise Distance Constraint (MPDC-dK).
MPDC-dK Algorithm. Unlike MDAV-dK, MPDC-dK aims to partition a dK-distribution into clusters under a distance constraint. Specifically, after partitioning, the distances between the corresponding degrees in any two tuples within a cluster should be no greater than a specified distance interval \(\tau \). Take a 2K-distribution D for example. Let \(t_1=(g_1,g'_1, m_1)\) and \(t_2=(g_2, g'_2, m_2)\) be two tuples in a cluster after applying MPDC-dK on D. Then, we say that these two tuples satisfy a distance constraint \(\tau \) iff \(max(|g_1-g_2|, |g'_1-g'_2|) \le \tau \). The clustering problem addressed by MPDC-dK is thus to generate the minimum number of clusters in which every pair of tuples from the same cluster satisfies such a distance constraint \(\tau \).
The conceptual idea behind our MPDC-dK algorithm design is to consider each degree pair \((g, g^{\prime })\) as coordinates in a two-dimensional space, and to treat the above distance constraint \(\tau \) as a \(\tau \)-by-\(\tau \) box, denoted by \(((x,x^{\prime }), \tau )\) and centered at \((x,x^{\prime })\), in the same two-dimensional space. Clearly, such a box corresponds to a cluster that satisfies the distance constraint \(\tau \), and a box \(((x,x^{\prime }), \tau )\) covers a degree pair \((g, g^{\prime })\) iff \(x - \tau /2 \le g \le x + \tau /2\) and \(x^{\prime } - \tau /2 \le g^{\prime } \le x^{\prime } + \tau /2\). MPDC-dK employs a greedy algorithm to find the minimum number of boxes (i.e., clusters) that cover all degree pairs. MPDC-dK first enumerates all boxes that cover at least one degree pair and records the corresponding counts, i.e., the number of degree pairs covered by each box. MPDC-dK then recursively selects a box with the maximum count (i.e., covering the maximum number of degree pairs) in a greedy manner, assigns these degree pairs to a new cluster, and removes them from other boxes, until all boxes are empty. After that, MPDC-dK performs aggregation to combine the frequency values of tuples in each cluster into an aggregated frequency value.
Algorithm 1 describes the details of our MPDC-dK algorithm. Given a dK-distribution D, we start by initializing an empty cluster list \(D^{\prime }\) (Line 1) and a list \(b{\_}list\) that records each box, its corresponding degree pairs, and the total number of degree pairs covered by the box (Line 2). For each degree pair \((g, g^{\prime })\) in D, we enumerate the boxes that cover \((g, g^{\prime })\) using a function \(covering{\_}boxes\) (Line 4). For each enumerated box \(b_{i}\), we update the list by adding \((g, g^{\prime })\) to \(b_{i}\) and incrementing the count of \(b_{i}\) by 1 (Lines 5–6). After creating \(b{\_}list\), we iteratively select a box \(b_{max}\) with the maximum count of degree pairs (Line 8), generate a new cluster from the degree pairs in \(b_{max}\), and add it into the cluster list (Lines 9–10). We further remove \(b_{max}\) and all degree pairs in \(b_{max}\) from \(b{\_}list\) and update the counts of the affected boxes in \(b{\_}list\) (Lines 11–15). The algorithm terminates when \(b{\_}list\) is empty and returns the set of generated clusters \(D^{\prime }\).
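The greedy box-cover steps above can be sketched as follows (our sketch, not the paper's listing; here a box is identified by its lower-left integer corner, so each degree pair is covered by \((\tau +1)^2\) boxes, matching the complexity analysis in Sect. 5):

```python
from collections import defaultdict

def mpdc_partition(pairs, tau):
    """Greedy box cover for MPDC-dK. A box with lower-left corner
    (x0, y0) covers every degree pair (g, g') with x0 <= g <= x0 + tau
    and y0 <= g' <= y0 + tau, so any two pairs sharing a box satisfy
    the distance constraint tau."""
    boxes = defaultdict(set)                 # box corner -> covered pairs
    for (g, gp) in set(pairs):
        for x0 in range(g - tau, g + 1):     # (tau+1)^2 boxes per pair
            for y0 in range(gp - tau, gp + 1):
                boxes[(x0, y0)].add((g, gp))
    clusters = []
    while boxes:
        best = max(boxes, key=lambda b: len(boxes[b]))  # densest box
        covered = boxes.pop(best)
        clusters.append(sorted(covered))
        for b in list(boxes):                # drop assigned pairs elsewhere
            boxes[b] -= covered
            if not boxes[b]:
                del boxes[b]
    return clusters

print(mpdc_partition([(1, 1), (2, 2), (10, 10)], tau=2))
# [[(1, 1), (2, 2)], [(10, 10)]]
```

Note this sketch rescans all boxes on each removal for clarity; the paper's complexity analysis relies on only updating the \(\mathcal {O}(\tau ^{2})\) overlapping boxes.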
5 Theoretical Discussion
Privacy Analysis. Here, we theoretically show that dK-graphs generated in our proposed framework are differentially private. Firstly, by Lemmas 2 and 3, we can obtain an \(\varepsilon \)-differentially private dK-distribution \(D_{\varepsilon }\) by microaggregating a dK-distribution and calibrating the amount of random noise according to the sensitivity of microaggregated degree queries. As \(D_{\varepsilon }\) only contains aggregated frequency values for clusters of tuples in a dK-distribution, we perform post-processing using a randomized algorithm f to randomly select tuples within each cluster of \(D_{\varepsilon }\) until the aggregated frequency value of the cluster is reached. Dwork and Roth [6] proved that differential privacy is immune to post-processing, i.e., the composition of a randomized algorithm with a differentially private algorithm is differentially private. This leads to the lemma below.
Lemma 4
Let \(D_{\varepsilon }\) be an \(\varepsilon \)-differentially private dK-distribution and f be a randomized algorithm for post-processing \(D_{\varepsilon }\). Then \(f(D_{\varepsilon })\) is also an \(\varepsilon \)-differentially private dK-distribution.
Based on \(f(D_{\varepsilon })\), a dK-graph can be generated using a dK-graph generator [15, 16]. Following Lemma 4, Definition 5, and the proposition of Dwork and Roth [6] on post-processing, we have the following theorem for our framework, which corresponds to a randomized mechanism \(\mathcal {K}=\gamma ^{dK}\circ \mathcal {M}\circ \mathcal {K}^{dK}\circ f\circ \widehat{\gamma }^{dK}\), where \(\widehat{\gamma }^{dK}:\mathbb {D}\rightarrow \mathbb {G}\) is a dK-graph generator.
Theorem 1
\(\mathcal {K}\) generates \(\varepsilon \)-differentially private dK-graphs.
Complexity Analysis. We analyze the time complexity of the algorithms MDAV-dK and MPDC-dK. For MDAV-dK with a constraint on the minimum size k of clusters, given a dK-distribution D as input, the complexity of MDAV-dK for clustering is similar to MDAV [4], i.e. \(\mathcal {O}(n^{2})\). For MPDC-dK with a constraint on the distance interval \(\tau \), in order to generate clusters, MPDC-dK needs to perform a sequential search over all degree pairs in D. Firstly, MPDC-dK enumerates boxes for all the degree pairs, and each degree pair is covered by at most \((\tau +1)^2\) boxes (Line 4 of Algorithm 1); hence the cost of enumerating boxes is \(\mathcal {O}(\tau ^{2}n)\) (Lines 3–6 of Algorithm 1). Secondly, MPDC-dK sorts the boxes based on the number of degree pairs covered, and iteratively selects and removes the box with the maximum count. Although it takes \(\mathcal {O}(n \log n)\) to sort and greedily select the box with the maximum count in the first iteration, each later iteration only costs \(\mathcal {O}(\tau ^{2} \log n)\) (Line 8 of Algorithm 1) because each box overlaps with at most \(4\tau ^{2}\) other boxes and removing one box only affects the counts of \(\mathcal {O}(\tau ^{2})\) boxes (Lines 11–15 of Algorithm 1). Hence, the cost of selecting and removing boxes is \(\mathcal {O}(\tau ^{2} n \log n)\) (Lines 7–15 of Algorithm 1). The overall complexity of MPDC-dK for clustering is \(\mathcal {O}(\tau ^{2} n \log n)\).
6 Experiments
We have evaluated the proposed framework to answer the following questions:

Q1. How does dK-microaggregation reduce the amount of noise added into dK-distributions while still providing an \(\varepsilon \)-differential privacy guarantee?

Q2. How do our microaggregation algorithms perform in providing better within-cluster homogeneity for dK-distributions?

Q3. What are the trade-offs between utility and privacy when generating differentially private dK-graphs?
Datasets. We used three network datasets in the experiments: (1) polbooks (see footnote 1) contains 105 nodes and 441 edges; it is a network of books about US politics. (2) ca-GrQc (see footnote 1) contains 5,242 nodes and 14,496 edges. (3) ca-HepTh (see footnote 1) contains 9,877 nodes and 25,998 edges. Both ca-GrQc and ca-HepTh are scientific collaboration networks between the authors of papers.
Baseline Methods. In order to evaluate our proposed framework, we considered the following methods: (1) \(\varepsilon \)-DP, a standard \(\varepsilon \)-differential privacy algorithm in which noise is added using the Laplace mechanism [5]; (2) MDAV-dK, which extends the standard microaggregation algorithm MDAV [4] to handle dK-distributions; (3) MPDC-dK, our proposed dK-microaggregation algorithm. We used Orbis [15] to generate 2K-distributions.
Evaluation Measures. We used Euclidean distance [19] to measure the network structural error between original and perturbed dK-distributions. For the clustering algorithms, we measure within-cluster homogeneity using the sum of absolute error [7], defined as \(SAE = \sum _{i=1}^{N} \sum _{x_j \in c_{i}} |x_{j} - \mu _{i}|\), where \(c_{i}\) is the set of tuples in cluster i and \(\mu _{i}\) is the mean of cluster i.
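For instance, SAE can be computed as below (our sketch; here each \(x_j\) is treated as a scalar frequency value):

```python
def sae(clusters):
    """Sum of absolute errors: total absolute deviation of each cluster's
    values from that cluster's mean (lower = more homogeneous)."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)
        total += sum(abs(x - mu) for x in c)
    return total

print(sae([[5, 5], [1, 1]]))  # 0.0  (perfectly homogeneous clusters)
print(sae([[5, 1], [5, 1]]))  # 8.0  (heterogeneous clusters)
```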
Experimental Results. To verify the overall utility of \(\varepsilon \)-differentially private dK-distributions, we first conducted experiments to compare the structural error between original and perturbed dK-distributions generated by our algorithms MDAV-dK and MPDC-dK and the baseline method \(\varepsilon \)-DP. Figure 3 presents our experimental results. For \(\varepsilon \)-DP, we used the privacy parameters \(\varepsilon = [0.01, 0.1, 1.0, 10.0]\), which cover the range of differential privacy levels widely used in the literature [12]. The results for \(\varepsilon \)-DP are displayed as horizontal lines, as \(\varepsilon \)-DP does not depend on the parameters k and \(\tau \).
From Fig. 3, we can see that, for all three datasets, our proposed algorithms MDAV-dK and MPDC-dK lead to less structural error for every value of \(\varepsilon \) as compared to \(\varepsilon \)-DP. This is because, by approximating a query \(\gamma \) with \(\gamma \circ \mathcal {M}\) via dK-microaggregation, the errors caused by the random noise required to achieve \(\varepsilon \)-differential privacy are reduced significantly. Thus, dK-microaggregation introduces less overall noise to achieve differential privacy.
We then conducted experiments to compare the quality of clusters, in terms of within-cluster homogeneity, generated by MDAV-dK and MPDC-dK. The results are shown in Tables 1 and 2. We observe that, for values of k and \(\tau \) at which MDAV-dK and MPDC-dK generate almost the same number of clusters, as highlighted in bold, MPDC-dK outperforms MDAV-dK by producing clusters with less SAE over all three datasets. This is consistent with the discussion in Sect. 4: since MPDC-dK always partitions degree pairs under a distance constraint rather than a fixed-size constraint, it generates more homogeneous clusters than MDAV-dK.
Discussion. We analyze the trade-offs between utility and privacy of dK-graphs generated in the proposed framework. To enhance the utility of differentially private dK-graphs, we approximated an original query \(\gamma \) by \(\gamma \circ \mathcal {M}\). This introduces two kinds of errors: one is the random noise required to guarantee \(\varepsilon \)-differential privacy, and the other is due to microaggregation. We have noticed that the second kind of error can be reduced by generating homogeneous clusters during microaggregation. On the other hand, the first kind of error, which depends on the sensitivity of \(\gamma \circ \mathcal {M}\), dominates the impact on the utility of differentially private dK-graphs generated via dK-microaggregation. By reducing sensitivity, we can increase the utility of dK-graphs without compromising privacy.
7 Related Work
Graph data anonymization has been widely studied in the literature, and many anonymization techniques [1, 11, 14, 24] have been proposed to enforce privacy over graph data. These techniques can be broadly categorized into three areas: node and edge perturbation, k-anonymity, and differential privacy. Perturbation-based approaches follow certain principles to process nodes and edges, including identity removal [14], edge modification [23], node clustering [11], and so on. Generally, k-anonymity approaches divide an original graph into blocks of size at least k so that the probability that an adversary can re-identify a node's identity is at most 1/k. Popular k-anonymity approaches for graph anonymization include k-candidate [11], k-neighborhood anonymity (k-NA) [24], k-degree anonymity (k-DA) [14], k-automorphism, and k-isomorphism (k-iso) [1].
Differential privacy on graph data can be roughly divided into two categories, namely node differential privacy [3] and edge differential privacy [13]. In general, unlike k-anonymity, differential privacy approaches have mathematical proofs of their privacy guarantees. Nevertheless, applying differential privacy on graph data limits utility, because graph data is highly sensitive to structural changes and adding noise directly into graph data can significantly degrade its utility. To address this issue, many approaches [19,20,21,22] perturb various statistical information of a graph by projecting graph data into other domains using feature-abstraction models [2, 16]. This idea is appealing; however, it yields less data utility due to injecting random noise based on the global sensitivity to guarantee \(\varepsilon \)-differential privacy. Our aim is to anonymize graphs under \(\varepsilon \)-differential privacy using less sensitive queries. In this regard, we proposed a microaggregation-based framework which reduces the sensitivity via microaggregation, thus reducing the overall noise needed to achieve \(\varepsilon \)-differentially private graphs.
8 Conclusion
In this paper, we have formalized a general microaggregation-based framework for anonymizing graphs that preserves the utility of dK-graphs while enforcing \(\varepsilon \)-differential privacy. Based on the proposed framework, we have proposed an algorithm for microaggregating dK-distributions under a distance constraint. We have theoretically analyzed the privacy property of our framework and the complexity of our algorithm. The effectiveness of our work has been empirically verified over three real-world datasets. Future extensions of this work will consider zero-knowledge privacy (ZKP) [8] to release statistics about social groups in a network while protecting the privacy of individuals.
Notes
 1.
polbooks is available at http://networkrepository.com/polbooks.php; ca-GrQc and ca-HepTh are available at http://snap.stanford.edu/data/index.html.
References
Cheng, J., Fu, A.W.C., Liu, J.: K-isomorphism: privacy preserving network publication against structural attacks. In: SIGMOD, pp. 459–470 (2010)
Clauset, A., Moore, C., Newman, M.E.: Hierarchical structure and the prediction of missing links in networks. Nature 453(7191), 98–101 (2008)
Day, W.Y., Li, N., Lyu, M.: Publishing graph degree distribution with node differential privacy. In: SIGMOD, pp. 123–138 (2016)
Domingo-Ferrer, J., Torra, V.: Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov. 11(2), 195–212 (2005). https://doi.org/10.1007/s10618-005-0007-5
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Dwork, C., Roth, A., et al.: The algorithmic foundations of differential privacy. FnTTCS 9(3–4), 211–407 (2014)
Estivill-Castro, V., Yang, J.: Fast and robust general purpose clustering algorithms. In: Mizoguchi, R., Slaney, J. (eds.) PRICAI 2000. LNCS (LNAI), vol. 1886, pp. 208–218. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44533-1_24
Gehrke, J., Lui, E., Pass, R.: Towards privacy for social networks: a zero-knowledge based definition of privacy. In: Ishai, Y. (ed.) TCC 2011. LNCS, vol. 6597, pp. 432–449. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19571-6_26
Gjoka, M., Kurant, M., Markopoulou, A.: 2.5K-graphs: from sampling to generation. In: INFOCOM, pp. 1968–1976 (2013)
Hay, M., Li, C., Miklau, G., Jensen, D.: Accurate estimation of the degree distribution of private networks. In: ICDM, pp. 169–178 (2009)
Hay, M., Miklau, G., Jensen, D., Towsley, D., Weis, P.: Resisting structural re-identification in anonymized social networks. In: PVLDB, pp. 102–114 (2008)
Iftikhar, M., Wang, Q., Lin, Y.: Publishing differentially private datasets via stable microaggregation. In: EDBT, pp. 662–665 (2019)
Jorgensen, Z., Yu, T., Cormode, G.: Publishing attributed social graphs with formal privacy guarantees. In: SIGMOD, pp. 107–122 (2016)
Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: SIGMOD, pp. 93–106 (2008)
Mahadevan, P., Hubble, C., Krioukov, D., Huffaker, B., Vahdat, A.: Orbis: rescaling degree correlations to generate annotated internet topologies. In: SIGCOMM, pp. 325–336 (2007)
Mahadevan, P., Krioukov, D., Fall, K., Vahdat, A.: Systematic topology analysis and generation using degree correlations. In: SIGCOMM, pp. 135–146 (2006)
McSherry, F.D.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: SIGMOD, pp. 19–30 (2009)
Proserpio, D., Goldberg, S., McSherry, F.: Calibrating data to sensitivity in private data analysis. In: PVLDB, pp. 637–648 (2014)
Sala, A., Zhao, X., Wilson, C., Zheng, H., Zhao, B.Y.: Sharing graphs using differentially private graph models. In: SIGCOMM, pp. 81–98 (2011)
Wang, Y., Wu, X.: Preserving differential privacy in degree-correlation based graph generation. Trans. Data Priv. 6(2), 127–145 (2013)
Wang, Y., Wu, X., Wu, L.: Differential privacy preserving spectral graph analysis. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 329–340. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_28
Xiao, Q., Chen, R., Tan, K.L.: Differentially private network data release via structural inference. In: SIGKDD. pp. 911–920 (2014)
Ying, X., Wu, X.: Randomizing social networks: a spectrum preserving approach. In: SDM, pp. 739–750 (2008)
Zhou, B., Pei, J.: Preserving privacy in social networks against neighborhood attacks. In: ICDE, pp. 506–515 (2008)
Iftikhar, M., Wang, Q., Lin, Y. (2020). dK-Microaggregation: Anonymizing Graphs with Differential Privacy Guarantees. In: Lauw, H., Wong, R.C.-W., Ntoulas, A., Lim, E.-P., Ng, S.-K., Pan, S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol 12085. Springer, Cham. https://doi.org/10.1007/978-3-030-47436-2_15