Robust active attacks on social graphs

In order to prevent the disclosure of privacy-sensitive data, such as names and relations between users, social network graphs have to be anonymised before publication. Naive anonymisation of social network graphs often consists in deleting all identifying information of the users, while maintaining the original graph structure. Various types of attacks on naively anonymised graphs have been developed. Active attacks form a special type of such privacy attacks, in which the adversary enrols a number of fake users, often called sybils, to the social network, allowing the adversary to create unique structural patterns later used to re-identify the sybil nodes and other users after anonymisation. Several studies have shown that adding a small amount of noise to the published graph already suffices to mitigate such active attacks. Consequently, active attacks have been dubbed a negligible threat to privacy-preserving social graph publication. In this paper, we argue that these studies unveil shortcomings of specific attacks, rather than inherent problems of active attacks as a general strategy. In order to support this claim, we develop the notion of a robust active attack, which is an active attack that is resilient to small perturbations of the social network graph. We formulate the design of robust active attacks as an optimisation problem and we give definitions of robustness for different stages of the active attack strategy. Moreover, we introduce various heuristics to achieve these notions of robustness and experimentally show that the new robust attacks are considerably more resilient than the original ones, while remaining at the same level of feasibility.


Introduction
arXiv:1811.10915v1 [cs.SI] 27 Nov 2018
Data is useful. Science heavily relies on data to (in)validate hypotheses, discover new trends, tune up mathematical and computational models, etc. In other words, data collection and analysis is helping to cure diseases, build more efficient and environmentally friendly buildings, take socially responsible decisions, and understand our needs and those of the planet where we live. Despite these indisputable benefits, it is also a fact that data contains personal and potentially sensitive information, and this is where privacy and usefulness should be considered as a whole.
A massive source of personal information is currently being handled by online social networks. People's lives are often transparently reflected on popular social network platforms, such as Facebook, Twitter and YouTube. Therefore, releasing social network data for further study comes with a commitment to ensure that users remain anonymous. Anonymity, however, is remarkably hard to achieve. Even a simple social graph, where an account consists only of a user's pseudonym and its relations to other accounts, allows users to be re-identified by just considering the number of relations they have [13].
The use of pseudonyms is insufficient to guarantee anonymity. An attacker can cross-reference information from other sources, such as the number of connections, to find out the real user behind a pseudonym. Taking into account the type of information an attacker may have, called background or prior knowledge, is thus a common practice in anonymisation models. In a social graph, the adversary's background knowledge is regarded as any subgraph that is isomorphic to a subgraph in the original social graph. Various works bound the adversary's background knowledge to a specific family of graphs. For example, Liu and Terzi assume that the adversary's background knowledge is fully defined by star graphs [13]. This models an adversary that knows the degrees of the victim vertices. Others assume that an adversary may know the subgraphs induced by the neighbours of their victims [37], and so on [38].
A rather different notion of background knowledge was introduced by Backstrom et al. [1]. They describe an adversary able to register several (fake) accounts in the network, called sybil accounts. The sybil accounts establish links between themselves and also with the victims. Therefore, in Backstrom et al.'s attack on a social graph $G = (V, E)$, the adversary's background knowledge is the induced subgraph formed by the sybil accounts in $G$ joined with the connections to the victims.
The adversary introduced by Backstrom et al. [1] is said to be active, because she influences the structure of the social network. Previous authors have claimed that active attacks are either unfeasible or detectable. Such a claim is based on two observations. First, inserting many sybil nodes is hard, and they may be detected and removed by sybil detection techniques [21]. Second, active attacks have been reported to suffer from low resilience, in the sense that the attacker's ability to successfully recover the sybil subgraph and re-identify the victims is easily lost after a relatively small number of (even arbitrary) changes are introduced in the network [11,17,18,19]. As a consequence, active attacks have been largely overlooked in the literature. Backstrom et al. argue for the feasibility of active attacks, showing that proportionally few sybil nodes (in the order of $\log n$ nodes for networks of order $n$) are sufficient for compromising any legitimate node. This feature of active attacks is relevant in view of the fact that sybil defence mechanisms do not attempt to remove every sybil node, but to limit their number to no more than $\log n$ [35,34], which entails that sufficiently capable sybil subgraphs are likely to go unnoticed by sybil defences. The second claim, that of lack of resilience to noisy releases, is the main focus of this work.
Contributions. In this paper we show that active attacks do constitute a serious threat for privacy-preserving publication of social graphs. We do so by proposing the first active attack strategy that features two key properties. Firstly, it can effectively re-identify users with a small number of sybil accounts. Secondly, it is resilient, in the sense that it resists not only the introduction of reasonable amounts of noise in the network, but also the application of anonymisation algorithms specifically designed to counteract active attacks. The new attack strategy is based on new notions of robustness for the sybil subgraph and the set of fingerprints, as well as noise-tolerant algorithms for sybil subgraph retrieval and re-identification. The comparison of the robust active attack strategy to the original active attack is facilitated by the introduction of a novel framework of study, which views an active attack as an attacker-defender game.
The remainder of this paper is structured as follows. We enunciate our adversarial model in the form of an attacker-defender game in Section 2. Then, the new notions of robustness are introduced in Section 3, and their implementation is discussed in Section 4. Finally, we experimentally evaluate our proposal in Section 5, discuss related work in Section 6, and give our conclusions in Section 7.

Adversarial model
We design a game between an attacker $A$ and a defender $D$. The goal of the attacker is to identify the victim nodes after pseudonymisation and transformation of the graph by the defender. We first introduce the necessary graph theoretical notation, and then formulate the three stages of the attacker-defender game.

Notation and terminology
We use the following standard notation and terminology. Additional notation that may be needed in other sections of the paper will be introduced as needed.
• A graph $G$ is represented as a pair $(V, E)$, where $V$ is a set of vertices (also called nodes) and $E \subseteq V \times V$ is a set of edges. The vertices of $G$ are denoted by $V_G$ and its edges by $E_G$. As we will only consider undirected graphs, we will consider an edge $(v, w)$ as an unordered pair. We will use the notation $\mathcal{G}$ for the set of all graphs.
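• An isomorphism between two graphs $G = (V, E)$ and $G' = (V', E')$ is a bijective function $\phi : V \to V'$ such that, for all $v_1, v_2 \in V$, $(v_1, v_2) \in E \iff (\phi(v_1), \phi(v_2)) \in E'$. Two graphs are isomorphic, denoted by $G \simeq_\phi G'$, or briefly $G \simeq G'$, if there exists an isomorphism $\phi$ between them. Given a subset of vertices $S \subseteq V$, we will often use $\phi S$ to denote the set $\{\phi(v) \mid v \in S\}$.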
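• The set of neighbours of a set of nodes $W \subseteq V$ in $G$ is denoted by $N_G(W)$; for a single vertex $w$, we write $N_G(w)$ for $N_G(\{w\})$.
• Let $G = (V, E)$ be a graph and let $S \subseteq V$. The weakly-induced subgraph of $S$ in $G$, denoted by $\langle S \rangle^w_G$, is the subgraph of $G$ with vertices $S \cup N_G(S)$ and edges $\{(v, v') \in E \mid v \in S \vee v' \in S\}$.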
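The notation above can be illustrated with a small Python sketch. It assumes an undirected graph stored as a set of two-element `frozenset` edges, and it assumes that the neighbourhood of a vertex set excludes the set itself and that the weakly-induced subgraph keeps every edge with at least one endpoint in the set; the function names are illustrative and are not taken from the paper.

```python
def neighbours(edges, nodes):
    """Vertices outside `nodes` that share an edge with some vertex in `nodes`."""
    nodes = set(nodes)
    found = set()
    for edge in edges:
        if edge & nodes:           # the edge touches the given set
            found |= edge - nodes  # keep only the endpoint(s) outside it
    return found


def weakly_induced(edges, nodes):
    """Vertex set and edge set of the subgraph weakly induced by `nodes`."""
    nodes = set(nodes)
    kept_edges = {edge for edge in edges if edge & nodes}
    return nodes | neighbours(edges, nodes), kept_edges


# Path a - b - c - d, used only as a toy example.
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d")]}
vertices_b, edges_b = weakly_induced(E, {"b"})
```

For the path above, the subgraph weakly induced by `{"b"}` contains the vertices a, b, c and the two edges incident to b, but not the edge between c and d.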

The attacker-defender game
The attacker-defender game starts with a graph $G = (V, E)$ representing a snapshot of a social network. The attacker knows a subset of the users, but not the connections between them. An example is shown in Figure 1(a), where capital letters represent the real identities of the users and dotted lines represent the relations existing between them, which are not known to the adversary. Before a pseudonymised graph is released, the attacker manages to enrol sybil accounts in the network and establish links with the victims, as depicted in Figure 1(b), where sybil accounts are represented by dark-coloured nodes and the edges known to the adversary (because they were created by her) are represented by solid lines. The goal of the attacker is to later re-identify the victims in order to learn information about them. The defender anonymises the social graph by removing the real user identities, or replacing them with pseudonyms, and possibly perturbing the graph. In Figure 1(c) we illustrate the pseudonymisation process of the graph in Figure 1(b). The pseudonymised graph contains information that the attacker wishes to know, such as the existence of relations between users, but the adversary cannot directly learn this information, as the identities of the vertices are hidden. Thus, after the pseudonymised graph is published, the attacker analyses the graph to first re-identify her own sybil accounts, and then the victims (see Figure 1(d)). This allows her to acquire new information, which was supposed to remain private, such as the fact that E and F are friends.
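Next, we formalise the three stages of the attacker-defender game, assuming a graph $G = (V, E)$.
1. Attacker subgraph creation. The attacker adds a set $S$ of sybil nodes to $G$, together with edges among the sybil nodes and between sybil nodes and victims, which results in the graph $G^+$. For the set of victim vertices $Y = \{y_1, \ldots, y_m\}$, the attacker ensures that $N_{G^+}(y_i) \cap S \neq N_{G^+}(y_j) \cap S$ for every $y_i, y_j \in Y$, $i \neq j$.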

2. Anonymisation. The defender obtains $G^+$ and constructs an isomorphism $\phi$ from $G^+$ to $\phi G^+$. We call $\phi G^+$ the pseudonymised graph. The purpose of pseudonymisation is to remove all personally identifiable information from the vertices of $G$. Next, given a non-deterministic procedure $t$ that maps graphs to graphs, known by both $A$ and $D$, the defender applies transformation $t$ to $\phi G^+$, resulting in the transformed graph $t(\phi G^+)$. The procedure $t$ modifies $\phi G^+$ by adding and/or removing vertices and/or edges.
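3. Re-identification. After obtaining $t(\phi G^+)$, the attacker executes the re-identification attack in two stages.
(a) Attacker subgraph retrieval. Determine the isomorphism $\phi$ restricted to the domain of sybil nodes $S$.
(b) Fingerprint matching. Determine the isomorphism $\phi$ restricted to the domain of victim nodes $\{y_1, y_2, \ldots, y_m\}$.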
As established by the last step of the attacker-defender game, we consider the adversary to succeed if she effectively determines the isomorphism $\phi$ restricted to the domain of victim nodes $\{y_1, y_2, \ldots, y_m\}$, that is, when the adversary re-identifies all victims in the anonymised graph.
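The defender's side of the game can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the graph below is hypothetical (only the fingerprints of E and F follow the description of Figure 1(b)), the pseudonyms are integers, and the transformation `perturb` simply flips uniformly chosen vertex pairs, which is just one possible choice for the procedure $t$.

```python
import random


def pseudonymise(vertices, edges, rng):
    """Return a random bijection phi onto integer pseudonyms and the relabelled edges."""
    names = sorted(vertices)
    labels = list(range(len(names)))
    rng.shuffle(labels)
    phi = dict(zip(names, labels))
    return phi, {frozenset(phi[u] for u in edge) for edge in edges}


def perturb(vertices, edges, flips, rng):
    """Toy transformation t: flip (add or remove) `flips` randomly chosen vertex pairs."""
    pool = sorted(vertices)
    perturbed = set(edges)
    for _ in range(flips):
        perturbed ^= {frozenset(rng.sample(pool, 2))}
    return perturbed


rng = random.Random(0)
V = {"E", "F", "s1", "s2", "s3"}  # two victims and three sybil nodes
G_plus = {frozenset(p) for p in [("E", "F"), ("s1", "s2"), ("s2", "s3"),
                                 ("s1", "F"), ("s2", "E"), ("s3", "E")]}
phi, pseudonymised = pseudonymise(V, G_plus, rng)
released = perturb(phi.values(), pseudonymised, flips=1, rng=rng)
```

The attacker only sees `released`; her task in the third stage is to recover `phi` on the sybil nodes first, and then on the victims.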

Robust active attacks
This section formalises robust active attacks. We provide mathematical formulations, in the form of optimisation problems, of the attacker's goals in the first and third stages. In particular, we address three of the subtasks that need to be accomplished in these stages: fingerprint creation, attacker subgraph retrieval and fingerprint matching.

Robust fingerprint creation
Active attacks, in their original formulation [1], aimed at re-identifying victims in pseudonymised graphs. Consequently, the uniqueness of every fingerprint was sufficient to guarantee success with high probability, provided that the attacker subgraph is correctly retrieved. Moreover, as shown in [1], several types of randomly generated attacker subgraphs can indeed be correctly and efficiently retrieved, with high probability, after pseudonymisation. The low resilience reported for this approach when the pseudonymised graph is perturbed by applying an anonymisation method [17,18,19] or by introducing arbitrary changes [11] comes from the fact that it relies on finding exact matches between the fingerprints created by the attacker at the first stage of the attack and their images in $t(\phi G^+)$. The attacker's ability to find such exact matches is lost even after a relatively small number of perturbations is introduced by $t$.
Our observation is that setting for the attacker the goal of obtaining the exact same fingerprints in the perturbed graph is not only too strong but, more importantly, not necessary. Instead, we argue that it is sufficient for the attacker to obtain a set of fingerprints that is close enough to the original set of fingerprints, for some notion of closeness. Given that a fingerprint is a set of vertices, we propose to use the cardinality of the symmetric difference of two sets to measure the distance between fingerprints. The symmetric difference between two sets $X$ and $Y$, denoted by $X \triangledown Y$, is the set of elements in $X \cup Y$ that are not in $X \cap Y$. We use $d(X, Y)$ to denote $|X \triangledown Y|$.
Our goal at this stage of the attack is to create a set of fingerprints satisfying the following property.
Definition 1 (Robust set of fingerprints). Given a set of victims $\{y_1, \ldots, y_m\}$ and a set of sybil nodes $S$ in a graph $G^+$, the set of fingerprints $\{F_1, \ldots, F_m\}$ with $F_i = N_{G^+}(y_i) \cap S$ is said to be robust if it maximises
$$\min_{1 \le i < j \le m} d(F_i, F_j). \qquad (1)$$
The property above ensures that the lower bound on the distance between any pair of fingerprints is maximal. In what follows, we will refer to the lower bound defined by Equation (1) as the minimum separation of a set of fingerprints. For example, in Figure 1(b), the fingerprint of the vertex E with respect to the set of attacker vertices $\{1, 2, 3\}$ is $\{2, 3\}$, and the fingerprint of the vertex F is $\{1\}$. This gives a minimum separation between the two victims' fingerprints equal to $|\{2, 3\} \triangledown \{1\}| = |\{1, 2, 3\}| = 3$, which is maximum. Therefore, given attacker vertices $\{1, 2, 3\}$, the set of fingerprints $\{\{2, 3\}, \{1\}\}$ is robust for the set of victim nodes $\{E, F\}$.
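The distance and the minimum separation are straightforward to compute. The following minimal sketch uses Python sets for fingerprints; the function names are illustrative, and the values reproduce the Figure 1(b) example discussed above.

```python
from itertools import combinations


def d(x, y):
    """Distance between two fingerprints: size of their symmetric difference."""
    return len(set(x) ^ set(y))


def minimum_separation(fingerprints):
    """Smallest pairwise distance within a collection of at least two fingerprints."""
    return min(d(a, b) for a, b in combinations(fingerprints, 2))


# Fingerprints of E and F with respect to the attacker vertices {1, 2, 3}.
separation = minimum_separation([{2, 3}, {1}])
```

Here `separation` equals 3, the largest distance attainable with three sybil nodes.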
Next we prove that, if the distance between each original fingerprint $F$ and the corresponding anonymised fingerprint is less than half the minimum separation, then the distance between $F$ and any other anonymised fingerprint is strictly larger than half the minimum separation.
We exploit Theorem 1 later in the fingerprint matching step through the following corollary. Let $\delta$ be the minimum separation of the original fingerprints and let $F'_i$ denote the fingerprint of $\phi(y_i)$ in the perturbed graph. If the distance shift from every original fingerprint $F_i$ of $y_i$ to $F'_i$ is smaller than $\delta/2$, then for every $F \in \{F'_1, \ldots, F'_m\}$ it holds that $d(F, \phi F_i) < \delta/2 \iff F = F'_i$. In other words, given a set of victims for which a set of fingerprints needs to be defined, the larger the minimum separation of these fingerprints, the larger the number of perturbations that can be tolerated in $t(\phi G^+)$, while still being able to match the perturbed fingerprints to their correct counterparts in $G^+$.
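The separation argument can be checked exhaustively on small instances. The sketch below is only a sanity check, not a proof: it assumes the sybil nodes have already been re-identified, so original and perturbed fingerprints are expressed over the same labels, and it enumerates every possible perturbed fingerprint. The helper names are illustrative.

```python
from itertools import combinations


def d(x, y):
    """Size of the symmetric difference between two fingerprints."""
    return len(set(x) ^ set(y))


def all_subsets(items):
    """Every subset of `items`, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def separation_holds(fingerprints, sybils):
    """Check: a perturbed fingerprint closer than delta/2 to F_i is farther than
    delta/2 from every other original fingerprint."""
    delta = min(d(a, b) for a, b in combinations(fingerprints, 2))
    for i, original in enumerate(fingerprints):
        for perturbed in all_subsets(sybils):
            if 2 * d(original, perturbed) < delta:
                for j, other in enumerate(fingerprints):
                    if j != i and not 2 * d(other, perturbed) > delta:
                        return False
    return True


ok = separation_holds([{1, 2, 3}, {4, 5, 6}], range(1, 7))
```

The check succeeds for any input, since it is a direct consequence of the triangle inequality for the symmetric-difference distance.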

Robust attacker subgraph retrieval
Let $C$ be the set of all subgraphs isomorphic to the attacker subgraph $\langle S \rangle^w_{G^+}$ and weakly induced in $t(\phi G^+)$ by a vertex subset of cardinality $|S|$. The original active attack formulation [1] assumes that $|C| = 1$ and that the subgraph in $C$ is the image of the attacker subgraph after pseudonymisation. This assumption holds, for example, on the pseudonymised graph in Figure 1(c), but it hardly holds on perturbed graphs, as demonstrated in [11,17,18,19]. In fact, $C$ becomes empty by simply adding an edge between any pair of attacker nodes, which makes the attack fail quickly when increasing the amount of perturbation.
To account for the occurrence of perturbations in releasing $t(\phi G^+)$, we introduce the notion of robust attacker subgraph retrieval. Rather than limiting the retrieval process to finding an exact match of the original attacker subgraph, we consider that it is enough to find a sufficiently similar subgraph, thus adding some level of noise-tolerance. By "sufficiently similar", we mean a graph that minimises some graph dissimilarity measure $\Delta : \mathcal{G} \times \mathcal{G} \to \mathbb{R}^+$ with respect to $\langle S \rangle^w_{G^+}$. The problem is formulated as follows.
Definition 2 (Robust attacker subgraph retrieval problem). Given a graph dissimilarity measure $\Delta : \mathcal{G} \times \mathcal{G} \to \mathbb{R}^+$, and a set $S$ of sybil nodes in the graph $G^+$, find a set $S' \subseteq V_{t(\phi G^+)}$ that minimises
$$\Delta\left(\langle S' \rangle^w_{t(\phi G^+)}, \langle S \rangle^w_{G^+}\right). \qquad (2)$$
A number of graph (dis)similarity measures have been proposed in the literature [28,1,2,9,16]. Commonly, the choice of a particular measure is ad hoc, and depends on the characteristics of the graphs being compared. In Subsection 4.2, we will describe a measure that is efficiently computable and exploits the known structure of $\langle S \rangle^w_{G^+}$, by separately accounting for inter-sybil and sybil-to-non-sybil edges. Along with this dissimilarity measure, we provide an algorithm for constructively finding a solution to the problem enunciated in Definition 2.

Robust fingerprint matching
As established by the attacker-defender game discussed in Section 2, fingerprint matching is the last stage of the active attack. Because it clearly relies on the success of the previous steps, we make the following two assumptions upfront.
1. We assume that the robust sybil subgraph retrieval procedure succeeds, i.e. that $\phi S = S'$, where $S'$ is the set of sybil nodes obtained in the previous step.
2. Given the original set of victims $Y$, we assume that the set of vertices in the neighbourhood of $S'$ contains those in $\phi Y$, i.e. $\phi Y \subseteq N_{t(\phi G^+)}(S') \setminus S'$; otherwise $S'$ provides insufficient information to achieve the goal of re-identifying all victim vertices.
Given a potentially correct set of sybil nodes $S'$ and a set of potential victims $Y' = \{y'_1, \ldots, y'_n\} = N_{t(\phi G^+)}(S') \setminus S'$, the re-identification process consists in determining the isomorphism $\phi$ restricted to the vertices in $Y$. Next we define re-identification as an optimisation problem, and after that we provide sufficient conditions under which a solution leads to correct identification.
Definition 3 (Robust re-identification problem). Let $S$ and $S'$ be the sets of sybil nodes in the original and anonymised graph, respectively. Let $\{y_1, \ldots, y_m\}$ be the victims in $G^+$ with fingerprints $F_1 = N_{G^+}(y_1) \cap S, \ldots, F_m = N_{G^+}(y_m) \cap S$. The robust re-identification problem consists in finding an isomorphism $\varphi : S \to S'$ and a subset $\{z_1, \ldots, z_m\} \subseteq N_{t(\phi G^+)}(S') \setminus S'$ that minimise
$$\left\| \left( d(\varphi F_1, N_{t(\phi G^+)}(z_1) \cap S'), \ldots, d(\varphi F_m, N_{t(\phi G^+)}(z_m) \cap S') \right) \right\|_\infty, \qquad (3)$$
where $\|\cdot\|_\infty$ stands for the infinity norm.
Optimising the infinity norm gives the lowest upper bound on the distance between an original fingerprint and the fingerprint of a vertex in the perturbed graph. This is useful towards the goal of correctly re-identifying all victims. However, should the adversary aim at re-identifying at least one victim with high probability, then other plausible objective functions can be used, such as the Euclidean norm.
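The difference between the two objectives can be seen on a toy instance. The sketch below assumes the sybil mapping is already fixed, so fingerprints are compared over the same labels; the fingerprints and the two candidate assignments are hypothetical and chosen so that the norms disagree.

```python
import math


def d(x, y):
    """Size of the symmetric difference between two fingerprints."""
    return len(set(x) ^ set(y))


def distance_vector(mapped_originals, candidates):
    """Component-wise distances between mapped original and candidate fingerprints."""
    return [d(f, c) for f, c in zip(mapped_originals, candidates)]


def infinity_norm(vector):
    return max(vector)


def euclidean_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


originals = [{1}, {2}, {3}]
option_a = [{1}, {2}, {3, 4, 5, 6}]           # distances (0, 0, 3)
option_b = [{1, 4, 5}, {2, 4, 5}, {3, 4, 5}]  # distances (2, 2, 2)
vec_a = distance_vector(originals, option_a)
vec_b = distance_vector(originals, option_b)
```

The infinity norm prefers `option_b`, where no victim is far from its original fingerprint, whereas the Euclidean norm prefers `option_a`, where two victims are matched exactly at the cost of a poor third match.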
As stated earlier, our intention is to exploit the result of Theorem 1, provided that the distance between original and perturbed fingerprints is lower than $\delta/2$, where $\delta$ is the minimum separation of the original set of fingerprints. This is one out of three conditions that we prove sufficient to infer a correct mapping $\phi$ from a solution to the robust re-identification problem, as stated in the following result.
We use $f^{-1}$ to denote the inverse of $f$. Then, considering that $\varphi F_i = \phi F_i$ for every $i \in \{1, \ldots, m\}$ (first condition), we obtain the following equalities.
Considering Theorem 1, we obtain that for every $i, j \in \{1, \ldots, m\}$ with $i \neq j$ it holds that $d(\phi F_i, F'_j) > \delta/2$, where $F'_j$ is the fingerprint of $\phi(y_j)$ in the perturbed graph. Therefore, if $f$ is not the trivial automorphism, i.e.
However, this contradicts the optimality of the solution $\varphi$ and $\{z_1, \ldots, z_m\}$. Therefore, $f$ must be the trivial automorphism, which concludes the proof.
In Theorem 2, the first condition states that the adversary succeeded in correctly identifying each of her own sybil nodes in the perturbed graph. That is to say, the adversary retrieved the mapping $\phi$ restricted to the set of sybil nodes. This is clearly an important milestone in the attack, as the victims' fingerprints are based on this mapping. The second condition says that the neighbours of the sybil vertices remained the same after perturbation. As a result, the adversary knows that $\{z_1, \ldots, z_m\}$ is the victim set in the perturbed graph, but she does not know yet the isomorphism $\phi$ restricted to the set of victims $\{y_1, \ldots, y_m\}$. Lastly, the third condition states that $\delta/2$ is an upper bound on the distance between a victim's fingerprints in the pseudonymised graph $\phi G^+$ and the perturbed graph $t(\phi G^+)$, where $\delta$ is the minimum separation between the victims' fingerprints. In other words, the transformation method did not perturb a victim's fingerprint "too much". If those three conditions hold, Theorem 2 shows that the isomorphism $\phi$ restricted to the set of victims $\{y_1, \ldots, y_m\}$ is the trivial isomorphism onto $\{z_1, \ldots, z_m\}$.
Summing up: in this section we have enunciated the three problem formulations for robust active attacks, namely:
• Creating a robust set of fingerprints.
• Robustly retrieving the attacker subgraph in the perturbed graph.
• Robustly matching the original fingerprints to perturbed fingerprints.
Additionally, we have defined a set of conditions under which finding a solution for these problems guarantees a robust active attack to be successful. Each of the three enunciated problems has been stated as an optimisation task. Since obtaining exact solutions to these problems is computationally expensive, in the next section we introduce heuristics for finding approximate solutions.

Heuristics for an approximate instance of the robust active attack strategy
In this section we present the techniques for creating an instance of the robust active attack strategy described in the previous section. Since finding exact solutions to the optimisation problems in Equations (1), (2) and (3) is computationally expensive, we provide efficient approximate heuristics.

Attacker subgraph creation
For creating the internal links of the sybil subgraph, we will use the same strategy as the so-called walk-based attack [1], which is the most widely-studied instance of the original active attack strategy. By doing so, we make our new attack as (un)likely as the original to have the set of sybil nodes removed by sybil defences. Thus, for the set of sybil nodes $S$, the attack will set an arbitrary (but fixed) order among the elements of $S$. Let $x_1, x_2, \ldots, x_{|S|}$ represent the vertices of $S$ in that order. The attack will firstly create the path $x_1 x_2 \ldots x_{|S|}$, whereas the remaining inter-sybil edges are independently created with probability 0.5.
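The inter-sybil edge creation just described can be sketched as follows. The function name and the `p` parameter are illustrative; `p` defaults to the probability 0.5 stated above, and edges are stored as two-element `frozenset` objects.

```python
import random


def create_sybil_edges(sybils, rng, p=0.5):
    """Path x_1 x_2 ... x_|S| plus every other sybil pair independently with probability p."""
    count = len(sybils)
    # Mandatory path edges between consecutive sybil nodes.
    edges = {frozenset((sybils[i], sybils[i + 1])) for i in range(count - 1)}
    # Every non-consecutive pair is added independently with probability p.
    for i in range(count):
        for j in range(i + 2, count):
            if rng.random() < p:
                edges.add(frozenset((sybils[i], sybils[j])))
    return edges


order = ["x1", "x2", "x3", "x4", "x5"]
sybil_edges = create_sybil_edges(order, random.Random(7))
```

Whatever the random draws, the path edges are always present, which is what later allows the attacker to search for the sybil nodes one position at a time.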
For creating the set of fingerprints, we will apply a greedy algorithm for maximising the minimum separation defined in Equation (1). The idea behind the algorithm is to arrange all possible fingerprints in a grid-like auxiliary graph, in such a way that nodes representing similar fingerprints are linked by an edge, and nodes representing well-separated fingerprints are not. Looking for a set of maximally separated fingerprints in this graph reduces to a well-known problem in graph theory, namely that of finding an independent set. An independent set $I$ of a graph $G$ is a subset of vertices from $G$ such that $E_{\langle I \rangle_G} = \emptyset$, that is, all vertices in $I$ are pairwise not linked by edges. If the graph is constructed in such a way that every pair of fingerprints whose distance is less than or equal to some value $i$ is linked by an edge, then an independent set represents a set of fingerprints with a guaranteed minimum separation of at least $i + 1$. For example, the fingerprint graph shown in Figure 2 (a) represents the set of fingerprints $\{\{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$, with edges linking all pairs $X, Y$ of fingerprints such that $d(X, Y) \le 1$, whereas Figure 2 (b) represents an analogous graph where edges link all pairs $X, Y$ of fingerprints such that $d(X, Y) \le 2$. Note that the vertex set of both graphs is the power set of $\{1, 2, 3\}$, except for the empty set, which does not represent a valid fingerprint, as every victim must be linked to at least one sybil node. A set of fingerprints built from an independent set of the first graph may have minimum separation 2 (e.g. $\{\{1\}, \{2\}, \{1, 2, 3\}\}$) or 3 (e.g. $\{\{1, 3\}, \{2\}\}$), whereas a set of fingerprints built from an independent set of the second graph will have minimum separation 3 (the independent sets of this graph are $\{\{1\}, \{2, 3\}\}$, $\{\{1, 2\}, \{3\}\}$ and $\{\{1, 3\}, \{2\}\}$).
Our fingerprint generation method iteratively creates increasingly denser fingerprint graphs. The vertex set of every graph is the set of possible fingerprints, i.e. all subsets of $S$ except the empty set. In the $i$-th graph, every pair of nodes $X, Y$ such that $d(X, Y) \le i$ will be linked by an edge. Thus, an independent set of this graph will be composed of nodes representing fingerprints whose minimum separation is at least $i + 1$. A maximal independent set of the fingerprint graph is computed in every iteration, to have an approximation of a maximum-cardinality set of uniformly distributed fingerprints with minimum separation at least $i + 1$. For example, in the graph of Figure 2 (a), the method will find $\{\{1\}, \{2\}, \{3\}, \{1, 2, 3\}\}$ as a maximum-cardinality set of uniformly distributed fingerprints with minimum separation 2; whereas for the graph of Figure 2 (b), the method will find, for instance, $\{\{1\}, \{2, 3\}\}$ as a maximum-cardinality set of uniformly distributed fingerprints with minimum separation 3. The method iterates until finding the smallest maximal independent set that still contains at least $m$ fingerprints, that is, at least as many fingerprints as original victims, so a different fingerprint can be assigned to every victim. If this set does not contain exactly $m$ fingerprints, it is used as a pool, from which successive runs of the attack randomly draw $m$ fingerprints. Algorithm 1 lists the pseudo-code of this method.
In Algorithm 1, the order of every fingerprint graph is $2^{|S|} - 1$, so building its edge set by comparing all pairs of fingerprints takes $O(2^{2|S|})$ time. Moreover, the greedy algorithm for finding a maximal independent set runs in quadratic time with respect to the order of the graph, so in this case its time complexity is also $O(2^{2|S|})$. Finally, since the maximum possible distance between a pair of fingerprints is $|S|$, the worst-case time complexity of Algorithm 1 is $O(|S| \cdot 2^{2|S|})$. This worst case occurs when the number of victims is very small, as the number of times that steps 4 to 9 of the algorithm are repeated is more likely to approach $|S|$. While this time complexity may appear excessive at first glance, we must consider that, for a social graph of order $n$, the algorithm will be run for sets of sybil nodes having at most cardinality $|S| = \log_2 n$. Thus, in terms of the order of the social graph, the worst-case running time will be $O(n^2 \log_2 n)$.
Algorithm 1. Given a set $S$ of sybil nodes, compute a uniformly distributed set of fingerprints of size at least $m$.
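The iterative procedure described above can be sketched in Python as follows. This is a simplified reading, not the paper's Algorithm 1: it assumes a greedy scan of the fingerprints ordered by size and then lexicographically, and instead of materialising the $i$-th fingerprint graph it uses the equivalent test that a fingerprint can join the independent set only if its distance to every chosen fingerprint is at least $i + 1$. The function names are illustrative.

```python
from itertools import combinations


def d(x, y):
    """Size of the symmetric difference between two fingerprints."""
    return len(x ^ y)


def greedy_separated_set(candidates, min_sep):
    """Greedy maximal set with pairwise distance >= min_sep, i.e. a maximal
    independent set of the graph linking fingerprints at distance < min_sep."""
    chosen = []
    for candidate in candidates:
        if all(d(candidate, kept) >= min_sep for kept in chosen):
            chosen.append(candidate)
    return chosen


def fingerprint_pool(sybils, m):
    """Pool of at least m fingerprints with the largest separation the greedy scan reaches."""
    sybils = sorted(sybils)
    candidates = [frozenset(c) for r in range(1, len(sybils) + 1)
                  for c in combinations(sybils, r)]
    if m > len(candidates):
        raise ValueError("not enough distinct non-empty fingerprints")
    pool = candidates  # separation 1: every non-empty subset of the sybil set
    for i in range(1, len(sybils) + 1):
        separated = greedy_separated_set(candidates, i + 1)
        if len(separated) < m:
            break
        pool = separated
    return pool


pool_two = fingerprint_pool({1, 2, 3}, 2)
pool_three = fingerprint_pool({1, 2, 3}, 3)
```

With three sybil nodes, two victims yield the pool $\{\{1\}, \{2, 3\}\}$ (separation 3), and three victims yield $\{\{1\}, \{2\}, \{3\}, \{1, 2, 3\}\}$ (separation 2), which agrees with the Figure 2 examples. When the pool is larger than $m$, an attack run would draw $m$ fingerprints from it at random.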

Attacker subgraph retrieval
As discussed in Subsection 3.2, in the original formulation of active attacks, the sybil retrieval phase is based on the assumption that the attacker subgraph can be uniquely and exactly matched to a subgraph of the released graph. This assumption is relaxed by the formulation of robust attacker subgraph retrieval given in Definition 2, which accounts for the possibility that the attacker subgraph has been perturbed. The problem formulation given in Definition 2 requires a dissimilarity measure $\Delta$ to compare candidate subgraphs to the original attacker subgraph. We will introduce such a measure in this section. Moreover, the problem formulation requires searching the entire power set of $V_{t(\phi G^+)}$, which is infeasible in practice. In order to reduce the size of the search space, we will establish a perturbation threshold $\vartheta$, and the search procedure will discard any candidate vertex set $X$ such that $\Delta(\langle X \rangle^w_{t(\phi G^+)}, \langle S \rangle^w_{G^+}) > \vartheta$.
We now define the dissimilarity measure $\Delta$ that will be used. To that end, some additional notation will be necessary. For a graph $H$, a vertex set $V \subseteq V_H$, and a complete order $\prec \subseteq V \times V$, we will define the vector $\mathbf{v}_\prec = (v_{i_1}, v_{i_2}, \ldots, v_{i_{|V|}})$ as the one satisfying $v_{i_1} \prec v_{i_2} \prec \ldots \prec v_{i_{|V|}}$. When the order $\prec$ is fixed or clear from the context, we will simply refer to $\mathbf{v}_\prec$ as $\mathbf{v}$. Moreover, for the sake of simplicity in presentation, we will in some cases abuse notation and use $\mathbf{v}$ for $V$, $\langle \mathbf{v} \rangle^w_H$ for $\langle V \rangle^w_H$, and so on. The search procedure assumes the existence of a fixed order $\prec_S$ on the original set of sybil nodes $S$, which is established at the attacker subgraph creation stage, as discussed in Subsection 4.1. In what follows, we will use the notation $\mathbf{s} = (x_1, x_2, \ldots, x_{|S|})$ for the vector $\mathbf{s}_{\prec_S}$.
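Given the original attacker subgraph $\langle S \rangle^w_{G^+}$ and a subgraph of $t(\phi G^+)$ weakly induced by a candidate vector $\mathbf{v} = (v_1, v_2, \ldots, v_{|S|})$, the dissimilarity measure $\Delta$ will compare $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$ to $\langle S \rangle^w_{G^+}$ according to the following criteria:
• The set of inter-sybil edges of $\langle S \rangle^w_{G^+}$ will be compared to that of $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$. This is equivalent to comparing $E(\langle S \rangle_{G^+})$ and $E(\langle \mathbf{v} \rangle_{t(\phi G^+)})$. To that end, we will apply to $\langle S \rangle_{G^+}$ the isomorphism $\phi' : \langle S \rangle_{G^+} \to \langle \mathbf{v} \rangle_{t(\phi G^+)}$, which makes $\phi'(x_i) = v_i$ for every $i \in \{1, \ldots, |S|\}$. The contribution of inter-sybil edges to $\Delta$ will thus be defined as $|E(\phi' \langle S \rangle_{G^+}) \triangledown E(\langle \mathbf{v} \rangle_{t(\phi G^+)})|$, that is, the size of the symmetric difference between the edge sets of $\phi' \langle S \rangle_{G^+}$ and $\langle \mathbf{v} \rangle_{t(\phi G^+)}$.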
• The set of sybil-to-non-sybil edges of $\langle S \rangle^w_{G^+}$ will be compared to that of $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$. Unlike the previous case, where the orders $\prec_S$ and $\prec_{\mathbf{v}}$ allow us to define a trivial isomorphism between the induced subgraphs, in this case creating the appropriate matching would be equivalent to solving the re-identification problem for every candidate subgraph, which is considerably inefficient. Consequently, we introduce a relaxed criterion, based on the number of non-sybil neighbours of every sybil node, which we refer to as its marginal degree. The marginal degree of a sybil node $x \in S$ is thus defined as $\delta_{\langle S \rangle^w_{G^+}}(x) = |N_{\langle S \rangle^w_{G^+}}(x) \setminus S|$. By analogy, for a vertex $v \in \mathbf{v}$, we define $\delta_{\langle \mathbf{v} \rangle^w_{t(\phi G^+)}}(v) = |N_{\langle \mathbf{v} \rangle^w_{t(\phi G^+)}}(v) \setminus \mathbf{v}|$. Finally, the contribution of sybil-to-non-sybil edges to $\Delta$ will be defined as
• The dissimilarity measure combines the previous criteria as follows:
Figure 3 shows an example of the computation of this dissimilarity measure, with $\mathbf{s} = (x_1, x_2, x_3, x_4, x_5)$ and $\mathbf{v} = (v_1, v_2, v_3, v_4, v_5)$. It is simple to see that the value of the dissimilarity function depends on the order imposed by the vector $\mathbf{v}$: the vector $\mathbf{v}' = (v_5, v_2, v_3, v_4, v_1)$, for example, yields a different value.
The search procedure assumes that the transformation $t$ did not remove the image of any sybil node from $\phi G^+$, so it searches the set of cardinality-$|S|$ permutations of elements from $V_{t(\phi G^+)}$, respecting the tolerance threshold. The method is a breadth-first search, which analyses at the $i$-th level the possible matches to the vector $(x_1, x_2, \ldots, x_i)$ composed of the first $i$ components of $\mathbf{s}$. The tolerance threshold $\vartheta$ is used to prune the search tree. A detailed description of the procedure is shown in Algorithm 2. Ideally, the algorithm outputs a singleton set $C^* = \{(v_{j_1}, v_{j_2}, \ldots, v_{j_{|S|}})\}$, in which case the vector $\mathbf{v} = (v_{j_1}, v_{j_2}, \ldots, v_{j_{|S|}})$ is used as the input to the fingerprint matching phase, described in the following subsection. If this is not the case, and the algorithm yields $C^* = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_t\}$, the attack randomly picks an element $\mathbf{v}_i \in C^*$ and proceeds to the fingerprint matching phase. Finally, if $C^* = \emptyset$, the attack is considered to fail, as no re-identification is possible. To conclude, we point out that if Algorithm 2 is run with $\vartheta = 0$, then $C^*$ contains exactly the same candidate set that would be recovered by the attacker subgraph retrieval phase of the original walk-based attack.
Algorithm 2. Given the graphs $G^+$ and $t(\phi G^+)$, the set of original sybil nodes $S \subseteq V_{G^+}$, and the maximum distance threshold $\vartheta$, obtain the set $C^*$ of equally-likely best candidate sybil sets.
1: ▷ Find suitable candidates to match $x_1$
2: PartialCandidates
▷ Find rest of matches for candidates
16: return Breadth-First-Search(2, PartialCandidates$_1$)
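To make the retrieval step concrete, the sketch below implements an order-dependent dissimilarity and an exhaustive search on a toy instance. Two points are assumptions of this sketch rather than the paper's definitions: the sybil-to-non-sybil term is taken to be the sum of absolute differences between marginal degrees, and the two terms are combined by plain addition. The search is also a brute-force stand-in for the breadth-first procedure of Algorithms 2 and 3, so it is only usable on very small graphs. All names and the example graph are hypothetical.

```python
from itertools import permutations


def inter_sybil_term(orig_edges, s, pub_edges, v):
    """Edges among sybils, relabelled x_i -> v_i, versus edges induced by v."""
    relabel = dict(zip(s, v))
    s_set, v_set = set(s), set(v)
    mapped = {frozenset(relabel[u] for u in e) for e in orig_edges if e <= s_set}
    induced = {e for e in pub_edges if e <= v_set}
    return len(mapped ^ induced)


def marginal_degree(edges, members, x):
    """Number of neighbours of x that lie outside `members`."""
    members = set(members)
    return len({u for e in edges if x in e for u in e if u not in members})


def dissimilarity(orig_edges, s, pub_edges, v):
    """ASSUMED combination: inter-sybil term plus summed marginal-degree gaps."""
    degree_term = sum(abs(marginal_degree(orig_edges, s, x)
                          - marginal_degree(pub_edges, v, y))
                      for x, y in zip(s, v))
    return inter_sybil_term(orig_edges, s, pub_edges, v) + degree_term


def retrieve(orig_edges, s, pub_edges, pub_vertices, theta):
    """Brute-force search: all minimum-dissimilarity vectors, if within theta."""
    scored = [(dissimilarity(orig_edges, s, pub_edges, v), v)
              for v in permutations(sorted(pub_vertices), len(s))]
    best = min(score for score, _ in scored)
    return [v for score, v in scored if score == best] if best <= theta else []


s = ("x1", "x2", "x3")
original = {frozenset(p) for p in [("x1", "x2"), ("x2", "x3"),
                                   ("x1", "a"), ("x1", "b"), ("x3", "c")]}
# Released graph: x_i -> v_i, a, b, c -> p, q, r, plus one spurious edge v1-v3.
published = {frozenset(p) for p in [("v1", "v2"), ("v2", "v3"), ("v1", "v3"),
                                    ("v1", "p"), ("v1", "q"), ("v3", "r")]}
candidates = retrieve(original, s, published, {"v1", "v2", "v3", "p", "q", "r"}, theta=1)
```

With a threshold of 0 the search returns nothing, because the spurious edge rules out an exact match, which mirrors the failure mode of the original attack. With a threshold of 1 the true vector `("v1", "v2", "v3")` is among the returned candidates, although on this tiny instance it is not the only one, which corresponds to the case where the attack has to pick randomly from $C^*$.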

Fingerprint matching
Now, we describe the noise-tolerant fingerprint matching process. Let $Y = \{y_1, \ldots, y_m\} \subseteq V_{G^+}$ represent the set of victims. Let $S$ be the original set of sybil nodes and $S' \subseteq V_{t(\phi G^+)}$ a candidate obtained by the sybil retrieval procedure described above. As in the previous subsection, let $\mathbf{s} = (x_1, x_2, \ldots, x_{|S|})$ be the vector containing the elements of $S$ in the order imposed at the sybil subgraph creation stage. Moreover, let $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$, with $\mathbf{v} = (v_1, v_2, \ldots, v_{|S|}) \in C^*$, be a candidate sybil subgraph, retrieved using Algorithm 2. Finally, for every $i \in \{1, \ldots, m\}$, let $F_i \subseteq S$ be the original fingerprint of the victim $y_i$ and $\varphi F_i \subseteq \mathbf{v}$ its image by the isomorphism mapping $\mathbf{s}$ to $\mathbf{v}$.
We now describe the process for finding $Y' = \{y'_1, \ldots, y'_m\} \subseteq V_{t(\phi G^+)}$, where $y'_i = \phi(y_i)$, using $\varphi F_1, \varphi F_2, \ldots, \varphi F_m$, $\mathbf{s}$ and $\mathbf{v}$. If the perturbation $t$ had caused no damage to the fingerprints, checking for exact matches would be sufficient. Since, as previously discussed, this is usually not the case, we introduce a noise-tolerant fingerprint matching strategy that maps every original fingerprint to its most similar candidate fingerprint, within some tolerance threshold $\beta$.
Algorithm 4 describes the process for finding the set of optimal re-identifications. For a candidate victim $z \in N_{t(\phi G^+)}(\mathbf{v}) \setminus \mathbf{v}$, the algorithm denotes as $F_{z,\mathbf{v}} = N_{t(\phi G^+)}(z) \cap \mathbf{v}$ its fingerprint with respect to $\mathbf{v}$. The algorithm is a depth-first search procedure. First, the algorithm finds, for every $\varphi F_i$, $i \in \{1, \ldots, m\}$, the set of most similar candidate fingerprints, [...] and keeps the set of matches that reach the minimum distance. From these best matches, one or several partial re-identifications are obtained. The reason why more than one partial re-identification is obtained is that more than one candidate fingerprint may be equally similar to some $\varphi F_i$. For every partial re-identification, the method recursively finds the set of best completions and combines them to construct the final set of equally likely re-identifications. The search space is reduced by discarding insufficiently similar matches.
Algorithm 3 Function Breadth-First-Search, used in Algorithm 2.
 1: function Breadth-First-Search($i$, PartialCandidates_{i−1})
 2:   ▷ Find suitable candidates to match $(x_1, x_2, \ldots, x_i)$
 3:   $\mathbf{s}' \leftarrow (x_1, x_2, \ldots, x_i)$
 4:   for [...]
      ExtendedCandidates ← ∅
      [...]
 8:   ExtendedCandidates ← {$\mathbf{v}'$}
      [...]
13:   ExtendedCandidates ← ExtendedCandidates ∪ {$\mathbf{v}'$}
      [...]
      ▷ Find rest of matches for candidates
27:   return Breadth-First-Search($i + 1$, PartialCandidates_i)
For any candidate victim $z$ and any original victim $y_i$ such that $d(F_{z,\mathbf{v}}, \varphi F_i) > \beta$, the algorithm discards all matchings where $\phi(y_i) = z$.
To illustrate how the method works, recall the graphs $\langle S \rangle^w_{G^+}$ and $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$ depicted in Figure 3. The original set of victims is $Y = \{y_1, y_2, y_3, y_4\}$ and their fingerprints are [...], and $F_4 = \{x_3\}$, respectively. In consequence, we have [...] The method will first find all exact matchings, that is $\phi(y_2) = z_2$, $\phi(y_3) = z_3$, and $\phi(y_4) = z_4$, because the distances between the corresponding fingerprints are zero in all three cases. Since none of these matchings is ambiguous, the method next determines the match $\phi(y_1) = z_1$ [...]. At this point, the method stops and yields the unique re-identification $\{(y_1, z_1), (y_2, z_2), (y_3, z_3), (y_4, z_4)\}$.
Now, suppose that the vertex $z_5$ is linked in $t(\phi G^+)$ to $v_3$, instead of $v_2$, as depicted in Figure 4. In this case, the method will unambiguously determine the matchings $\phi(y_2) = z_2$ and $\phi(y_3) = z_3$, and then will try the two choices $\phi(y_4) = z_4$ and $\phi(y_4) = z_5$. In the first case, the method will make $\phi(y_1) = z_1$ and discard $z_5$. Analogously, in the second case the method will also make $\phi(y_1) = z_1$, and will discard $z_4$. Thus, the final result will consist of two equally likely re-identifications, namely $\{(y_1, z_1), (y_2, z_2), (y_3, z_3), (y_4, z_4)\}$ and $\{(y_1, z_1), (y_2, z_2), (y_3, z_3), (y_4, z_5)\}$.
Ideally, Algorithms 2 and 4 both yield unique solutions, in which case the sole element in the output of Algorithm 4 is given as the final re-identification. If this is not the case, the attack picks a random candidate sybil subgraph from $C^*$, uses it as the input of Algorithm 4, and picks a random re-identification from its output. If either algorithm yields an empty solution, the attack fails. Finally, it is important to note that, if Algorithm 2 is run with $\vartheta = 0$ and Algorithm 4 is run with $\beta = 0$, then the final result is exactly the same set of equally likely matchings that would be obtained by the original walk-based attack.
Algorithm 4 Given the graphs $G^+$ and $t(\phi G^+)$, the original set of victims $Y = \{y_1, y_2, \ldots, y_m\}$, the original fingerprints $F_1, F_2, \ldots, F_m$, a candidate set of sybils $\mathbf{v}$, and the maximum distance threshold $\beta$, obtain the set ReIdents of best matchings.
    ▷ Find best matches of some unmapped victim(s) to one or more candidate victims
    [...]
    ▷ Recursively find best completions for partial re-identifications
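The matching strategy described in this subsection can be illustrated with the following Python sketch. It is a simplified reading, not a transcription of Algorithm 4: the distance $d$ is assumed here to be the size of the symmetric difference between fingerprints, the function name and the toy fingerprints are hypothetical, and ties are handled by branching on every pair that reaches the current minimum distance and then removing duplicate results.

```python
def match_fingerprints(original, candidates, beta):
    """Return every equally good re-identification as a frozenset of
    (victim, candidate) pairs; an empty set means the attack fails.

    original:   dict victim -> frozenset, fingerprint mapped into v
    candidates: dict candidate vertex -> frozenset, fingerprint w.r.t. v
    """
    def distance(a, b):
        # ASSUMPTION: number of sybils on which the fingerprints disagree.
        return len(a ^ b)

    def complete(unmapped, free):
        if not unmapped:
            return {frozenset()}
        scored = [
            (distance(original[y], candidates[z]), y, z)
            for y in unmapped
            for z in free
        ]
        # Discard insufficiently similar matches (distance above beta).
        scored = [item for item in scored if item[0] <= beta]
        if not scored:
            return set()
        best = min(d for d, _, _ in scored)
        results = set()
        for d, y, z in scored:
            if d == best:
                for rest in complete(unmapped - {y}, free - {z}):
                    results.add(rest | {(y, z)})
        return results

    return complete(frozenset(original), frozenset(candidates))


# Hypothetical fingerprints, loosely modelled on the two-candidate case above.
original = {
    "y1": frozenset({"v1", "v2"}),
    "y2": frozenset({"v2", "v4"}),
    "y3": frozenset({"v4", "v5"}),
    "y4": frozenset({"v3"}),
}
candidates = {
    "z1": frozenset({"v1"}),          # one fingerprint edge lost to noise
    "z2": frozenset({"v2", "v4"}),
    "z3": frozenset({"v4", "v5"}),
    "z4": frozenset({"v3"}),
    "z5": frozenset({"v3"}),          # indistinguishable from z4
}
reidentifications = match_fingerprints(original, candidates, beta=1)
```

In this toy instance `reidentifications` holds two mappings that differ only in whether `y4` is matched to `z4` or `z5`, whereas `beta=0` returns an empty set because the damaged fingerprint of `y1` has no exact match.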

Experiments
The purpose of our experiments is to show the considerable gain in resiliency of the proposed robust active attack, in comparison to the original attack. We used for our experiments a collection of randomly generated synthetic graphs. This collection is composed of 10,000 graphs for each density value in the set $\{0.05, 0.1, \ldots, 1\}$. Each graph has 200 vertices⁴, and its edge set is randomly generated in such a manner that the desired density value is met.
As discussed in [1], for a graph having $n$ vertices, it suffices to insert $\log n$ sybil nodes to be able to compromise any possible victim, while it has been shown in [35,34] that even the so-called near-optimal sybil defences do not aim to remove every sybil node, but to limit their number to around $\log n$. In light of these two considerations, when evaluating each attack strategy on the collection of randomly generated graphs, we use 8 sybil nodes. Given the set $S$ of sybil nodes, the original attack randomly creates any set of fingerprints from $\mathcal{P}(S) \setminus \{\emptyset\}$. In the case of the robust attack, the pool of uniformly distributed fingerprints was generated in advance and, if it is larger than the number of victims, different sets of fingerprints of size $|Y|$ are randomly drawn from the pool at every particular run. Moreover, in Algorithms 2 and 4, we set $\vartheta = \beta = 8$.
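As a small illustration of the fingerprint space available to the original attack, the sketch below draws fingerprints uniformly from the non-empty subsets of $S$. The function name is hypothetical, and requiring the drawn fingerprints to be pairwise distinct is an assumption made here so that each victim can be told apart. With 8 sybil nodes there are $2^8 - 1 = 255$ non-empty subsets to choose from.

```python
import random


def random_fingerprints(sybils, num_victims, rng):
    """Draw `num_victims` distinct non-empty subsets of `sybils`.

    Each subset is encoded as a bit mask in 1 .. 2^|S| - 1, so the empty
    set (mask 0) can never be drawn.
    """
    total = 2 ** len(sybils) - 1
    if num_victims > total:
        raise ValueError("not enough distinct non-empty fingerprints")
    masks = rng.sample(range(1, total + 1), num_victims)
    return [
        frozenset(s for bit, s in enumerate(sybils) if (mask >> bit) & 1)
        for mask in masks
    ]


sybils = [f"x{i}" for i in range(1, 9)]           # 8 sybil nodes
fingerprints = random_fingerprints(sybils, 20, random.Random(0))
```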
For every graph in the collection, after simulating the attacker subgraph creation stage of each attack, and the pseudonymisation performed by the defender, we generate six variants of perturbed graphs using the following methods:
(a) The method proposed in [19] for enforcing $(k, \ell)$-anonymity for some $k > 1$ or some $\ell > 1$.
(c) The method proposed in [18] for enforcing $(k, \Gamma_{G,1})$-adjacency anonymity for a given value of $k$. Here, we will run the method with $k = |S|$, since it was empirically shown in [18] that the original walk-based attack is very likely to be thwarted in this case.
(e) Randomly flipping 5% of the edges in $G^+$ (that is, 1076 flips), in a manner analogous to the one used above.
(f) Randomly flipping 10% of the edges in $G^+$ (that is, 2153 flips), in a manner analogous to the one used above.
Finally, we compute the probability of success of the re-identification stage for each perturbed variant. The success probability is computed as [...] where $\mathcal{X}$ is the set of equally-likely possible sybil subgraphs retrieved in $t(\phi G^+)$ by the third phase of the attack⁵, and with $\mathcal{Y}_X$ containing all equally-likely matchings⁶ according to $X$. These experiments⁷ were performed on the HPC platform of the University of Luxembourg [31].
Figure 5 shows the success probabilities of the two attacks on the perturbed graphs obtained by applying the strategies (a) to (f). From the analysis of the figure, it is clear that the robust attack has a consistently larger probability of success than the original walk-based attack. A relevant fact evidenced by the figure (items (a), (b) and (c)) is that the robust attack is completely effective against the anonymisation methods described in [19,18]. As pointed out by the authors of these papers, the fact that the original walk-based attack leveraging more than one sybil node was thwarted to a considerable extent by the anonymisation methods was a side effect of the disruptions caused in the graph, rather than of the enforced privacy measure itself. By successfully addressing this shortcoming, the robust active attack lends itself as a more appropriate benchmark on which to evaluate future anonymisation methods.
For example, by comparing the charts in Figure 5 (a) and (b), we can see that the success probability of the original attack, for some low density values, is slightly higher for the second anonymisation method than for the first. Neither of these algorithms gives a theoretical privacy guarantee against an attacker leveraging eight sybil nodes. However, a difference in success probability is observed for the original attack, which is due to the fact that the second method introduces a smaller number of changes in the graph than the first one [18]. As can be observed in the figure, the robust attack is not affected by this difference between the methods, and allows the analyst to reach the correct conclusion that both methods fail to thwart the attack.
As a final observation, by analysing items (d) and (e) in Figure 5, we can see that even 1% of random noise completely thwarts the original attack, whereas the robust attack still performs at around 0.6 success probability. The robust attack also performs acceptably well with a 5% random perturbation.

Related work
Privacy attacks on social networks exploit structural knowledge about the victims to re-identify them in a released version of the social graph. These attacks can be divided into two categories, according to the manner in which the adversary obtains the knowledge used to re-identify the victims. On the one hand, passive attacks rely on existing knowledge, which can be collected from publicly available sources, such as the public view of another social network where the victims are known to have accounts. The use of this type of information was demonstrated in [21], where information from Flickr was used to re-identify users in a pseudonymised Twitter graph.
On the other hand, active attacks rely on the ability to alter the structure of the social graph, in such a way that the unique structural properties that allow the victims to be re-identified after publication are guaranteed to hold, and to be known by the adversary. As we discussed previously, the active attack methodology was introduced by Backstrom et al. in [1]. They proposed to use sybil nodes to create re-identifiable patterns for the victims, in the form of fingerprints defined by sybil-to-victim edges. Under this strategy, they proposed two attacks, the walk-based attack and the cut-based attack. The difference between the two attacks lies in the structure given to the sybil subgraph to facilitate its retrieval after publication. In the walk-based attack, a long path linking all the sybil nodes in a predefined order is created, with the remaining inter-sybil edges randomly generated. In the cut-based attack, a subset of the sybil nodes are guaranteed to be the only cut vertices linking the sybil subgraph and the rest of the graph. Interestingly, Backstrom et al. also study a passive version of these attacks, where fingerprints are used as identifying information, but no sybil nodes are inserted. Instead, they model the situation where legitimate users turn rogue and collude to share their neighbourhood information in order to retrieve their own weakly induced subgraph and re-identify some of their remaining neighbours. However, the final conclusion of this study is that the active attack is more capable, because sybil nodes can better guarantee the creation of a uniquely retrievable subgraph and unique fingerprints. Additionally, a hybrid strategy was proposed by Peng et al. [22,23]. This hybrid attack is composed of two stages. First, a small-scale active attack is used to re-identify an initial set of victims, and then a passive attack is used to iteratively enlarge the set of re-identified victims with neighbours of previously re-identified victims. Because of the order in which the active and the passive phases are executed, the success of the initial active attack is critical to the entire attack.
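The inter-sybil structure of the walk-based attack described above can be sketched as follows. This is a minimal illustration, not the construction of [1] in full: the function name is hypothetical, and adding each non-path edge independently with probability `p = 0.5` is an assumption, since the text only states that the remaining edges are randomly generated.

```python
import random
from itertools import combinations


def walk_based_sybil_edges(sybils, rng, p=0.5):
    """Build the inter-sybil edge set: a path through the sybils in the
    given order, plus randomly chosen edges between non-consecutive sybils."""
    # The path x_1 - x_2 - ... - x_k is always present.
    edges = {frozenset(pair) for pair in zip(sybils, sybils[1:])}
    for i, j in combinations(range(len(sybils)), 2):
        # ASSUMPTION: every other pair is linked independently with prob. p.
        if j - i > 1 and rng.random() < p:
            edges.add(frozenset((sybils[i], sybils[j])))
    return edges


sybil_order = [f"x{i}" for i in range(1, 9)]
sybil_edges = walk_based_sybil_edges(sybil_order, random.Random(1))
```

Exact-match retrieval after publication relies on this whole pattern surviving intact, so a single flipped edge inside it is enough to break it.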
A large number of anonymisation methods have been proposed for privacy-preserving publication of social graphs, which can be divided into three categories: those that produce a perturbed version of the original graph [13,37,38,5,14,3,6,32,15,26,24,4], those that generate a new synthetic graph sharing some statistical properties with the original graph [10,20,12], and those that output some aggregate statistic of the graph without releasing the graph itself, e.g. differentially private degree correlation statistics [25]. Active attacks, both the original formulation and the robust version presented in this paper, are relevant to the first type of releases. In this context, a number of methods have been proposed aiming to transform the graph into a new one satisfying some anonymity property based on the notion of $k$-anonymity [27,29]. Examples of this type of anonymity properties for passive attacks are $k$-degree anonymity [13], $k$-neighbourhood anonymity [37] and $k$-automorphism [38]. For the case of active attacks, the notion of $(k, \ell)$-anonymity was introduced by Trujillo and Yero in [30]. A $(k, \ell)$-anonymous graph guarantees that an active attacker with the ability to insert up to $\ell$ sybil nodes in the network will still be unable to distinguish any user from at least $k - 1$ other users, in terms of their distances to the sybil nodes. Several relaxations of the notion of $(k, \ell)$-anonymity were introduced in [18]. The notion of $(k, \ell)$-adjacency anonymity accounts for the unlikelihood of the adversary knowing all distances in the original graph, whereas $(k, \Gamma_{G,\ell})$-anonymity models the protection of the victims only from vertex subsets with a sufficiently high re-identification probability, and $(k, \Gamma_{G,\ell})$-adjacency anonymity combines both criteria.
Anonymisation methods based on the notions of $(k, \ell)$-anonymity, $(k, \Gamma_{G,\ell})$-anonymity and $(k, \Gamma_{G,\ell})$-adjacency anonymity were introduced in [19,17,18]. As we discussed above, despite the fact that these methods only give a theoretical privacy guarantee against adversaries with the capability of introducing a small number of sybil nodes, empirical results show that they are in fact capable of thwarting attacks leveraging larger numbers of sybil nodes. These results are in line with the observation that random perturbations also thwart active attacks in their original formulation [21,11]. This is a result of the fact that, originally, active attacks attempt to exactly retrieve the sybil subgraph and match the fingerprints.
In the context of obfuscation methods, which aim to publish a new version of the social graph with randomly added perturbations, Xue et al. [33] assess the possibility of the attacker leveraging the knowledge about the noise generation to launch what they call a probabilistic attack. In their work, Xue et al. provided accurate estimators for several graph parameters in the noisy graphs, to support the claim that useful computations can be conducted on the graphs after adding noise. Among these estimators, they included one for the degree sequence of the graph. Then, noting that an active attacker can indeed profit from this estimator to strengthen the walk-based attack, they show that after increasing the perturbation by a sufficiently small amount this attack also fails. Although the probabilistic attack presented in [33] features some limited level of noise resilience, it is not usable as a general strategy, because it requires the noise to follow a specific distribution and the parameters of this distribution to be known by the adversary. Our definition of robust attack makes no assumptions about the type of perturbation applied to the graph.
Finally, we point out that the active attack strategy shares some similarities with graph watermarking methods, e.g. [7,36,8]. The purpose of graph watermarking is to release a graph containing embedded instances of a small subgraph, the watermark, that can be easily retrieved by the graph publisher, while remaining imperceptible to others and being hard to remove or distort. Note that the goals of the graph owner and the adversary are to some extent inverted in graph watermarking, with respect to active attacks. Moreover, since the graph owner knows the entire graph, he can profit from this knowledge for building the watermark. During the sybil subgraph creation phase of an active attack, only a partial view of the social graph is available to the attacker.

Conclusions
In this study, we have re-assessed the capabilities of active attackers in the setting of privacy-preserving publication of social graphs. In particular, we have given definitions of robustness for different stages of the active attack strategy and have shown, both theoretically and empirically, scenarios under which these notions of robustness lead to considerably more successful attacks. One particular criticism found in the literature, that of active attacks lacking resiliency even to a small number of changes in the network, has been shown in this paper not to be an inherent problem of the active attack strategy itself, but rather of specific instances of it. In light of the results presented here, we argue that active attacks should receive more attention from the privacy-preserving social graph publication community. In particular, existing privacy measures and anonymisation algorithms should be revised, and new ones should be devised, to account for the capabilities of robust active attackers.
[...] sybil-extended graph of $G$. The attacker does not know the complete graph $G^+$, but he knows $\langle S \rangle^w_{G^+}$, the weakly-induced subgraph of $S$ in $G^+$. We say that $\langle S \rangle^w_{G^+}$ is the attacker subgraph. The attacker subgraph creation has two substages:
(a) Creation of inter-sybil connections. A unique (with high probability) and efficiently retrievable connection pattern is created between sybil nodes to facilitate the attacker's task of retrieving the sybil subgraph at the final stage.
(b) Fingerprint creation. For a given victim vertex $y \in N_{G^+}(S) \setminus S$, we call the victim's neighbours in $S$, i.e. $N_{G^+}(y) \cap S$, its fingerprint. Considering the set [...]
The sybil nodes are added and fingerprints are created for the victims.
The attacker subgraph is recovered and the victims are re-identified.

Figure 3: An example of possible graphs $\langle S \rangle^w_{G^+}$ and $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$. Vertices in $S$ and $\mathbf{v}$ are coloured in gray.

Figure 4: Alternative example of possible graphs $\langle S \rangle^w_{G^+}$ and $\langle \mathbf{v} \rangle^w_{t(\phi G^+)}$.

(d) Randomly flipping 1% of the edges in $G^+$. Each flip consists in randomly selecting a pair of vertices $u, v \in V_{G^+}$, removing the edge $(u, v)$ if it belongs to $E_{G^+}$, or adding it in the opposite case. Since $G^+$ has order $n = 208$, this perturbation performs $0.01 \cdot \frac{n(n-1)}{2} \approx 215$ flips.
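The random-flip perturbation and the flip counts quoted for the 1%, 5% and 10% settings can be reproduced with the short Python sketch below. Function names are hypothetical; rounding to the nearest integer and allowing the same pair to be selected more than once are assumptions, chosen because rounding matches the counts reported for $n = 208$.

```python
import random


def number_of_flips(n, fraction):
    """Fraction of the n(n-1)/2 vertex pairs, rounded to the nearest integer."""
    return round(fraction * n * (n - 1) / 2)


def flip_edges(vertices, edges, fraction, rng):
    """Return a copy of `edges` after random flips: an existing edge is
    removed, a missing one is added. Edges are frozensets {u, v}."""
    perturbed = set(edges)
    for _ in range(number_of_flips(len(vertices), fraction)):
        u, v = rng.sample(vertices, 2)   # two distinct vertices
        perturbed ^= {frozenset((u, v))}  # toggle membership of the pair
    return perturbed


vertices = list(range(208))              # 200 original vertices plus 8 sybils
noisy_edges = flip_edges(vertices, set(), 0.01, random.Random(0))
```

For $n = 208$ there are $208 \cdot 207 / 2 = 21528$ vertex pairs, so the three settings give 215, 1076 and 2153 flips respectively.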