1 Introduction

Society grows more dependent on technology every day. Threats, viruses, and technological risks have always existed, but with our growing technological dependence, security has become an increasingly serious matter. In the last few years, online social networks have changed the dynamics of our day-to-day lives. Several studies show that people are more interested in having more friends and being well liked than in keeping their personal information restricted [7]. Millions of people, regardless of age, nationality, or education, are part of online social networks and expose highly personal information about themselves in exchange for a service to communicate with their friends, family, and colleagues. Companies like Facebook, Instagram, Twitter, or Tinder have sought to coordinate an attitudinal shift in the value placed on privacy.

Privacy in social networks is an open field of research; data is the raw material of decision makers and analysts. It has been shown how much personal information people disclose voluntarily, such as full names, photos, mobile numbers, and addresses. Many of them are unaware of the risks and the price associated with their personal information [39]. When using online social networks, it is important to understand and recognize the privacy risks involved [20]. The majority of people are unaware that their privacy has been endangered, and they do nothing to protect themselves. For example, if someone posts personal information online, it is no longer private and could fall into the wrong hands. Even when it is posted with the highest possible security measures, some of the users’ acquaintances, such as friends, colleagues, and companies interacting with them, can expose their personal information. Even when the exposure is unintentional, the information can be used to deliberately diminish their security or privacy. These users can become victims of identity theft, harassment, cyberbullying, or illegal practices. The privacy paradox states that individuals express security and privacy concerns about information sharing in online social networks, but their behavior shows the opposite: they publish all kinds of personal data [7]. Experts recommend reviewing our online social network profiles, paying close attention to how each profile allows personal information to be protected, and taking advantage of the enhanced privacy tools available to block personal targeting. Since privacy is a human right and a necessary condition for goods that are part of our well-being, such as freedom and security, it is important to pursue the goal of building better communication platforms without giving up our privacy.
In recent years we have observed the implementation of technologies oriented toward different approaches, such as anonymous communications, identity management, digital credentials, e-voting, privacy engineering, and others [16, 21, 22]. In this sense, privacy enhancing technologies aim to provide tools for interacting safely with technology while preserving users’ privacy. This topic is also addressed by the k-anonymity [41], l-diversity [31], and t-closeness [34] methods. These are privacy preserving techniques that aim to protect databases by scrambling and swapping values or adding noise while keeping the information usable. The challenge is to release data that remains useful for analysis and statistical research [25]. The effect of combining k-anonymity with unlinkable systems such as mix servers is shown in [23].

In this work we are interested in research on intersection attacks, a form of traffic analysis attack. The family of statistical disclosure attacks belongs to this class [13]. By combining a statistical disclosure attack with Social Network Analysis (SNA) techniques, an adversary may be able to obtain important data about the whole network: its topological structure, subsets of social data, and its communities and their interactions [1]. SNA is well established in many areas and has many practical uses. Community detection recognizes groups of interest based on their behavior [29]. SNA may help to identify leaders who can influence other users to buy specific products. Link prediction research infers new interactions in a social network by analyzing several measures of its nodes [3, 20, 38]. Sociologists and historians want to know the correlations among political and social actors [2, 24]; epidemiologists study disease transmission and the influence of personal and social networks on health behavior; anthropologists measure the evolution of sociocultural systems, trying to understand what is going on, what went on before, and what the future prospects are [8, 20, 42, 45]. Collective inference techniques are used in online blog analysis to predict the behavior of an entity through its connections. Machine learning techniques or natural language models are used to identify the author of a text by analyzing the writing style and the vocabulary used [4, 32]. In ethology, SNA is used to study the behavior of animals and learn about members of the same species [47]. The areas of study that use SNA include, but are not limited to, economics, biology, anthropology, information science, social psychology, sociolinguistics, and sociology.

However, there are still open problems when background knowledge is available to an attacker. Some results on identity disclosure in anonymous communications over social networks have been published in previous papers [36, 37]. The aim of this work is to extend and clarify our intersection attack for disclosing the structural properties of a social network. The foundation of this work was presented in [37]. In this paper we validate our algorithms by measuring how many relationships can be inferred in an anonymous network when the attacker obtains partial information. Using them, an attacker can gain a competitive advantage and disclose important information in a real social network even when an anonymization mechanism has been applied. In Section 2 we show the relevance of social network analysis and the state of the art in statistical disclosure attacks applied to real social networks; that section also briefly describes several techniques used to mine data in social networks. The results and simulations are shown in Section 3, and the conclusions and future work are presented in Section 4.

2 Relevance of privacy in social network

2.1 Social network properties

A social network is a social structure made of individuals who are connected by one or more types of relationships. It can be represented as a graph in which the vertices represent individuals or entities and the edges represent the relations among them. Formally, a simple social network is modeled as a graph \(G = (V, E)\), where \(V = \{v_1, \ldots, v_n\}\) is the set of vertices or nodes, representing entities or individuals, and \(E \subseteq \{(v_i, v_j) \mid v_i, v_j \in V\}\) is the set of social relationships, represented as edges in the graph.

The literature distinguishes three levels of analysis in Social Network Analysis (SNA) [26, 40, 46]: i) analysis of egocentric networks; ii) analysis focused on subgroups of actors; iii) analysis focused on the overall structure of the network. The objective of egocentric network analysis is to study how the behavior of an actor evolves, focusing solely on that actor and his relationships with the rest of the participants. The second type of analysis helps to understand the logic of network clustering and the existence of cooperation and competition patterns, which are adapted or maintained over time. Finally, the analysis of the overall structure of the network considers its morphological characteristics; the existence, role, and interaction of subgroups; the distribution of relationships between the actors involved; and the geodesic distance between actors, among others. One of the three levels of analysis is chosen according to the type of problem to be solved.

The structural analysis of a social network is based on building a matrix and a graph to represent the relationships among users. It is common to represent a graph with an adjacency matrix M of size n × n, where n is the number of nodes. If there is an edge between node i and node j, a 1 is placed in cell (i,j), and a 0 otherwise. Suppose we want to examine friendship in a set of 5 people. The representation is shown in Table 1 as an adjacency matrix, where 1 indicates friendship and 0 indicates no relationship between users i and j. Figure 1 shows the same friendship relationships as a directed graph composed of 5 nodes.

Table 1 Example of a representation of friendship
Fig. 1 Example of a directed graph
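
As a minimal sketch of the adjacency-matrix representation described above, the following Python snippet builds the n × n matrix for a directed friendship graph of 5 people. The edge list is hypothetical and is not the data of Table 1; the function name `adjacency_matrix` is likewise an assumption made for illustration.

```python
def adjacency_matrix(n, edges):
    """Return an n x n 0/1 matrix M with M[i][j] = 1 if there is an edge i -> j."""
    m = [[0] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = 1
    return m


# Hypothetical directed friendship nominations among 5 people (indices 0..4).
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 4), (4, 0)]
M = adjacency_matrix(5, edges)

for row in M:
    print(row)

# Out-degree of node i is the sum of row i; in-degree is the sum of column i.
out_degree = [sum(row) for row in M]
in_degree = [sum(M[i][j] for i in range(5)) for j in range(5)]
```

For an undirected network the same matrix would be symmetric, so each edge would be written in both cells (i,j) and (j,i).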

A graph can also be characterized by various topological measures. In SNA it is important to know whether a node can be reached from another node. In that case, it is interesting to identify how many paths exist and which one is the best. Paths are used to calculate the distance between two nodes. A path is a sequence of nodes and lines in which all nodes and all lines are distinct. The path length is the number of lines in it; the first node is called the origin and the last one the destination. A shortest path between two nodes is a path of minimal length among all possible paths between them.

The shortest path between two nodes is called a geodesic path. The length of a geodesic path is called the geodesic distance and is denoted d(i,j), the distance between nodes ni and nj. In both directed and undirected graphs, the geodesic distance is the number of relationships in the shortest possible path from one actor to another. Distances are used in several centrality measures. One of the main uses of graph theory in SNA is to identify the most important nodes. Centrality measures at the node level include node degree, nodal transitivity, betweenness, and closeness. Measures of the entire network, such as density, diameter, and clustering coefficient, allow the structure of whole networks to be compared.
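
The following sketch illustrates how geodesic distances and a few of the measures named above could be computed for a small undirected network. The adjacency lists and node labels are hypothetical, and breadth-first search is used here as one standard way of finding shortest paths in an unweighted graph; it is not claimed to be the procedure used in this work.

```python
from collections import deque


def geodesic_distances(adj, source):
    """Breadth-first search: geodesic distance d(source, j) for every reachable j."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical undirected network stored as adjacency lists.
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

d_from_a = geodesic_distances(adj, "a")

n = len(adj)
degree = {u: len(neigh) for u, neigh in adj.items()}
num_edges = sum(degree.values()) // 2      # each undirected edge is counted twice
density = 2 * num_edges / (n * (n - 1))    # fraction of possible edges present
# Diameter: the longest geodesic distance (meaningful here because the graph is connected).
diameter = max(max(geodesic_distances(adj, u).values()) for u in adj)
```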

A network can be an extremely complex structure; the connections between nodes may follow complicated patterns. One challenge in studying complex networks is to develop simple metrics that capture structural elements in an understandable form. One such simplification is to ignore any pattern between different nodes and to observe each node separately. The degree of a node in an undirected network is the number of its connections. By counting the number of nodes that have each degree, we obtain the degree distribution \(P_{deg}(k)\), defined as the fraction of nodes in the graph with degree k.

An example of the degree distribution of an undirected graph is shown in Fig. 2.

Fig. 2 Degree distribution example

Here the degrees are \(k_1 = 1, k_2 = 3, k_3 = 1, k_4 = 1, k_5 = 2, k_6 = 5, k_7 = 3, k_8 = 3, k_9 = 2,\) and \(k_{10} = 1\). The degree distribution is \(P_{deg}(1) = \frac{4}{10}, P_{deg}(2) = \frac{2}{10}, P_{deg}(3) = \frac{3}{10},\) and \(P_{deg}(5) = \frac{1}{10}\).
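
The calculation above can be reproduced with a short sketch. The degree list is the one given in the text; the function name `degree_distribution` is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Degrees k1..k10 as listed in the text for the graph of Fig. 2.
degrees = [1, 3, 1, 1, 2, 5, 3, 3, 2, 1]


def degree_distribution(degrees):
    """Return {k: fraction of nodes with degree k}."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: Fraction(c, n) for k, c in sorted(counts.items())}


p_deg = degree_distribution(degrees)
for k, p in p_deg.items():
    # Fraction reduces automatically, so 4/10 is displayed in lowest terms.
    print(f"P_deg({k}) = {p.numerator}/{p.denominator}")
```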

The degree distribution gives important clues about the structure of a network. For example, in the simplest types of networks, most nodes have similar degrees. Real-world networks usually have very different degree distributions: most nodes have a relatively small degree, but a few nodes have a very high degree.

Several works in the literature suggest that real-world social networks have very particular characteristics. Complex networks such as the WWW or social networks do not have a designed architecture; rather, they have organized themselves according to the actions of many individuals. From these interactions, global phenomena can emerge, for example small-world properties or scale-free degree distributions. These two global properties have considerable implications for network behavior under attack, as well as for the dissemination of information and epidemiological issues. In the late 1950s, Erdős and Rényi [19] set a precedent in classical mathematical theory for modeling complex networks by describing a network as a random graph, laying the foundations of the theory of random networks.

Networks composed of people connected through the exchange of emails exhibit characteristics of small-world networks and scale-free networks. The term “scale-free network” describes networks that exhibit a power-law distribution [6]. The characteristic of such networks is that the distribution of links forms a straight line when plotted on a doubly logarithmic (log-log) scale, as we can see in Fig. 3. The power law belongs to the family of distributions skewed toward the extremes, describing events in which a random variable reaches high values infrequently, while medium or low values are much more common. Seen from another angle, under a power law the probability of small events is relatively high, while the probability of large events is relatively low.

Fig. 3 Power law distribution
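
To illustrate why a power law appears as a straight line on a log-log plot, the sketch below builds a synthetic distribution proportional to \(k^{-\gamma}\) and fits the slope of log P(k) against log k by least squares. The exponent value, the degree range, and the function name `loglog_slope` are assumptions chosen for the example; on empirical data a simple regression of this kind is only a rough diagnostic.

```python
import math


def loglog_slope(ks, ps):
    """Least-squares slope of log(p) against log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic degree distribution P(k) proportional to k^(-gamma), with an assumed gamma.
gamma = 2.5
ks = list(range(1, 51))
weights = [k ** -gamma for k in ks]
total = sum(weights)
ps = [w / total for w in weights]

slope = loglog_slope(ks, ps)
print(f"fitted slope = {slope:.3f}")  # close to -gamma for exact power-law data
```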

Studies that have analyzed the structural properties of email networks conclude that traffic from a legitimate email system results in small-world and scale-free networks [33]. On the other hand, it has also been argued that an email system considered as a single whole does not display fully scale-free behavior, owing to antisocial behavior such as spam (Fig. 3).

One previous work on email networks studied the structure of email networks by observing university log files [18], considering the network topology in which email addresses are nodes and the communications among them are edges. The resulting network also shows a distribution of links with pronounced scale-free and small-world behavior. We have taken into account the features of real social networks in order to achieve our goal of inferring as many relationships as possible in an email social network. Several papers review the evolution of different types of real networks [27]. Other works use communication patterns in the Enron email dataset to detect social tensions [11], discover structures within the organization [10], and identify the most relevant actors in the network over time [44]. A more detailed work studied more than 100 real-world networks to reveal clusters or communities; the authors note that large networks have a very different structure from small-world networks [28], and that there is an inverse relationship between the size of a community and its quality. Communities of more than 100 nodes do not show good conductance, which can be interpreted as not being good communities; the best communities are quite small, in the range of 10 to 100 nodes.

2.2 Privacy in social network

Privacy concerns about social network analysis have been considered for several years. There is no doubt that social networks are attracting increasing interest in the database, data mining, and theory communities. Hiding the identities of the members of a social network in order to preserve their privacy still poses many open problems. One of them is the use of background knowledge to map individuals with known identities to anonymized nodes. To clarify how an attacker can take advantage of context information, it is important to distinguish between passive and active attacks. In the first case, attackers just observe the data flow, while in the second case, attackers actively manipulate nodes before anonymization in order to reveal the identities of network users [17]. Techniques for privacy preservation in social networks follow different approaches, such as edge modification [30], randomization of the network structure [48], and prevention of identity disclosure [5], among others. Moreover, users have no guarantee that their data is protected from operators. Social network users require protection against malicious entities. An architecture has been proposed to protect personal information from social network operators and other users [9].

The current problems of preserving privacy in social networks can be classified into three items: i) each method has been designed for a particular network; ii) there are many network measures, so it is difficult to know whether network performance is optimal because there is no standardized platform; iii) temporal information must be considered in order to obtain accurate results.

Traffic analysis is used to derive information from the patterns of a communication system. It has been shown that encryption by itself does not guarantee anonymity. Even when the content of the communications is encrypted, the routing information must be sent in the clear. Intersection attacks obtain the most likely set of friends of a particular user by intersecting the anonymity sets that receive messages in the rounds in which that user sends. One of the most widely used mechanisms for protection against this type of attack is the use of mixes.
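
The intersection idea described above can be sketched in a few lines. This is a deliberately simplified illustration, assuming that the target writes to one fixed friend in every round in which they send; the round data, user names, and the function name `intersection_attack` are hypothetical, and the statistical attacks discussed in this paper relax that assumption.

```python
def intersection_attack(rounds, target):
    """Intersect the receiver sets of all rounds in which `target` sent a message."""
    candidates = None
    for senders, receivers in rounds:
        if target in senders:
            if candidates is None:
                candidates = set(receivers)
            else:
                candidates &= set(receivers)
    return candidates if candidates is not None else set()


# Hypothetical observations: (senders, receivers) seen entering and leaving the mix per round.
rounds = [
    ({"alice", "bob", "carol"}, {"dave", "erin", "frank"}),
    ({"alice", "erin"}, {"dave", "bob"}),
    ({"bob", "carol"}, {"erin", "frank"}),
    ({"alice", "frank"}, {"dave", "carol"}),
]

print(intersection_attack(rounds, "alice"))
```

When a user has several friends, the plain intersection can become empty, which is one motivation for the statistical variants of the attack.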

In the literature on statistical disclosure attacks, the hypotheses are overly demanding and unrealistic. For example, scenarios were assumed in which messages had to be sent with uniform probability by all users, the number of friends of a user or some network parameters had to be known in advance, or all users had to behave similarly, for instance in the average number of messages sent or received. To our knowledge, the method presented in [35] was the first statistical disclosure attack applied to email or social network data to detect relationships between users. It leads to results in different dimensions: estimation of the number of messages sent per round or unit of time for each sender-receiver pair; ordering of the pairs from highest to lowest likelihood of communication; and hard classification of pairs of users as communicating or not communicating. Moreover, there is no restriction on the number of friends each user has [14, 15]. Later, a second version of the procedure, which includes a second pass over the data using the EM algorithm, was presented in [36]. This improvement obtains better estimates of the messages sent by users and detects which users really communicate. Each user i sends messages to user j in each round according to a Poisson distribution with rate λi,j. Users who do not communicate with each other have rate λi,j = 0. Classification rates have been reported for three different methods: (i) uniform distribution, (ii) EM algorithm with Poisson distribution, and (iii) EM algorithm with discrete tabulated distribution. One of the major results is the occasional detection of some pairs that have certainly communicated (without any doubt, based on combinatorial deduction) and of some pairs that never communicated within the time horizon of the attack. In this work we have applied the last method, the EM algorithm with discrete tabulated distribution. In each step t, where t is the number of rounds, the following two steps are performed:

  1. The first step, known as the Expectation step, calculates the expectation with respect to the distribution of Z conditioned on the values of X and θ, where X denotes the values of the marginals and θ is the parameter vector of the rates λ.

  2. In the second step, known as the Maximization step, a new value of θ is obtained. It is assumed that the attacker knows the number of messages sent and received by each user.
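
The traffic model underlying these steps can be made concrete with a small simulation. The sketch below draws the hidden pairwise message counts of one round from Poisson distributions with rates λi,j and returns only the marginals (messages sent and received per user), which is what the attacker is assumed to observe. The rate matrix, the seed, and the helper names are hypothetical, the sampler is Knuth's classical method, and the code does not implement the EM procedure of [36].

```python
import math
import random


def poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method (small rates)."""
    if rate == 0:
        return 0
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_round(rates, rng):
    """Return (sent, received) marginals for one round; the pairwise counts stay hidden."""
    n = len(rates)
    counts = [[poisson(rates[i][j], rng) for j in range(n)] for i in range(n)]
    sent = [sum(counts[i]) for i in range(n)]
    received = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return sent, received


# Hypothetical rates lambda[i][j]; a zero means users i and j never communicate.
rates = [
    [0.0, 1.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]
rng = random.Random(0)
sent, received = simulate_round(rates, rng)
```

The attack then works backwards: from many rounds of such marginals it estimates which rates are nonzero.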

We observed that email data patterns are very specific. Details of a statistical disclosure attack that estimates the network and node characteristics of an anonymized email network, such as the power law coefficient, centrality and clustering measures, degree distribution, and small-world-ness, are described in [37]. The estimates make it possible to follow the evolution of the networks, evaluate differences between networks, and identify the most influential users. This was an innovation, because no previous statistical disclosure attack had been used to estimate global network characteristics or node-based measures.

3 Application of the method to the estimation of email network characteristics

It has been shown that an attacker can reveal the identities of mix users by analyzing network traffic, watching the flow of incoming and outgoing messages. An attacker can obtain partial information to study an anonymous social network by exploiting its vulnerability to attacks that capture paths [12, 43]. Such attacks use the vulnerability of network traffic to compromise the identity of users and thereby the network.

We have applied our algorithm to previously anonymized data provided by the Data Center of the Universidad Complutense of Madrid. The data is divided into the 32 subdomains, or faculties, that make up the email system. We do not consider any further information, such as timestamps or email content. We assume the attacker only obtains traffic information, that is, the number of messages each user sends and receives in every period of time, which we call a round. A batch is the number of messages sent and received in a round. Not all users participate in each round, and the sender set and the receiver set are not always the same. Only a small fraction of users are active, sending and receiving messages. Figure 1 represents a round with 5 users.
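
As an illustration of the attacker's view described above, the following sketch splits an ordered message log into rounds of a fixed batch size and keeps only the per-user counts of messages sent and received. The log, the user identifiers, and the function name `rounds_from_log` are hypothetical; the real data and its preprocessing are not reproduced here.

```python
from collections import Counter


def rounds_from_log(messages, batch_size):
    """Split a (sender, receiver) log into rounds and keep only per-user counts."""
    rounds = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        sent = Counter(sender for sender, _ in batch)
        received = Counter(receiver for _, receiver in batch)
        rounds.append((sent, received))
    return rounds


# Hypothetical anonymized log, in arrival order; user identifiers are placeholders.
log = [("u1", "u2"), ("u3", "u2"), ("u1", "u4"),
       ("u2", "u1"), ("u5", "u3"), ("u1", "u2")]

rounds = rounds_from_log(log, batch_size=3)
for sent, received in rounds:
    print(dict(sent), dict(received))
```

Within a round the pairing between senders and receivers is discarded, so a larger batch size hides more of the underlying relationships. If the log length is not a multiple of the batch size, the last round in this sketch is simply smaller.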

For demonstration purposes we have chosen only Faculty A, a network composed of 85 users, represented as nodes, whose relationships are the edges. Table 2 presents the results obtained after applying our algorithm to 3 months of Faculty A data, and Table 3 presents the results for 12 months of data. Since the information obtained consists only of the number of messages sent and received by the users in each round, the size of the rounds (batch size) is an important parameter that seriously affects the results. The first three rows show the estimated results for batch sizes of 10, 30, and 50 messages. The last row shows the real values of the network. We can see that smaller batches yield estimates closer to the real values of the network; batch sizes 10 and 30 both yield 406 edges. In Table 3 the result is similar: a batch size of 10 gives the best estimates of the real values of the social network.

Table 2 Results of Faculty A for 3 months observations
Table 3 Results of Faculty A for 12 months observations

In Figs. 4 and 5 we present the results obtained, where: i) red edges represent the relationships among users disclosed in the anonymous network; ii) green edges are the links that our algorithm has not detected. For both three and twelve months we have overlaid the estimated and real graphs, because the differences are small. Figure 4 shows the estimated and real graphs of Faculty A with a time horizon of three months. Figure 5 shows the results for the same Faculty A over a twelve-month period. We also note that both networks exhibit small-world and scale-free characteristics.

Fig. 4 Simulated vs. real graph of Faculty A for 3 months

Fig. 5 Simulated vs. real graph of Faculty A for 12 months
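
The comparison shown in Figs. 4 and 5 amounts to classifying edges of the real and estimated graphs. A minimal sketch follows, assuming undirected edges; the edge lists are hypothetical and are not the Faculty A data, and the function name `compare_edges` and the "spurious" category are introduced only for illustration.

```python
def compare_edges(real_edges, estimated_edges):
    """Classify undirected edges as detected, missed, or spurious."""
    real = {frozenset(e) for e in real_edges}
    est = {frozenset(e) for e in estimated_edges}
    return {
        "detected": real & est,   # real links the attack disclosed
        "missed": real - est,     # real links the attack did not find
        "spurious": est - real,   # estimated links absent from the real network
    }


# Hypothetical edge lists; not the Faculty A data.
real_edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
estimated_edges = [(2, 1), (3, 2), (4, 5)]

result = compare_edges(real_edges, estimated_edges)
# Share of real relationships that were disclosed.
recall = len(result["detected"]) / (len(result["detected"]) + len(result["missed"]))
```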

We have used different batch sizes to estimate the most important nodes of each graph and have obtained almost the same results. The results show that our algorithm is able to recognize the most influential nodes within a network even as the number of nodes increases. The scheme can be carried over to other contexts; for example, repeated polls or elections in small populations, where the attack can be used to rank the likelihood that users vote for certain political groups, or anti-terrorism research, where the method can use phone call information in repeated contexts to link senders with recipients.

Table 4 presents the five highest degree centralities calculated for Faculty A; the first three columns were estimated with batch sizes of 10, 30, and 50, and the last column shows the five highest degrees of the real network. Table 5 shows the five lowest degree centralities with the same layout: the first three columns were estimated with batch sizes of 10, 30, and 50, and the last column shows the five lowest degree centralities of the real network. We can see that our algorithm estimates the more connected nodes better.

Table 4 Five highest degree centrality nodes of Faculty A
Table 5 Five lowest degree centrality nodes of Faculty A
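
A ranking of this kind can be sketched as follows: nodes are ordered by degree in both the real and the estimated graph, and the top-k sets are compared. The edge lists, the value of k, and the function name `top_k_by_degree` are hypothetical and are not taken from Tables 4 and 5.

```python
from collections import Counter


def top_k_by_degree(edges, k):
    """Return the k nodes with the highest degree in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort by descending degree, breaking ties by node label for reproducibility.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    return ranked[:k]


# Hypothetical real and estimated edge lists; not the Faculty A data.
real = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
estimated = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

top_real = top_k_by_degree(real, 2)
top_est = top_k_by_degree(estimated, 2)
# Fraction of the real top-2 nodes that also appear in the estimated top-2.
overlap = len(set(top_real) & set(top_est)) / 2
```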

In Fig. 6 we present the comparison of estimated and actual degrees of Faculty A for 3 and 12 months; the closer a point is to the diagonal, the better the estimate. Otherwise, the points lie above or below the diagonal.

Fig. 6 Simulated vs. real graph of Faculty A for 12 months

4 Conclusions and future work

The attacker’s objective in the present work is to disclose whether or not communication exists between a pair of users in a network communication system. In this paper we have described the characteristics and metrics of social networks. We have applied an attack to disclose identities in an anonymous university email system, representing that system as a social network. Using social network analysis techniques and several social network measures, we were able to determine users’ centrality and detect the most influential users in the network. From the results we found that the attack performs better with small batches and that the estimated graph is very similar to the real one. For future work, other social network data could be used to further investigate the performance of the strategy developed here. There are also standard databases that could be used as benchmarks (Enron email data, for example). Other applications in the field of disclosure of public data could also be considered.