Complex Networks VII pp 107-118
Analysis of the Temporal and Structural Features of Threads in a Mailing-List
Abstract
A link stream is a collection of triplets (t, u, v) indicating that an interaction occurred between u and v at time t. Link streams model many real-world situations, such as email exchanges between individuals or connections between devices. Much work is currently devoted to the generalization of classical graph and network concepts to link streams. In this paper, we generalize the existing notions of intra-community density and inter-community density. We focus on email exchanges in the Debian mailing-list and show that threads of emails, like communities in graphs, are dense subsets that are loosely connected to each other from a link stream perspective.
Keywords
Mailing Lists, Stream Link, Email Exchanges, Inter-contact Time, Substreams
1 Introduction
Exchanges in a mailing-list are often studied as complex networks: there is a link between two individuals if they exchange emails. In particular, communities in such complex networks capture groups of friends or close colleagues (individuals that exchange many more messages within the group than outside the group, typically) [2]. However, removing all time information has important consequences if one wants to study the dynamics of email exchanges.
In order to study those dynamics, one may label each link with the frequency of exchanges or the times at which they occur [5], but capturing both the structure and the dynamics of exchanges remains challenging. In particular, studying threads calls for methods that capture the temporal nature of interactions more accurately, without losing the power of network analysis.
Fig. 1 An example of a link stream representing email exchanges between individuals a, b, c, d and e, with threads represented by colored areas. For instance, at time 5, b and c exchange an email, as do d and e. Threads are a priori dense series of exchanges involving a limited group of nodes during a limited period of time
2 Dataset
Archives of exchanges in various mailing-lists are readily available on the web, and studying them provides rich insights into various issues. They have the advantage of being publicly available in many cases, and some involve large numbers of users over long time periods.
A typical example is provided by the Debian mailing-list [4]: it contains emails sent from 51753 email addresses over almost 20 years. In addition, exchanges in this mailing-list have been studied in the past [1, 3, 7]. Finally, this dataset provides the thread information for each message, which we can use as a ground truth. For all these reasons, we use the Debian mailing-list in this paper to illustrate and validate our approach.
More precisely, we crawled the Debian mailing-list web archive [4]. For each message m, we extract its author a(m), the date t(m) at which it was posted (converted into UTC time), and the message it is replying to p(m) (through the in-reply-to entry), which has a corresponding author a(p(m)). This corresponds to an interaction between a(m) and a(p(m)) at time t(m) in the link stream. Some messages are not answers to any other message (they are directly sent to the mailing-list), and in this case we state that \(p(m)=m\). Such messages are called root messages.
We capture the mailing-list from January 1st, 1996 to December 31st, 2014. We obtain a dataset \(\mathscr {D}\) of \(n=722716\) emails sent from 51753 distinct email addresses.
Each root message m naturally induces a thread: it is the set \(\mathscr {T}(m)\) of messages such that m belongs to \(\mathscr {T}(m)\) and if a message \(m'\) is in \(\mathscr {T}(m)\) then all messages \(m''\) such that \(p(m'')=m'\) also belong to \(\mathscr {T}(m)\). In other words, \(\mathscr {T}(m)\) contains exactly m, the answers to m, the answers to these answers, and so on. The focus of this paper is the study of structural and temporal features of these threads.
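The recursive definition of \(\mathscr {T}(m)\) can be illustrated with a minimal sketch. It assumes that the reply relation is available as a dictionary mapping each message identifier to p(m); the function name `build_threads` and the toy identifiers are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def build_threads(parent):
    """Group messages into threads.

    `parent` maps each message id m to p(m); a root satisfies p(m) == m.
    Returns a dict mapping each root id to the set of message ids in T(root).
    """
    children = defaultdict(list)
    roots = []
    for m, p in parent.items():
        if p == m:
            roots.append(m)          # root message: replies to nothing
        else:
            children[p].append(m)    # m is an answer to p

    threads = {}
    for root in roots:
        members = set()
        stack = [root]
        while stack:                 # iterative traversal of the reply tree
            m = stack.pop()
            members.add(m)
            stack.extend(children[m])
        threads[root] = members
    return threads

# Toy data with hypothetical message ids: two roots, m1 and m4.
parent = {"m1": "m1", "m2": "m1", "m3": "m2", "m4": "m4", "m5": "m4"}
threads = build_threads(parent)
```

Messages whose parent is missing from the dictionary are never reached from a root, so they end up in no thread; this is one possible way to surface incomplete threads.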
Our data contains incomplete threads: those that have an email in our dataset but began before and/or continued after the data collection period. Some threads also exhibit inconsistencies, for instance when a reply has an earlier timestamp than the message it replies to. We remove those threads, as well as all threads that last for more than 2 years, or that start less than 2 years before the end of our data collection.
After this bias correction procedure, we obtain \(n=554233\) emails, involving 34648 distinct authors over a duration of 598532269 s (18 years, 11 months and 19 days) and 116999 threads.
3 Framework and Notations
Our goal is to study the structural and temporal properties of threads within a mailing-list archive. In order to do so, we propose a model of the data that captures both its temporal and structural nature, and allows for easy manipulation of threads.
We model our mailing-list archive as the link stream \(D = (T_D, V_D, E_D)\) with \(T_D = [\alpha ,\omega ]\), \(V_D = \{a(m): m\in \mathscr {D'}\}\) and \(E_D = \{(t(m), a(m),a(p(m))): m\in \mathscr {D'}\}\) where \(\mathscr {D'}\) is the set of emails in our dataset after cleaning. In other words, a triplet (t, u, v) in \(E_D\) indicates that individual u replied to an email from individual v at time t.
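As an illustration of this model, the following sketch builds \((T_D, V_D, E_D)\) from a list of messages. The `Message` record and its field names are hypothetical, and taking \(\alpha \) and \(\omega \) as the earliest and latest timestamps is an assumption made here for the example.

```python
from typing import NamedTuple

class Message(NamedTuple):
    mid: str      # message identifier (hypothetical field name)
    author: str   # a(m)
    time: int     # t(m), in seconds, UTC
    parent: str   # identifier of p(m); equals mid for a root message

def link_stream(messages):
    """Return (T, V, E) following the definition of D = (T_D, V_D, E_D)."""
    by_id = {m.mid: m for m in messages}
    V = {m.author for m in messages}
    # One triplet (t(m), a(m), a(p(m))) per message. Taken literally, a root
    # message (p(m) = m) yields a link from its author to itself.
    E = sorted((m.time, m.author, by_id[m.parent].author) for m in messages)
    # Assumption: alpha and omega are the first and last timestamps.
    T = (E[0][0], E[-1][0])
    return T, V, E

messages = [
    Message("m1", "alice", 10, "m1"),
    Message("m2", "bob", 25, "m1"),
    Message("m3", "alice", 40, "m2"),
]
T, V, E = link_stream(messages)
```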
Such a link stream naturally contains sub-streams: \(L' = (T', V', E')\) is a substream of \(L = (T,V,E)\) if and only if \(T'\subseteq T\), \(V'\subseteq V\) and \(E'\subseteq E\). In other words, all the interactions of \(L'\) also appear in L. Given a set of nodes S, we define the sub-stream L(S) of L induced by S as the largest sub-stream of L such that all the links in L(S) are between nodes in S.
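The induced sub-stream L(S) can be sketched as a simple filter on links. This is a minimal illustration that assumes a stream is stored as a tuple (T, V, E), with E a list of (t, u, v) triplets; the function name is hypothetical.

```python
def induced_substream(stream, S):
    """Return L(S): keep every link of L whose two endpoints are in S."""
    T, V, E = stream
    nodes = set(S) & set(V)          # only nodes that actually belong to L
    links = [(t, u, v) for (t, u, v) in E if u in nodes and v in nodes]
    # The time interval is kept whole, since L(S) is the largest such sub-stream.
    return T, nodes, links

L = ((0, 10), {"a", "b", "c"}, [(1, "a", "b"), (4, "b", "c"), (7, "a", "b")])
T_S, V_S, E_S = induced_substream(L, {"a", "b"})
```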
Any link stream \(L=(T,V,E)\) also induces a graph \(G = (V_G, E_G)\) where \(V_G = \{u: \exists t\in T, v\in V\) s.t. \((t,u,v) \in E\}\) and \(E_G = \{(u,v): \exists t\in T \) s.t. \((t,u,v)\in E \}\). In our case, the whole mailing-list archive induces the graph G(D) among authors of emails, and each thread induces a sub-graph of G(D).
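The induced graph amounts to forgetting the timestamps. The sketch below is one way to compute it, assuming links are (t, u, v) triplets and that pairs are unordered, in line with the undirected view adopted in this section; the function name is hypothetical.

```python
def induced_graph(E):
    """Return (V_G, E_G) for a list of (t, u, v) links, ignoring time."""
    V_G = set()
    E_G = set()
    for _, u, v in E:
        V_G.update((u, v))
        # frozenset makes (u, v) and (v, u) the same edge; a link from a node
        # to itself is stored as a one-element frozenset.
        E_G.add(frozenset((u, v)))
    return V_G, E_G

E = [(1, "a", "b"), (2, "b", "a"), (3, "b", "c")]
V_G, E_G = induced_graph(E)
```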
In a graph \(G=(V,E)\), a community structure is defined by a partition \(C = \{C_i\}_{i=1..k}\) of V into k communities. In other words, \(\bigcup _i C_i = V\) and \(C_i \cap C_j = \emptyset \) whenever \(i\not =j\). In a similar way, one may consider a link stream \(L = (T,V,E)\) and a partition of its links into k sub-streams \(P = \left\{ P_i = (T_i, V_i, E_i)\right\} _{i=1..k}\). In other words, for any \((t,u,v)\in E\), there exists a unique j between 1 and k such that (t, u, v) is a link of \(E_j\).
The threads in our email dataset are exactly a partition of the whole stream, which we denote by \(\mathscr {T} = \{P_i\}_{i=1..k}\) where k is the number of threads and each \(P_i\) is a sub-stream representing a thread (with our notations above, there exists a message m such that \(P_i = \mathscr {T}(m)\)). See Fig. 1.
Notice that, although the threads are a partition of the whole stream, their induced graphs may overlap: some nodes and links of G(D) belong to several sub-graphs \(G(P_i)\). As a consequence, threads do not induce a partition of G(D) into communities. Instead, one may see the partition of D into threads as a community structure, and this is the focus of our work.
Notice finally that we consider that links are undirected (i.e. \((t,u,v) = (t,v,u)\)) and happen at an instant in time (regardless, for instance, of when the message is read). Taking into account the direction and duration of links is out of the scope of this work.
4 Basic Statistics
In this section, we present the basic statistics describing the threads in our dataset and the whole archive.
The most basic description of our data certainly is the number of links (i.e. emails) they contain, the number of distinct nodes (i.e. authors) involved, the number of distinct links they contain (distinct pairs of authors in direct interaction), and their duration (time from the first email to the last one). Figure 2 displays the distribution of these values for each thread.
Fig. 2 Complementary cumulative distributions for basic statistics of our raw (solid line) and filtered (dotted line) datasets. Top left thread sizes (number of messages per thread); top right thread durations (time elapsed between the first and the last message of the thread); bottom left number of distinct authors; bottom right number of distinct pairs of authors
Fig. 3 Left Correlations between size and duration of threads. Right Correlations between size of threads and the number of authors involved
In a link stream \(L = (T,V,E)\) with \(T = [\alpha ,\omega ]\), we define, for all \((u,v) \in V\times V\), the maximal sequence \(t_{uv} = (\alpha , t_0, \dots , t_k, \omega )\) such that for all i between 0 and k, there exists \((t_i, u, v)\in E\), and for all i between 0 and \(k-1\), \(t_i \le t_{i+1}\). In other words, \(t_{uv}\) is the ordered sequence of occurrences of the link (u, v), to which we add \(\alpha \) and \(\omega \).
Fig. 4 Left Inter-contact times distribution in the Debian mailing-list dataset. Right Evolution of the \(\varDelta \)-density of the link stream for \(\varDelta \) from 1 s to 20 years
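The sequence \(t_{uv}\) can be computed directly from the links. The sketch below also derives inter-contact times as the differences between consecutive elements of \(t_{uv}\); this reading of inter-contact time is an assumption, since the text above only defines the sequence itself. Function names are hypothetical.

```python
def contact_sequence(E, u, v, alpha, omega):
    """Return t_uv = (alpha, t_0, ..., t_k, omega) for the undirected pair (u, v)."""
    times = sorted(t for (t, a, b) in E if {a, b} == {u, v})
    return [alpha, *times, omega]

def inter_contact_times(seq):
    """Differences between consecutive elements (assumed definition)."""
    return [later - earlier for earlier, later in zip(seq, seq[1:])]

E = [(5, "b", "c"), (2, "c", "b"), (9, "b", "c"), (5, "d", "e")]
seq = contact_sequence(E, "b", "c", alpha=0, omega=10)
gaps = inter_contact_times(seq)
```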
5 Interactions Within Threads
The key feature of communities is that they form dense subgroups. This section is therefore devoted to the study of the density of interactions within threads, from both structural and temporal points of view.
5.1 Density of Threads
In a graph, the density is the probability that two randomly chosen nodes are linked together. In other words, it captures the extent to which all nodes are directly connected to each other. The density of the graph G(D) induced by our dataset is \(3.139\times 10^{-4}\).
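For an undirected graph with n nodes and m edges, this probability is \(2m / (n(n-1))\). The following sketch computes it on a toy graph; ignoring links from a node to itself is an assumption made here, and the function name is hypothetical.

```python
def graph_density(nodes, edges):
    """Fraction of unordered node pairs that are linked: 2m / (n(n-1))."""
    # Deduplicate (u, v) and (v, u); self-links are ignored (assumption).
    pairs = {frozenset((u, v)) for (u, v) in edges if u != v}
    n = len(set(nodes))
    if n < 2:
        return 0.0
    return 2 * len(pairs) / (n * (n - 1))

nodes = {"a", "b", "c", "d"}
edges = [("a", "b"), ("b", "a"), ("b", "c")]
density = graph_density(nodes, edges)   # 2 distinct edges out of 6 possible pairs
```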
In order to study the \(\varDelta \)-density in our data, we first have to choose an appropriate \(\varDelta \). We use here several values which capture email dynamics at different scales: \(\varDelta = \) 1 min, 1 hour, 1 day, 1 week, 1 month, 1 year and 20 years (the whole duration of the dataset). Figure 4 (right) displays the evolution of the \(\varDelta \)-density of the stream for all these values of \(\varDelta \). It shows that the \(\varDelta \)-density is small for small values of \(\varDelta \), and converges to the density of the graph induced by the email exchanges (in our case, \(3.139\times 10^{-4}\)).
In Fig. 4 (right), the inflexion points give information on the values of \(\varDelta \) where the dynamics change. Still, looking at the density of the whole stream is very coarse and yields little information. A finer approach consists in looking at the \(\varDelta \)-density of relevant sub-streams. In our case, the threads between authors are a natural object to study.
5.2 Intra-thread Density
In our data, the inverse cumulative distributions of intra-thread \(\varDelta \)-densities are shown in Fig. 5 (left) for several values of \(\varDelta \) ranging from 1 min to 1 year. For each point x on the x-axis, the plot gives the proportion of threads in the mailing-list that have an intra-thread \(\varDelta \)-density higher than x. As expected, the larger the \(\varDelta \), the higher the density. However, there is no significant change between a \(\varDelta \) of 7 days and a \(\varDelta \) of 1 year.
Fig. 5 Left Inverse cumulative distributions of values of intra-thread \(\varDelta \)-density for different values of \(\varDelta \). Right Inverse cumulative distributions of values of inter-thread \(\varDelta \)-density for different values of \(\varDelta \)
This shows that threads are indeed dense substreams in our link streams.
6 Relations Between Threads
In the previous section, we focused on structural and temporal properties inside threads, compared to the whole link stream. We now turn to the study of relations between threads.
6.1 Inter-thread Density
Fig. 6 Correlations between inter- and intra-thread densities for different values of \(\varDelta \). a \(\varDelta = 1\) day b \(\varDelta = 1\) year
6.2 Graphs Between Threads
Relations between sub-streams \(P_i\), \(i=1..k\), may have different forms, and in particular they have a temporal and a structural nature. In order to capture the temporal relations between sub-streams, one may define the temporal overlap graph as follows: \(X = (V,E)\) with \(V = \{ i, i=1..k \}\) and there is a link (i, j) in E whenever \(P_i\) and \(P_j\) have a temporal intersection (i.e. \([\alpha _i,\omega _i] \cap [\alpha _j,\omega _j] \not = \emptyset \)). Likewise, one may define the node overlap graph as follows: \(Y = (V,E)\) with \(V = \{ i, i=1..k \}\) again and there is a link (i, j) in E whenever there is a node v involved in both \(P_i\) and \(P_j\) (i.e. there exist a t, a \(t'\), a u and a \(u'\) such that there is a link (t, u, v) in \(P_i\) and a link \((t',u',v)\) in \(P_j\)).
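Both overlap graphs can be sketched with a pairwise comparison of threads. The example below assumes that each thread is given as a list of (t, u, v) links and that \(\alpha _i\) and \(\omega _i\) are the times of its first and last link; function names are hypothetical. The double loop is quadratic in the number of threads, so a dataset of the size described here would call for a more scalable approach, such as sorting threads by start time.

```python
from itertools import combinations

def thread_summary(links):
    """Return (first time, last time, set of nodes) for one thread."""
    times = [t for (t, _, _) in links]
    nodes = {x for (_, u, v) in links for x in (u, v)}
    return min(times), max(times), nodes

def overlap_graphs(threads):
    """Return the edge sets of X (temporal overlap) and Y (node overlap)."""
    info = [thread_summary(links) for links in threads]
    X, Y = set(), set()
    for i, j in combinations(range(len(info)), 2):
        alpha_i, omega_i, nodes_i = info[i]
        alpha_j, omega_j, nodes_j = info[j]
        # Two closed intervals intersect iff each starts before the other ends.
        if alpha_i <= omega_j and alpha_j <= omega_i:
            X.add((i, j))
        if nodes_i & nodes_j:        # at least one node involved in both
            Y.add((i, j))
    return X, Y

threads = [
    [(1, "a", "b"), (4, "b", "a")],
    [(3, "c", "d"), (6, "d", "c")],
    [(8, "a", "e")],
]
X, Y = overlap_graphs(threads)
```

In this toy example, threads 0 and 1 overlap in time but share no author, while threads 0 and 2 share author a but are disjoint in time.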
Both graphs contain 116999 nodes (one per thread); the temporal overlap graph has about 2 million edges and the node overlap graph about 63 million. These graphs encode much information about relations between threads. For instance, the degree of node i in X is the number of threads active at the same time as \(P_i\).
We display in Fig. 7 (left) the correlations between the degree in X and the thread size. There is a clear correlation between the thread duration and the degree in the temporal overlap graph when threads have a duration of at least \(10^5\) s. Also, the maximal degree shows that, at some times, up to \(10^4\) threads are active simultaneously.
Fig. 7 Left Correlation between the degree in the temporal overlap graph X and the thread size. Right Correlation between the degree in the node overlap graph Y and the thread duration
6.3 Quotient Stream
Fig. 8 Top An example of a graph exhibiting communities and its corresponding quotient graph. Bottom An example of a link stream with communities and its corresponding quotient stream
Fig. 9 \(\varDelta \)-density of the link stream and the quotient stream as a function of \(\varDelta \), for \(\varDelta = 1mn, 1h, 12h, 1d, 37d, 30d, 1y\) and 20y
To deepen our understanding of our data, we capture here both the temporal and the structural nature of relations between sub-streams. We define the quotient stream induced by a partition \(P = \left\{ P_i = (T_i, V_i, E_i)\right\} _{i=1..k}\) of a link stream L as the stream \(Q = (T_Q, V_Q, E_Q)\) such that \((t,P_i,P_j)\in E_Q\) if and only if there exist \((t_1,u,v)\) in \(E_i\), \((t_2,u,v')\) in \(E_i\) and \((t,u,v'')\) in \(E_j\) with \(t_1 \le t \le t_2\). In other words, there is a node u that has a link within \(P_j\) occurring between two of its links in \(P_i\). This means that u is involved in the two sub-streams during the same time period.
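The quotient stream can be sketched by recording, for each node and each thread, the first and last time the node has a link in that thread. The example below is an illustration under stated assumptions: threads are lists of (t, u, v) links, u may be either endpoint of a link (links are undirected), and only pairs with \(i \ne j\) are reported. The function name is hypothetical.

```python
from collections import defaultdict

def quotient_stream(threads):
    """Return sorted triplets (t, i, j): some node u has a link in thread j
    at time t, between two of its links in thread i (i != j assumed)."""
    times = []   # times[j][u]: times of the links of u in thread j
    span = []    # span[i][u]: (first, last) time of the links of u in thread i
    for links in threads:
        by_node = defaultdict(list)
        for t, u, v in links:
            for x in {u, v}:         # either endpoint may play the role of u
                by_node[x].append(t)
        times.append(by_node)
        span.append({u: (min(ts), max(ts)) for u, ts in by_node.items()})

    threads_of = defaultdict(list)   # threads in which each node appears
    for i, nodes in enumerate(span):
        for u in nodes:
            threads_of[u].append(i)

    E_Q = set()
    for u, indices in threads_of.items():
        for i in indices:
            first, last = span[i][u]
            for j in indices:
                if i == j:
                    continue
                for t in times[j][u]:
                    if first <= t <= last:
                        E_Q.add((t, i, j))
    return sorted(E_Q)

threads = [
    [(1, "a", "b"), (5, "b", "a")],   # thread 0: a and b active from 1 to 5
    [(3, "a", "c")],                  # thread 1: a has a link at time 3
    [(9, "d", "e")],                  # thread 2: unrelated to the others
]
E_Q = quotient_stream(threads)
```

Thread 2 appears in no triplet, which mirrors the threads reported as unrelated to any other in the dataset.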
The quotient stream induced by the threads in our dataset has 12281269 links and involves 68524 distinct nodes (i.e. threads). Since our dataset contains 116999 threads, this implies that 48475 threads are not related to any other thread.
Figure 9 shows the \(\varDelta \)-density of the quotient stream and the \(\varDelta \)-density of the original stream for different values of \(\varDelta \). The quotient stream is not very \(\varDelta \)-dense, i.e. threads are not densely connected to each other, though it is slightly denser than the original stream for large values of \(\varDelta \). This is comparable to communities in graphs, which are dense subsets that are loosely connected to each other.
Acknowledgments
This work is supported in part by the French Direction Générale de l’Armement (DGA), by the Thales company, by the CODDDE ANR-13-CORD-0017-01 grant from the Agence Nationale de la Recherche, and by grant O18062-44430 of the French program PIA—Usages, services et contenus innovants.
References
- 1. Dorat, R., Latapy, M., Conein, B., Auray, N.: Multi-level analysis of an interaction network between individuals in a mailing-list. In: Annales des télécommunications, vol. 62, pp. 325–349. Springer (2007)
- 2. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
- 3. Sowe, S., Stamelos, I., Angelis, L.: Identifying knowledge brokers that yield software engineering knowledge in OSS projects. Inf. Softw. Technol. 48(11), 1025–1033 (2006)
- 4. SPI. Debian mailing-list archive: https://lists.debian.org/debian-user/
- 5. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.S.: Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pp. 687–696, New York, NY, USA. ACM (2007)
- 6. Viard, J., Latapy, M.: Identifying roles in an IP network with temporal and structural density. In: 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 801–806. IEEE (2014)
- 7. Wang, Q.: Link prediction and threads in email networks. In: 2014 International Conference on Data Science and Advanced Analytics (DSAA), pp. 470–476. IEEE (2014)