1 Introduction

Online Social Networks [14] (OSNs) are cornerstones of today’s communication. Indeed, OSNs such as Facebook are nowadays used as the main communication channel by people because of their power and versatility [27]. It is very common to see people share their personal information, photos, etc., with friends and with strangers [15]. This communication mechanism is particularly relevant in virtual online communities, which represent one of the most important emerging features of OSNs. Indeed, a current trend of Social Media is to offer members the opportunity to establish and join groups of people online by creating virtual communities based on similar interests. These virtual communities can be called Online Social Groups (OSGs) [36, 40], and they model a set of users interacting in discussions about real-world events, hobbies, similar interests, etc. In OSGs, people have the chance to discuss ideas, doubts or interests with a large number of people who would otherwise be difficult to reach in real life or through traditional communication systems, which are mostly dedicated to private, one-to-one messaging [10, 12]. OSGs create environments where information spreading tends to be even more effective [4, 13, 24] than the one seen in the word-of-mouth effect [1]. Moreover, in contrast to the usual use of an OSN, where a user knows all of their contacts, users in OSGs do not necessarily know each other, and there is no way of limiting the interactions within the group.

In [2], an OSG is defined as a collection of people that falls into one of two categories: groups that are an extension of social identification (individuals affiliate on the basis of organizational memberships, gender, age, etc.) and groups built around structured communication (social support, political debate, or similar interests). Despite the growing importance of OSNs in facilitating social communication, there has been limited research focusing on the communication patterns in OSGs. OSGs, as typical complex networks, can be modelled by time-dependent graphs [6]. Indeed, they need to be studied as dynamic networks to understand their real characteristics, as demonstrated in [16, 17] for community detection.

1.1 Motivations and contributions

OSGs can be represented as temporal graphs, in which significant recurring patterns of interaction between nodes, or communication patterns, can be found. Communication patterns can be identified using several graph concepts: from motifs [28, 29] to graphlets [23], from subgraphs [37] to temporal greedy walks [38]. In particular, the study of temporal motifs has attracted a lot of interest and has shown how motifs can help to understand particular characteristics of human behaviour, such as homophily [9, 42], mobility [39], preferences [35], and trends [26], and how they can be applied to other domains, such as the human brain [3, 11], stock prediction [25], weather prediction [31] and many other fields. Today, one of the main obstacles to understanding communication patterns with current proposals is the use of fixed “size” structures [23]. In particular, motif detection, which is the most used approach, only deals with structures of limited size in terms of the nodes and edges involved in the motif, due to the complexity of the problem. This severely limits the usefulness of the approach if we want to capture arbitrarily complex communication patterns. The study of small and fixed motifs can help in understanding common communication patterns, but communication patterns do not have, in principle, a fixed structure and are not limited to a few nodes or edges. Other notions, such as subgraphs and temporal greedy walks, try to overcome this limitation by building models that are not limited in size. However, by not fixing any structure, they completely lose any kind of causality relation between the interactions that make up the pattern. In this paper, we propose a new model for the identification of patterns in OSGs that overcomes the size constraint typical of motifs and graphlets and, at the same time, overcomes the lack of structure and of an explicit causality relation typical of temporal subgraphs and temporal greedy walks.
To this aim, we propose and define the concept of Incremental Communication Pattern. We apply this concept to a set of Facebook Groups, because the study of temporal communication patterns in the scenario of OSGs is still missing. We study a set of five patterns designed specifically for social networks, such that possible specific roles of the users within the OSGs emerge. For the sake of readability, we summarize our contributions as follows:

  • we propose and formalize the definition of Incremental Communication Pattern, using a generic temporal graph formalism, by means of a basic pattern and an incremental rule;

  • we propose and formalize five Incremental Communication Patterns which are crucial in OSGs to identify specific communication patterns and the social role of the actors involved;

  • we study both the recurring and the maximum size, defined as the number of edges appearing in the largest pattern detected, for each Incremental Communication Pattern, in order to understand the complexity of the related communication, by exploiting a real dataset of 17 Facebook groups.

The rest of the paper is structured as follows. In Sect. 2, we present an overview of the state of the art in terms of motifs, graphlets, subgraphs and greedy walks in OSNs. In Sect. 3, we describe in detail our scenario, and in Sect. 4 we propose our idea of Incremental Communication Patterns. In Sects. 5 and 6, we describe the dataset and the obtained results, respectively. Finally, in Sect. 7 we draw the conclusions and outline possible future work.

2 Related work

Many complex networks, such as Online Social Networks, are considered temporal networks [22]. Indeed, relationships appear and disappear due to the temporal evolution of social relationships [20], or due to the offline/online state of users. Usually, social networks are modelled by using a graph. Graphs are a natural representation of a set of entities and the relationships among them and, when a social network is considered as a temporal network, temporal graphs are the general model used to capture its dynamic nature. Small subgraph patterns, such as motifs or graphlets, together with other indicators, are crucial to understand the structure and the evolution of graphs [34]. When dealing with social networks as temporal networks, patterns are affected by the temporal order of interactions, and they are represented by temporal motifs, which take into account the changes of the network. A first notion of temporal network motif, proposed in [34], defines it as a sequence of edges that are time-ordered and confined within a temporal interval of a given length. Instead, in [29] temporal motifs are defined as equivalence classes of temporal subgraphs whose events are consecutive and happen at a distance smaller than a fixed value. As for studies that evaluate temporal motifs in complex networks, important contributions have been presented in the field of communication networks (mobile or social networks), because temporal motifs are a useful tool to represent and analyse temporal communication patterns in order to find important characteristics of a social network, as well as anomalies and unexpected patterns.

Zhao et al. [47] extract motifs from two datasets derived from Call Detail Records and Facebook. The authors find the most common motifs using a time window of 4 hours. In [46], the authors explore 3-event temporal motifs in six datasets to understand human interaction patterns, and discover three dominant temporal motifs: Star, Ordered-chain, and Ping-Pong, with the corresponding interaction patterns of Leader, Queue, and Feedback, respectively. Kovanen et al. [29] take advantage of the null model concept to show how factors like sex and age may have an influence on communication patterns. This property is called temporal homophily, i.e. the tendency of individuals with the same attributes to participate in the same communication patterns. Creusefond et al. [7] study temporal motifs by using the community structure. From the experiments, they observe that star and chain motifs are frequent inside communities, where the same set of actors is involved, whereas spam, ping-pong, and triangle motifs occur across different communities (inter-community edges). In [5], the authors find triadic motifs in a Science Library dataset, because a triad is a pattern that connects three actors, and it is considered to be a fundamental structural pattern of social networks [21]. In [48], the authors focus on cohesive social groups built by exploiting relationships established by mobile phone. They propose a methodology to identify cohesive groups and extract temporal motifs to show how members of social groups interact by means of calls and text messages. Liu et al. [30] analyse temporal motifs by using the notion of stochastic temporal network motif, which models the sequential dependency of communications within a communication pattern using a first-order Markov chain. Xuan et al. [45] use temporal motifs to reveal collaboration patterns in task-oriented social networks, which are networks that contain different types of nodes and links and identify people who collaborate to produce different kinds of artifacts, such as movies, music, etc. In [32], the author uses the concept of temporal motif to characterize the behaviour of the nodes in the network. Lastly, Wu et al. [44] enhance the definition of temporal motifs by considering the labels on the edges.

Graphlets have also been studied in a temporal fashion. In [23], the authors tackle the problem of counting all the possible graphlets with a given number of nodes and edges. They also try to characterize graphs and nodes of the graphs based on the graphlets found. Even though they present an approach which is not limited in the number of nodes or edges involved in the graphlet, they limit the experimental evaluation to graphlets made of up to 10 nodes and 10 edges.

The idea of time-respecting subgraph, introduced in [37], is very similar to that of graphlets. It consists in putting together all the edges which share at least one endpoint and which happen within a time threshold. The approach is based on executing a breadth-first search starting from a given seed interaction, with an additional temporal constraint.

Finally, the authors of [38] study greedy walks on temporal networks to detect burst trains and dominant factors. The intuition behind this work is that, if a set of users is particularly active and its members interact a lot with each other, a temporal greedy walk will remain stuck among these few nodes. This approach is also not limited in the number of nodes and edges that can be included in the structure; however, the returned walk may not be relevant, because no structure is required and because of the greediness of the walks returned.

To the best of our knowledge, what is lacking in the literature is a study which focuses on the longest meaningful patterns that one can find. While they give generic definitions, current proposals of temporal motifs only study the problem with a fixed and rather small size, which means a small number of nodes (generally no more than 4). Other works either still focus on short patterns and count the most frequent ones, without really grasping the fact that human communication is not bound in size, or find structureless patterns, which are not capable of isolating very specific and out-of-the-box behaviours. Instead, in this work we define and analyse Incremental Communication Patterns in order to discover recurring and structured sets of interactions, with a particular interest in the maximum pattern length, to understand specific social roles, such as influencers or bots.

3 Fundamentals

In this section, we introduce the reader to Online Social Groups, which are the scenario we consider in this paper, and to the basics of modelling interactions via a temporal graph.

3.1 Online social groups

One of the main functionalities currently offered by Social Media is the possibility of creating groups of Social Media members [10, 43], such as Facebook Groups, the circles in Google+, etc. A specific property of OSGs is that people do not necessarily need to know all the other members of the group, meaning that no explicit consent is required to start interacting.

Inside OSGs, users can typically write posts, and other users can interact with posts in a number of ways. The most significant way of interacting with posts is via written comments [12]. In many OSG platforms, users can also reply to these comments with other comments, just as with the post. This creates a potentially unbounded tree-like structure of interactions between users, which enables very complex communication structures among them. In this work, we focus on the OSG scenario because of the ever-increasing interest of people in joining virtual communities and the lack of an in-depth analysis of both the characteristics and the communication patterns in this specific case. Indeed, the study of temporal communication patterns is very important to understand the role of a user in a virtual community, such as a social group. Roles of the users, such as the so-called influencers, are usually analysed by exploiting centrality measures on a graph representing all the interactions of the users. However, this strategy is too general and does not take into account any causality relation among the interactions of the users. Thanks to the idea of Incremental Communication Patterns, we can not only identify the most frequent communication patterns, but also study the characteristics of the communication (engagement, length, etc.) to detect the activity of users in a time span. This is a tool to study anomalies and specific social roles of users, because the rules that define the patterns try to capture specific behaviours of the OSG scenario. Finally, it is also important to note that the proposed approach is also valid for the analysis of OSNs, and it can be used in several other contexts, such as the analysis of network packets, of the Bitcoin transaction network, and of email communication networks, to name a few.

3.2 Temporal multigraph: definition

To study communication patterns in OSGs, we are interested in modelling how the users in the group interact with each other. To this aim, we introduce the interaction graph as a graph in which the nodes represent the users of the OSG and the edges represent the interactions between users. It is also of primary importance to label each edge of this graph with the timestamp at which the interaction happened, in order to establish possible causality relations. Such an interaction graph can be modelled with a temporal multigraph \(G=(V,E)\), where V is the set of nodes, E is the set of edges, and each edge \(e\in E\) is a tuple \(e=(s, d, t)\), where \(s\in V\) is the source of the edge, \(d\in V\) is the destination of the edge, and \(t\in \mathbb {R}\) is the timestamp at which an event, an interaction in our scenario, happens. Using the temporal multigraph formalism, we can model the users as the set of nodes V of the graph, and the interactions as the set of edges E. As we already said in Sect. 3.1, we identify as interactions comments to posts and comments to other comments, but we do not consider writing a post an interaction itself. While it is easy to know who wrote the post, it is not easy to determine to whom in particular the post is addressed. For this reason, we assume that posts are just handles from which real interactions can start. For each comment to a post, we create an edge whose source is the user who wrote the comment and whose destination is the author of the post. Note that if a user writes a comment to one of their own posts, an interaction towards themselves is generated. The case in which a user writes a reply to a comment is analogous: an interaction is generated from the author of the reply to the author of the comment.
For example, consider the case in which user A writes a post at time \(t_a\), then user B writes a comment to A’s post at time \(t_b\), and finally user C writes a comment to B’s comment at time \(t_c\) (see Fig. 1 for a similar example). This results in two edges in the graph: \((B, A, t_b)\) and \((C, B, t_c)\). Due to the possible presence of bots and other tools for spamming, we may encounter two interactions that originate from the same user, are directed to the same user, and happen at the same time. To be able to distinguish these two interactions, we suppose that each interaction has a unique identifier.
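The construction of the edge set can be sketched as follows. This is only an illustrative sketch: the `Message` record, its field names, and the `build_edges` helper are hypothetical and are not part of the original formalism. Posts produce no edge, while every comment or reply produces one edge from its author to the author of the message it answers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    """A post, comment, or reply (hypothetical record layout)."""
    msg_id: str                      # unique identifier of the message
    author: str
    timestamp: float
    parent_id: Optional[str] = None  # None for a post


def build_edges(messages):
    """Return temporal edges as (edge_id, source, destination, timestamp)."""
    by_id = {m.msg_id: m for m in messages}
    edges = []
    for m in messages:
        if m.parent_id is None:
            continue  # posts are only handles from which interactions start
        parent = by_id[m.parent_id]
        # The commenter is the source, the author of the answered message
        # is the destination (a self-comment yields a self-loop).
        edges.append((m.msg_id, m.author, parent.author, m.timestamp))
    # Sort by time; the unique id breaks ties between simultaneous edges.
    return sorted(edges, key=lambda e: (e[3], e[0]))


# User A posts, B comments on the post, C replies to B's comment.
thread = [
    Message("p1", "A", 1.0),
    Message("c1", "B", 2.0, parent_id="p1"),
    Message("c2", "C", 3.0, parent_id="c1"),
]
edges = build_edges(thread)
```

Keeping the message identifier in each edge is one simple way to distinguish two interactions that share source, destination, and timestamp.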

Fig. 1 A fake post of an OSG. Considering this interaction, we will record two edges. The first edge is (Pr0Usr, New User, t\(_{1}\)) and corresponds to the comment by Pr0Usr to the post made by New User, the second one is (New User, Pr0Usr, t\(_{2}\)) and corresponds to the comment by New User to Pr0Usr

4 Incremental communication patterns

In this section, we introduce the concept of Incremental Communication Pattern, giving a generic idea of its structure and the motivation behind it. Then we give the definitions of five social patterns specifically designed for the scenario of OSGs. A communication pattern can be simply seen as a set of interactions which indicates that a specific communication happened among a set of users. The communication pattern does not only specify the direction of each interaction, but it also specifies the relations among the involved users, because the interactions have explicit sources and destinations. The biggest limitation of (fixed) communication patterns is that they are not able to capture, per se, the fact that communication is not bound by the number of users involved or by the number of interactions. To overcome this limitation, we introduce the concept of Incremental Communication Pattern. An Incremental Communication Pattern enhances the general idea of a communication pattern, usually composed of few users and interactions, by introducing a rule to add more users and interactions to the pattern. By applying the rule recursively, we can thus generate more complex communication patterns in an incremental fashion.

To detect an Incremental Communication Pattern, it is necessary to find, in the temporal graph of interactions, a basic pattern and then to apply the incremental rule as many times as possible to create the largest instance of the communication pattern, according to the order in which the interactions appear. To be able to classify the different instances of a pattern, which depend on how many times we were able to apply the incremental rule to a basic pattern, we also define the size of an incremental pattern as the number of edges appearing in the pattern. We use the letter k to refer to the size of the pattern, and we say that an instance of the pattern has size k if it contains k temporal edges. As we will present later in this section, each basic pattern has size 2. It is impossible to define a smaller basic pattern for two reasons: a smaller basic pattern would cause confusion, as the basic patterns could not be distinguished from each other; moreover, a pattern consisting of one edge is just an interaction, and it completely loses any structure. It is also worth noting that there is no upper bound on the size of an instance of an Incremental Communication Pattern, due to its recursive definition.
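The detection scheme described above (find a basic pattern, then apply the incremental rule for as long as possible) can be sketched generically. The function `grow` and the two predicates are hypothetical names introduced only for illustration; the sketch scans a time-sorted edge list greedily and is not meant to reproduce the detection algorithm actually used in the experiments. As an example rule it uses the one-way couple of Sect. 4.4, under the assumption that u and v are distinct users.

```python
def grow(edges, is_basic, can_extend):
    """Greedy left-to-right detection of one incremental pattern instance.

    edges: list of (source, destination, timestamp), sorted by timestamp.
    is_basic(e1, e2): True if the two edges form the basic pattern (size 2).
    can_extend(pattern, e): True if the incremental rule accepts edge e.
    Returns the edges of the instance, or [] if no basic pattern is found.
    """
    for i, e1 in enumerate(edges):
        for j in range(i + 1, len(edges)):
            if is_basic(e1, edges[j]):
                pattern = [e1, edges[j]]
                # Apply the incremental rule as many times as possible.
                for e in edges[j + 1:]:
                    if can_extend(pattern, e):
                        pattern.append(e)
                return pattern
    return []


# Example rules: one-way couple (same source, same destination, increasing time).
def is_basic(e1, e2):
    return e1[0] != e1[1] and e1[:2] == e2[:2] and e1[2] < e2[2]


def can_extend(pattern, e):
    return e[:2] == pattern[0][:2] and pattern[-1][2] < e[2]


edges = [("v", "u", 1), ("x", "y", 2), ("v", "u", 3), ("v", "u", 4)]
instance = grow(edges, is_basic, can_extend)
k = len(instance)  # size of the instance: number of temporal edges
```

Here the basic pattern is formed by the edges at times 1 and 3, and the rule is applied once more for the edge at time 4, so the instance has size 3.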

The idea we propose is, to a certain degree, closely related to that of motif, already present in the literature. In fact, finding an occurrence of the basic pattern in the graph is the same problem as motif detection, that is, a subgraph isomorphism problem. However, to model the evolution of a basic pattern over time due to the addition of arbitrary interactions, we use an incremental rule instead of defining a new motif.

Fig. 2 The disadvantage of having fixed time windows. (a) A fixed time window approach may cause interactions belonging to the same pattern, denoted with “X” in the figure, to be split in two distinct time windows. (b) The timeout approach is much better for searching maximal patterns

Another very important aspect to consider when detecting such communication patterns, which have no fixed structure, is time. In particular, we want to set a timeout for the pattern, that is, an amount of time within which all the interactions of a given instance of a pattern have to appear to be considered part of the pattern. A similar idea is already present in the literature and it is called time window [41], but the concept of time window is usually bound to the absolute time dimension. While this is still a reasonable approach in many fields, this is not the case in our scenario. In fact, using a fixed time window, which amounts to slicing time into fixed-size buckets and then studying the problem within each bucket separately, may introduce a problem. Suppose we have the situation depicted in Fig. 2 and that we choose a time window \(\Delta t\) such that time is divided as in Fig. 2a. If all the interactions are part of the same pattern, we lose part of the pattern because of the unfortunate division of time. A better approach is the one in Fig. 2b, in which the detection process starts at the very first interaction of a pattern. In this case, the temporal aspect of the problem is directly bound to each instance of a pattern and, if the timeout \(\Delta t\) is big enough, we are able to detect the whole pattern, rather than just a part of it. In any case, the definition of Incremental Communication Pattern is decoupled from the concept of time window, so that a custom time window can be defined depending on the scenario we want to analyse.
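The difference between the two approaches can be illustrated with a small sketch. The helper names `fixed_windows` and `timeout_window`, the sample timestamps, and the choice of an inclusive timeout boundary are assumptions made only for this example.

```python
def fixed_windows(timestamps, delta):
    """Group timestamps into absolute buckets [n*delta, (n+1)*delta)."""
    buckets = {}
    for t in timestamps:
        buckets.setdefault(int(t // delta), []).append(t)
    return [buckets[n] for n in sorted(buckets)]


def timeout_window(timestamps, delta):
    """Keep the timestamps within delta of the first interaction.

    The window is anchored to the pattern instance, not to absolute time.
    """
    start = timestamps[0]
    return [t for t in timestamps if t - start <= delta]


# Four interactions belonging to the same pattern, with delta = 10.
times = [8, 9, 11, 12]
split = fixed_windows(times, 10)    # [[8, 9], [11, 12]]: the pattern is cut
whole = timeout_window(times, 10)   # [8, 9, 11, 12]: the pattern is kept
```

With the fixed buckets the boundary at time 10 separates the interactions into two windows, whereas the timeout anchored at the first interaction keeps all four together.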

In this work, we study and evaluate four specific Incremental Communication Patterns: k-chain, k-in-star, k-out-star, and k-ping-pong. Moreover, we introduce a new communication pattern for OSGs, namely the k-one-way couple. Each Incremental Communication Pattern presented in this paper expresses an interaction template which can be commonly observed in the context of OSGs, as we will explain in detail in the next subsections. In the following, we firstly give a definition of the Incremental Communication Pattern, in terms of the basic pattern and the incremental rule, and then we discuss the rationale behind the proposed patterns.

4.1 The chain pattern

Definition 1

(k-chain) The k-chain pattern can be expressed using the temporal graph formalism with the following basic pattern and incremental rule:

$$\begin{aligned} \begin{aligned} \text {basic}\,\text {pattern}:\,&KC(2)=(V_{KC_2}, E_{KC_2})\\&V_{KC_2}=\left\{ u_i\vert i\le 3 \wedge i\in \mathbb {N} \wedge \forall j<i. u_j\ne u_i\right\} \\&E_{KC_2}=\left\{ (u_2, u_1, t_1), (u_3, u_2, t_2)\vert t_1<t_2\right\} \\\\ \text {incremental}\,\text {rule}:\,&KC(k+1)=(V_{KC_{k+1}}, E_{KC_{k+1}})\\&V_{KC_{k+1}}=V_{KC_{k}}\cup \left\{ u_{k+2}\right\} \\&E_{KC_{k+1}}=E_{KC_{k}}\cup \left\{ (u_{k+2}, u_{k+1}, t_{k+1}) \vert t_k<t_{k+1} \right\} \\ \end{aligned}\end{aligned}$$
(1)

The k-chain basic pattern consists of a 3-node path graph, where each node is linked to the previous and the following one according to their labels. Moreover, it is also required that the interactions happen in a specific order, that is, the i-th edge must have the \((i+1)\)-th node as source and the i-th node as destination. The incremental rule of the pattern aims at making the chain longer, trying to attach a new edge, and a new node, after the last node that appeared in the pattern (see Fig. 3).

In our scenario, this pattern is important because it indicates how well a piece of content is able to engage more and more different users. For instance, if a Facebook post creates a lengthy k-chain pattern, we can say that the original post, and the discussion around it, is extremely engaging. Such a pattern can reveal the importance of a piece of content.
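A minimal sketch of how a k-chain could be grown from a seed edge is given below. The function `grow_chain`, its `timeout` parameter, and the greedy left-to-right strategy are assumptions for illustration; a greedy scan is not guaranteed to return the longest chain in every graph. Edges are (source, destination, timestamp) tuples sorted by timestamp.

```python
def grow_chain(edges, seed_index, timeout):
    """Greedily grow a k-chain starting from edges[seed_index].

    Each accepted edge must come from a user not yet in the chain, point to
    the most recently added user, and have a strictly larger timestamp.
    Returns the chain edges, or [] if the basic pattern (size 2) is not found.
    """
    s, d, t0 = edges[seed_index]
    if s == d:
        return []
    nodes = [d, s]                 # u_1 and u_2 of Definition 1
    chain = [edges[seed_index]]
    for src, dst, t in edges[seed_index + 1:]:
        if t - t0 > timeout:
            break                  # edges are time-sorted: nothing later fits
        if dst == nodes[-1] and src not in nodes and t > chain[-1][2]:
            nodes.append(src)
            chain.append((src, dst, t))
    return chain if len(chain) >= 2 else []


edges = [("B", "A", 1), ("C", "B", 2), ("D", "C", 3), ("A", "D", 4)]
chain = grow_chain(edges, 0, timeout=10)
```

In this example the last edge is rejected because user A is already part of the chain, so the detected instance has size 3.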

4.2 The star patterns

The star is an important pattern which can reveal users that perform particular roles, such as the influencer or the bot. In this work, we identify two types of star patterns: the in-star and the out-star. In both cases, at least three nodes connected in a star fashion are required, one of which is the central node and the others are peripheral ones.

Definition 2

(k-in-star) A k-in-star can be expressed using the temporal graph formalism with the following basic pattern and incremental rule:

$$\begin{aligned} \begin{aligned} \text {basic}\,\text {pattern}:\,&KIS(2)=(V_{KIS_2}, E_{KIS_2})\\&V_{KIS_2}=\left\{ u_c \right\} \cup \left\{ u_i\vert i\le 2 \wedge i\in \mathbb {N}\right\} \\&E_{KIS_2}=\left\{ (u_{i}, u_{c}, t_i)\vert i\in \mathbb {N} \wedge i\le 2 \wedge t_i<t_{i+1}\right\} \\\\ \text {incremental}\,\text {rule}:\,&KIS(k+1)=(V_{KIS_{k+1}}, E_{KIS_{k+1}})\\&V_{KIS_{k+1}}=V_{KIS_{k}}\cup \left\{ u_{k+1}\right\} \\&E_{KIS_{k+1}}=E_{KIS_{k}}\cup \left\{ (u_{k+1}, u_{c}, t_{k+1}) \vert t_k<t_{k+1} \right\} \\ \end{aligned}\end{aligned}$$
(2)

Fig. 3 The k-chain pattern. (a) The basic pattern. (b) How the incremental rule increases the size of the pattern

The basic pattern defines a star graph in which all nodes have one outgoing edge, except the central node, which has no outgoing edges.

The destination node of all the edges is the central node, which is the node having a special role to be identified. The incremental rule models the ability of the central node to attract even more different interactions towards it (Fig. 4). In fact, a new interaction can be added to the pattern if it involves a new user, and if the interaction has as source node the new user and as destination node the central user.

The communication patterns among the members of an OSG can provide important information about the role played by some individuals of the group. Indeed, a k-in-star pattern represents a communication pattern where the central node attracts a high number of interactions from the other members, resulting in an influence process. In our scenario, the number k of edges of the communication pattern determines the number of group members who interact with the specific user of the group, which is a typical trait of influencers.
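The in-star rule can be sketched as follows, for a given candidate central user. The function `grow_in_star` and its `timeout` parameter are hypothetical; the sketch assumes that peripheral users must be distinct from each other and from the centre, and that edges are (source, destination, timestamp) tuples sorted by timestamp.

```python
def grow_in_star(edges, center, timeout):
    """Greedily grow a k-in-star around `center`.

    An edge is accepted if it points to the centre, comes from a user not
    yet in the star, and has a strictly larger timestamp than the last
    accepted edge. Returns [] if fewer than two edges are found.
    """
    star = []
    sources = set()
    for src, dst, t in edges:
        if star and t - star[0][2] > timeout:
            break  # edges are time-sorted, so nothing later can fit
        if dst != center or src == center or src in sources:
            continue
        if star and t <= star[-1][2]:
            continue  # timestamps must strictly increase
        sources.add(src)
        star.append((src, dst, t))
    return star if len(star) >= 2 else []


edges = [("a", "c", 1), ("b", "c", 2), ("a", "c", 3),
         ("c", "d", 4), ("e", "c", 5), ("f", "c", 50)]
star = grow_in_star(edges, "c", timeout=10)
k = len(star)  # number of distinct members who interacted with "c"
```

The second interaction from user a does not enlarge the star, and the interaction from user f falls outside the timeout, so the instance has size 3.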

Definition 3

(k-out-star) A k-out-star can be expressed using the temporal graph formalism with the following basic pattern and incremental rule:

$$\begin{aligned} \begin{aligned} \text {basic}\,\text {pattern}:\,&KOS(2)=(V_{KOS_2}, E_{KOS_2})\\&V_{KOS_2}=\left\{ u_c \right\} \cup \left\{ u_i\vert i\le 2 \wedge i\in \mathbb {N}\right\} \\&E_{KOS_2}=\left\{ (u_{c}, u_{i}, t_i)\vert i\in \mathbb {N} \wedge i\le 2 \wedge t_i<t_{i+1}\right\} \\\\ \text {incremental}\,\text {rule}:\,&KOS(k+1)=(V_{KOS_{k+1}}, E_{KOS_{k+1}})\\&V_{KOS_{k+1}}=V_{KOS_{k}}\cup \left\{ u_{k+1}\right\} \\&E_{KOS_{k+1}}=E_{KOS_{k}}\cup \left\{ (u_{c}, u_{k+1}, t_{k+1}) \vert t_k<t_{k+1} \right\} \\ \end{aligned}\end{aligned}$$
(3)

The basic pattern of a k-out-star defines once again a star graph, but this time the direction of the edges is opposite: all edges originate from the central node and have as destination a peripheral node. The incremental rule accepts a new edge with the central node as source node and a new peripheral node as destination node, and it is used to detect new interactions made by the central node towards new users.

In our scenario, this communication pattern is most useful to detect very active users who tend to interact with everyone, and combining this information with other patterns can be useful in detecting spammers and bots. Indeed, we expect these nodes to produce a huge number of interactions but not to receive any, because they are ignored by other real users. We also expect that these users aim to reach a huge number of other users, but not to establish any well-structured communication.
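The intuition above can be sketched by pairing the size of the out-star of a user with the number of interactions that user receives. Both helpers, `out_star_size` and `received`, are hypothetical names, and comparing the two values is only an illustrative heuristic, not a classifier proposed in this work. Edges are (source, destination, timestamp) tuples sorted by timestamp.

```python
def out_star_size(edges, center):
    """Size of the out-star of `center`: one edge per new destination."""
    seen = set()
    size = 0
    last_t = float("-inf")
    for src, dst, t in edges:
        if src == center and dst != center and dst not in seen and t > last_t:
            seen.add(dst)
            size += 1
            last_t = t
    return size if size >= 2 else 0  # the basic pattern needs two edges


def received(edges, node):
    """Number of interactions directed to `node` by other users."""
    return sum(1 for src, dst, _ in edges if dst == node and src != node)


edges = [("bot", "x", 1), ("bot", "y", 2), ("bot", "x", 3),
         ("bot", "z", 4), ("a", "b", 5)]
reach = out_star_size(edges, "bot")   # 3 distinct users reached
replies = received(edges, "bot")      # 0 interactions received
```

A user with a large out-star and no received interactions matches the behaviour described above, although further evidence would be needed before labelling the user as a bot.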

Fig. 4 The k-in-star pattern. (a) The basic pattern. (b) How the incremental rule increases the size of the pattern. The k-out-star pattern has the same structure, but the source and destination nodes of the edges are inverted

Fig. 5 The k-ping-pong pattern. (a) The basic pattern. (b) How the incremental rule increases the size of the pattern

4.3 The ping-pong

Definition 4

(k-ping-pong) The k-ping-pong pattern can be expressed using the temporal graph formalism with the following basic pattern and incremental rule:

$$\begin{aligned} \begin{aligned} \text {basic}\,\text {pattern}:\,&KPP(2)=(V_{KPP_2}, E_{KPP_2})\\&V_{KPP_2}=\left\{ u, v \right\} \\&E_{KPP_2}=\left\{ (u, v, t_1), (v, u, t_2)\vert t_1<t_2\right\} \\\\ \text {incremental}\,\text {rule}:\,&KPP(k+2)=(V_{KPP_{k+2}}, E_{KPP_{k+2}})\\&V_{KPP_{k+2}}=V_{KPP_{k}}\\&E_{KPP_{k+2}}=E_{KPP_{k}}\cup \left\{ (u, v, t_{k+1}), (v, u, t_{k+2})\vert t_k<t_{k+1}<t_{k+2}\right\} \\ \end{aligned}\end{aligned}$$
(4)

This pattern is possibly the most particular one, as the basic pattern is defined for \(k=2\) and, because of the incremental rule, the pattern is defined only for even k. The basic pattern of the k-ping-pong is composed of two nodes u and v and two edges connecting them. One edge has u as source node and v as destination node, while the other edge has v as source node and u as destination node. The incremental rule aims at detecting new occurrences of an interaction happening from u to v directly followed by another interaction happening in the opposite direction. In fact, the incremental rule adds two edges to the pattern, and this is why the pattern is defined only for even k (see Fig. 5).

In our scenario, this pattern is mostly useful to detect pairs of people who tend to interact a lot in a riposte and counter-riposte fashion. Moreover, each edge in the sequence must have inverted source and destination nodes with respect to the previous one in the sequence. Therefore, we aim to capture a causality relation between interactions happening between the two users. The k-ping-pong pattern allows us to recognize users who are involved in reciprocal interactions (such as a discussion among members which goes on as long as one replies to the other).
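The alternation required by the k-ping-pong can be sketched as follows. The function `ping_pong_size` is a hypothetical helper; as a simplifying assumption, edges between the two users that do not match the expected direction are skipped rather than ending the pattern, and a trailing unanswered interaction is discarded so that the size stays even. Edges are (source, destination, timestamp) tuples sorted by timestamp.

```python
def ping_pong_size(edges, u, v):
    """Size of the ping-pong between u and v that starts with a u -> v edge."""
    if u == v:
        return 0
    expected = (u, v)
    size = 0
    last_t = float("-inf")
    for src, dst, t in edges:
        if (src, dst) == expected and t > last_t:
            size += 1
            last_t = t
            expected = (dst, src)  # the next edge must go the other way
    # The pattern is defined only for even k: drop an unanswered last edge.
    return size - size % 2


edges = [("u", "v", 1), ("v", "u", 2), ("u", "v", 3),
         ("u", "v", 4), ("v", "u", 5), ("u", "v", 6)]
k = ping_pong_size(edges, "u", "v")
```

In this example the edge at time 4 repeats the direction of the previous one and is skipped, and the final edge at time 6 has no answer, so the detected instance has size 4.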

4.4 The one way couple pattern

This pattern is not present in literature, and it is part of our contribution, although it is similar to the k-ping-pong.

Definition 5

(The k-one-way couple) The k-one-way couple pattern can be expressed using the temporal graph formalism with the following basic pattern and incremental rule:

$$\begin{aligned} \begin{aligned} \text {basic}\,\text {pattern}:\,&KOWC(2)=(V_{KOWC_2}, E_{KOWC_2})\\&V_{KOWC_2}=\left\{ u, v \right\} \\&E_{KOWC_2}=\left\{ (v, u, t_i)\vert i\in \mathbb {N} \wedge i\le 2 \wedge t_i<t_{i+1}\right\} \\\\ \text {incremental}\,\text {rule}:\,&KOWC(k+1)=(V_{KOWC_{k+1}}, E_{KOWC_{k+1}})\\&V_{KOWC_{k+1}}=V_{KOWC_{k}}\\&E_{KOWC_{k+1}}=E_{KOWC_{k}}\cup \left\{ (v, u, t_{k+1})\vert t_{k}<t_{k+1}\right\} \\ \end{aligned}\end{aligned}$$
(5)

In the k-one-way couple basic pattern, we find only two nodes u and v and two edges connecting them. However, differently from the ping-pong, both edges have the same source, that is node v, and the same destination, node u, as shown in Fig. 6. The incremental rule, which adds another edge from node v to node u to the pattern, can be used to understand to what extent the communication happens unilaterally.

This pattern, combined with the k-ping-pong, is used in our scenario to model a stalker/stalked behaviour. Indeed, all the edges model interactions happening in the same direction: from the stalker to the stalked. Differently from the k-ping-pong pattern, there is no interaction going back from the stalked to the stalker, and differently from the k-in-star, the sources of the interactions are not different nodes but the very same one (Fig. 6b).
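The combination of the two patterns can be sketched by listing the ordered pairs of users that form a one-way couple and never show an interaction in the opposite direction. The helper `one_way_couples` is hypothetical, and requiring the complete absence of replies is an assumption drawn from the discussion above rather than from Definition 5 itself. Edges are (source, destination, timestamp) tuples.

```python
from collections import Counter


def one_way_couples(edges):
    """Map each ordered pair (v, u) without any reply from u to its size k.

    Only distinct timestamps are counted, since the pattern requires
    strictly increasing times; pairs with fewer than two edges are dropped.
    """
    counts = Counter((s, d) for s, d, _ in set(edges) if s != d)
    return {
        pair: k
        for pair, k in counts.items()
        if k >= 2 and (pair[1], pair[0]) not in counts
    }


edges = [("v", "u", 1), ("v", "u", 2), ("v", "u", 3),
         ("a", "b", 1), ("b", "a", 2), ("a", "b", 3)]
couples = one_way_couples(edges)
```

Users a and b answer each other, so they are excluded, while v interacts three times with u without ever receiving a reply.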

Fig. 6 The k-one-way couple pattern. (a) shows the basic pattern, while (b) shows how the incremental rule increases the size of the pattern

5 The dataset

In this work, we study five specific communication patterns in Online Social Groups by evaluating a set of 17 Facebook groups divided into 5 categories [8, 18, 19, 33]. All the groups were chosen at random from the list of Facebook groups. We use a crawler because of the limitations of the Facebook API in retrieving information about groups. We registered an account on Facebook, and we joined these 17 closed groups after being accepted by the administration of each group. The HTTP crawler, which relies upon Selenium (Footnote 1) to automate browser actions, periodically retrieves the following set of information from a specified Facebook group:

  • Members We retrieved the list of the users participating in the group.

  • Interactions We collected the interactions that occurred between the members of the groups. The collected interactions include posts, comments, and replies, each of which has an associated timestamp.

In particular, our crawler application was able to collect the interactions of about 381,339 members belonging to the 17 Facebook groups. We classify each group into one of five categories, depending on the description of the group:

  • Education It consists of Facebook groups discussing topics related to school or university.

  • Sport It includes groups of users interested in popular sport activities such as football, tennis, or gym.

  • Work It contains groups of users focused on business, job search, and companies.

  • Entertainment It includes groups focused on media contents, such as film, music, or musicians.

  • News This category contains groups of users interested in news, debates, and political discussions.

To help readers assess the characteristics and the nature of the collected groups, Table 1 reports the real names, as well as the categories, of the groups investigated in this study.

Table 1 General description of the selected Facebook groups

5.1 Data statistics

Table 2 General description of the Facebook groups

Table 2 summarizes the main characteristics of each group by showing: the number of consecutive days on which our crawler collected the interactions (Days), the number of group members (Members), the dates of the first (min Date) and the last (max Date) post retrieved by our crawler, the total number of posts retrieved from the group in the monitored period (Posts), the number of members who authored at least one post (Authors), and the average number of posts published each day by group members (Post / day).

The collected data indicate that our crawler was able to collect the activities performed by members over a period longer than 365 days for Ed1, Ed3, S2, N3, W1, W2, and W4. Other groups, S3 in particular, have a high daily activity and, because the crawler became overloaded (memory consumption), for S3 we were able to collect only 28 days of activity. The number of members is very heterogeneous across groups: the maximum is 107,459 users for S3, while En3 has the fewest members, with only 2,324. The groups show a high activity level: the number of posts collected during the monitored period is higher than 4,000 for the majority of the groups. For instance, the group S3 has more than 6,000 published posts despite the fact that its monitored period lasts only 28 days. In general, a very low fraction of the group members (at most 10%) authored at least one post collected during the monitored period. However, the number of distinct authors varies considerably across groups, and this value is positively correlated with the total number of posts (Pearson correlation R = 0.64). The average number of posts per day, computed over all the groups, is equal to 8. However, this value depends on the social activity of the members, and it ranges between 1.914 posts (for group S2) and 226 posts (for group S3), while the groups of the Work category show the lowest activity level.
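As a quick consistency check, and assuming that Post / day is simply the number of posts divided by the number of monitored days (the text does not state the formula explicitly), the figures reported for S3 agree with each other:

$$\begin{aligned} \text{Post / day} = \frac{\text{Posts}}{\text{Days}} \quad \Rightarrow \quad \text{Posts}_{S3} \approx 226 \times 28 = 6328 > 6000 \end{aligned}$$

The result is approximate because the daily average of 226 is presumably rounded.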

Fig. 7 Number and types of the interactions

We investigate in more detail the amount and the types of interactions collected from each group by showing in Fig. 7 the total number of posts, comments, and replies made by members of the different groups. As the figure shows, the groups have different characteristics. The groups of the Education category have very similar numbers of posts, but different numbers of comments and replies. In general, the members of groups Ed1 and Ed3 interact mainly through replies, while group Ed2 exhibits a large portion of comment interactions. The groups of the Sport category also resemble one another, with comments making up the majority of the interactions. The groups of the Work category, instead, exhibit two different trends: groups W1 and W2 consist of few posts with a high number of comments and replies, while groups W3 and W4 consist of several posts with a small number of comments and replies. Finally, the groups of the News and Entertainment categories show a similar communication pattern, where members mainly interact by using comments, replies, and posts. As a final remark, we notice that, in most cases, the number of comments and replies is larger than the number of posts and that, in many cases, the number of comments is comparable to the number of replies.

6 Experimental results

In this section, we present experimental results on the detection of the Incremental Communication Patterns defined in Sect. 4 in the scenario defined in Sect. 3. Experiments were carried out by considering the interactions among users belonging to the 17 Facebook groups presented in Sect. 5. The interactions were sorted by their time label, which represents the time when the interaction was received by the Facebook servers. Having the timestamp of each interaction is useful to capture the evolution of the interactions in the groups. The sorted interactions are then used as input for a naive algorithm which detects the five Incremental Communication Patterns, i.e. the k-chain, the k-in-star, the k-out-star, the k-ping-pong, and the k-one-way couple. To be clear, our goal is not to propose an efficient algorithm for the discovery of the Incremental Communication Patterns; rather, we are interested in studying the patterns in the scenario of OSGs, where the communication between people has not yet been studied in depth.

An essential property of our approach is that we do not search for fixed patterns, but for the largest pattern that appears. This is crucial to understand the properties of a communication pattern in OSGs. In practice, our approach works as follows. We start from the first interaction of the stream and try to build a basic pattern. Once a basic pattern is found, more edges are added to the instance of the pattern by recursively applying the incremental rule. The process stops when the next interactions fall outside the given time window; the largest pattern is then returned, or no pattern is returned if not even the basic pattern was detected. Once the detection process is finished for the patterns starting with the first interaction, we repeat the process starting from the second interaction, and so on, until no more interactions can be found. In this study, we decided to filter out the instances that match only the basic pattern (except for the k-ping-pong). This decision was driven by the fact that, in the particular scenario in which we apply the concept of Incremental Communication Pattern, very short instances of some patterns appear too frequently to be considered relevant communication patterns.
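The procedure above can be sketched as follows. This is not the algorithm used in the experiments, only a naive illustration under stated assumptions: interactions are (source, destination, timestamp) triples, the time window is measured from the first interaction of the pattern, the size k equals the number of edges, and the incremental rule shown is the one of the k-one-way couple. All names (`Edge`, `Rule`, `one_way_rule`, `largest_from`, `detect`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation of an interaction: (source, destination, timestamp).
Edge = Tuple[str, str, float]
Rule = Callable[[List[Edge], Edge], bool]


def one_way_rule(pattern: List[Edge], edge: Edge) -> bool:
    """Incremental rule of the k-one-way couple: same direction, later time."""
    src, dst, t = pattern[-1]
    return (edge[0], edge[1]) == (src, dst) and edge[2] > t


def largest_from(stream: List[Edge], start: int, window: float, rule: Rule) -> List[Edge]:
    """Grow a pattern from stream[start] until the time window is exceeded."""
    pattern = [stream[start]]
    t0 = stream[start][2]
    for edge in stream[start + 1:]:
        if edge[2] - t0 > window:  # the next interactions fall outside the window
            break
        if rule(pattern, edge):    # apply the incremental rule
            pattern.append(edge)
    return pattern


def detect(stream: List[Edge], window: float, rule: Rule, min_size: int = 3) -> Dict[int, int]:
    """Map each start index (in time order) to the size of the largest pattern found."""
    ordered = sorted(stream, key=lambda e: e[2])
    sizes: Dict[int, int] = {}
    for i in range(len(ordered)):
        k = len(largest_from(ordered, i, window, rule))
        if k >= min_size:          # drop instances that only match the basic pattern
            sizes[i] = k
    return sizes
```

With `min_size=3`, instances made of only two edges are discarded, mirroring the filter on basic patterns; passing `min_size=2` keeps them. Other patterns would need their own rule function and, possibly, a different notion of size.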

As we saw in Sect. 5, the activity of the 17 groups is not homogeneous, which made it difficult to define a single time window length that was significant for all the groups. Indeed, a good time window length for an average group may be too short for groups with low activity and too long for groups with a lot of activity. Therefore, we decided to identify different time windows for each group by studying the distribution of the interaction inter-arrival time (i.e. the amount of time passing between two successive interactions in the whole group activity). Table 3 shows the results obtained by evaluating the inter-arrival time. In group S3, where interactions are most frequent, the average inter-arrival time is 31 s, while in W4 it is 2251 s. This explains why we cannot fix a single time window for all groups.

Table 3 Statistical measures and quartiles for each group of the interaction inter-arrival time

Since there was no way to define a single time window for all groups a priori, we decided to study the interaction inter-arrival times in more detail. Figure 8 shows the CDF (Cumulative Distribution Function) of the interaction inter-arrival times for all the groups in the dataset, divided by category: 8a for the Education groups, 8b for the Sport groups, 8c for the Work groups, 8e for the News groups, and 8d for the Entertainment groups. Interestingly, all the CDFs have a similar shape and differ only in being shifted with respect to each other. S2, W3, W4, and N3 are the only groups with a noticeably different, smoother shape, in which the inter-arrival times are spread over a much wider range of values. These differences are caused by a general inactivity of the users in these groups, as confirmed by Table 3, where we can see that these 4 groups have the highest average and median interaction inter-arrival times. In particular, the groups of the Education and Sport categories have a very low median inter-arrival time, while the groups of the Work category have the largest median inter-arrival time. To take this wide range of activity into account, we decided to fix different time windows for each group according to its interaction inter-arrival times. In order to capture interaction patterns happening over short, medium, and long time spans, we decided to fix three different time windows for each group. The three time windows are set equal to the first three quartiles of the interaction inter-arrival time distribution of each group. Table 3 shows the numerical values chosen for each time window. Again, most groups show a very similar trend: the ratio between the second and the first quartile and the ratio between the third and the second quartile are almost always between 2.5 and 3.5. The only groups not showing this characteristic are S2, W3, W4, and N3, which have been observed to be the ones with the lowest activity.
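The choice of the three time windows can be illustrated with a short sketch. It assumes that the inter-arrival times are the differences between consecutive timestamps of a group and that quartiles are computed with the inclusive interpolation method of Python's `statistics.quantiles`; the paper does not specify which quantile estimator was used, and `time_windows` is a hypothetical name.

```python
from statistics import quantiles
from typing import List, Tuple


def time_windows(timestamps: List[float]) -> Tuple[float, float, float]:
    """Return (Q1, Q2, Q3) of the inter-arrival times of one group."""
    ordered = sorted(timestamps)
    # Inter-arrival time: gap between two successive interactions.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # n=4 yields the three cut points that split the data into quartiles.
    q1, q2, q3 = quantiles(gaps, n=4, method="inclusive")
    return q1, q2, q3
```

The three returned values would then serve as the short, medium, and long window lengths for that group, and the ratios q2 / q1 and q3 / q2 correspond to the quartile ratios discussed above.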

Fig. 8 CDF of the time passed between subsequent interactions

6.1 The recurring largest size evaluation

The first analysis concerns the most frequent maximal size of the Incremental Communication Patterns, which lets us understand the typical communication pattern length, that is, the k value of the pattern returned. We expect to find the vast majority of patterns at low values of k (from 2 to 5), but we also expect to find larger patterns (\(k>5\)).

Fig. 9 The number of Incremental Communication Patterns for each value of k, up to 21

Figure 9 depicts the number of Incremental Communication Patterns detected, for pattern sizes k up to 21, by using as time window length the first, the second, and the third quartile of the distribution of interaction inter-arrival times, summarized in Table 3. In general, we can observe that increasing the size of the time window has a significant effect on the total number of patterns identified in the groups. In particular, for a fixed value of k, a larger number of patterns is identified as the size of the time window increases, as expected. Intuitively, this happens because the interactions can occur over a larger time span and still be included in the same pattern. Moreover, as expected, the most frequent maximal length of the patterns is 3, regardless of the time window, which tells us that the most frequent patterns of communication happen among a very small number of nodes. However, the plot also shows that there is a relevant number of patterns with non-standard length which are worth investigating, even in the shortest time window (the darkest one in the plot). In fact, although they are much less common, we are able to detect patterns with 10 or more interactions, confirming our idea that one cannot restrict the search to standard, small patterns.

Fig. 10 Most common size of the Incremental Communication Patterns studied in the groups

Figure 10 shows the distribution of the number of largest Incremental Communication Patterns in the different group categories with k ranging from 2 to 21, one plot per pattern. The results are very interesting and partially meet our expectations. In general, we see that the most common value of k is 3 in all categories for all patterns, except for the k-ping-pong, for which the most common k is 2, directly followed by 4 (we recall that the k-ping-pong pattern is defined only for even k). In other words, the most common value of the size k always corresponds to the shortest pattern instances we consider: 2 for the k-ping-pong and 3 for all the other patterns.

Examining the plots one by one in more detail, we observe different behaviours for different patterns. Indeed, for two patterns, namely the k-chain (Fig. 10a) and the k-ping-pong (Fig. 10b), we see that the largest k observed is 7 for the former and 4 for the latter. The situation is quite different if we consider the other three patterns: k-in-star (Fig. 10c), k-out-star (Fig. 10d), and k-one-way couple (Fig. 10e). For these patterns, we observe tens of pattern instances with size equal to 10 in almost all categories and, even more interestingly, in some cases the longest patterns exceed 20 in length.

Finally, we also observe that each category shows a distinctive number of patterns for each value of k. For example, concerning the k-in-stars (Fig. 10c), we see that the largest patterns for groups in the Entertainment and Education categories have \(k=13\) and \(k=15\), respectively, while for the Sport category we have \(k=21\) and for the News category we count more than 1000 patterns with \(k=21\). A rather different situation can be seen for the k-out-star pattern (Fig. 10d). Here, we observe that the largest patterns in the Sport and News categories have \(k=8\) and \(k=13\), respectively, while the Education and Work categories reach \(k=18\) and \(k=20\), respectively.

Overall, this first analysis suggests that some communication patterns do not develop much in length, while others show a much more complex structure that is worth deeper investigation. We also observed that different categories show different numbers of patterns; therefore, to have a more accurate view, we also plan to carry out further analyses at group granularity, rather than category granularity.

6.2 The absolute largest size evaluation

Fig. 11 Largest k for each motif in each group varying the length of the time window

The main reason we introduced the Incremental Communication Patterns was to overcome the limit on the number of actors and on the number of interactions between them, which is typical of motifs. To show the importance of removing this limitation, we study the absolute maximal size of the patterns reached in each group separately.

Figure 11 shows, for each pattern, the maximum size k discovered in the groups by using different temporal window lengths, namely the first (11a), the second (11b), and the third (11c) quartile of the interaction inter-arrival distributions (Fig. 11d). As we can see from the figure, the size of the time window strongly affects the absolute maximal size of the patterns. Indeed, considering the results obtained with the shortest temporal window in Fig. 11a, we see that the most common largest patterns have \(k=4\) and \(k=5\). We also notice a few outliers: the largest k-out-star in group W1 with \(k=8\), the largest k-out-star in group W3 with \(k=10\), and the largest k-in-star in group N1 with \(k=20\). It is also worth noticing that the k-ping-pong is a very specific pattern, and some groups (W3, W4, and N1) do not show any instance of it with the shortest window length.

Moving to the second time window length (results in Fig. 11b), we see an overall increase in the size of the largest patterns, as expected. The increase is much more pronounced for the k-one-way couple: in many groups, the size of the largest k-one-way couple doubles with respect to the previous window length. The absolute largest pattern is still a k-in-star in group N1, whose size roughly triples with respect to the shortest time window length, reaching \(k=62\); the second largest is a k-one-way couple in group W4 with \(k=16\).

If we analyse the results for the third and longest time window (Fig. 11c), we again see an increase in the size of the largest patterns, especially for the k-one-way couple and the k-in-star. Concerning the k-one-way couple pattern, we observe two opposite behaviours: in some groups the maximal size is almost the same as in the previous time window (S2, S3, W3, W4, En1, and En2), while in others the size increases noticeably (Ed1, S1, W1, W2, and En3). Considering the k-in-star, we observe that the largest size grows by a factor of almost two to three with respect to the previous time window length in all groups, with only a few exceptions: Ed1, En3, En4, and N2. The absolute largest pattern is still a k-in-star in the N1 group, reaching the remarkable size of \(k=170\); other large patterns are the k-one-way couple in group W1, the k-in-star in S2, and the k-out-star in W3. An interesting result, anticipated in Sect. 6.1, is that in two groups of the Work category, namely W3 and W4, there is no k-ping-pong at all, while in all other groups we observe only small values of k.

6.3 Groups correlation

Fig. 12 Correlation matrix of the groups based on the absolute largest size of all the Incremental Communication Patterns and for all the quartiles

Lastly, we decided to investigate the dependencies existing among the Incremental Communication Patterns of different groups by exploiting the Pearson correlation coefficient. We show the results in a correlation matrix, where rows and columns represent the groups and the color of each cell denotes the correlation between the corresponding groups: white, light grey, and dark grey indicate negative correlation, no correlation, and positive correlation, respectively.
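A correlation matrix of this kind can be computed as in the following sketch. It assumes that each group is described by a numeric feature vector (for example, the largest k for every pattern and every time window); the function names and the way features are assembled are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if one of the vectors is constant.
    return cov / (dev_x * dev_y)


def correlation_matrix(features: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlation between the feature vectors of the groups."""
    return {
        g: {h: pearson(features[g], features[h]) for h in features}
        for g in features
    }
```

Each cell of the resulting nested dictionary holds a value in [-1, 1], which can then be mapped to the white, light grey, and dark grey shades described above.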

Figure 12 shows the correlation matrix between groups by considering the absolute pattern size k for all the patterns and for all the quartiles. The matrix does not show very high correlation values, meaning that the groups form a heterogeneous environment in terms of maximal pattern lengths. We can, however, roughly identify two clusters of groups. The first one is made of the groups Ed1, S1, En3, and N2; the second one is made of the groups Ed3, S3, W2, and En2. The group N1 seems to have very low correlation values with the other groups, showing a rather unique combination of maximal pattern lengths.

Fig. 13 Correlation matrix of the groups based on the number of Incremental Communication Patterns found in each group

Seeing that there is no clear correlation in the maximal size of the Incremental Communication Patterns, we decided to study whether there is a correlation in the pattern counts of the different groups (Fig. 13). Interestingly, the matrix clearly indicates the presence of two clusters of positive correlation. The first one spans a large set of groups, consisting of the groups of the Education and News categories and the groups S1, S3, W1, W2, and En1. In the second cluster, we find the groups S2, W4, and En1. It is worth noticing that, while in the previous case there was no clear distinction, here we can easily identify two distinct clusters, meaning that groups can be characterized by the number of Incremental Communication Patterns detected. A very similar result is obtained if we restrict the correlation on the number of patterns to the values of k which are available for all groups, as we can see in Fig. 14. The only big difference here is that the group S2 does not seem to have a high correlation with groups W4 and En2.

Fig. 14 Correlation matrix among groups which is based on the aggregated number of patterns, for all values of k, for all sizes of the temporal windows, and for all the types of patterns

Although some of the groups show a similar number of patterns, or patterns of similar sizes, the correlations reveal clusters of groups with heterogeneous characteristics. In fact, we see that groups belonging to different categories and with different activity levels are clustered together and, at the same time, groups with similar activity belong to different clusters. This result confirms that the groups are highly heterogeneous, not only considering the number of users and their activity, but also considering the communication patterns that one can observe in them.

7 Conclusions and future works

In this paper, we defined and studied Incremental Communication Patterns to identify previously undetected communication structures in OSGs. In detail, we proposed and formalized the concept of Incremental Communication Pattern, together with the concept of size of the pattern, to study the communication structures appearing in fixed-length time windows. We proposed five Incremental Communication Patterns, which identify specific communication structures in OSGs by means of a basic pattern, which identifies the initial meaningful structure, and an incremental rule, which is applied to the basic pattern to add more interactions to it. We studied them by exploiting the interactions among users of a real Facebook dataset consisting of 17 Facebook groups belonging to five different categories. The detection is also guided by a limit on the interactions that can be part of a pattern, expressed in terms of the time passed since the first interaction of the pattern. Results show that, beyond simple patterns which involve only a fixed number of interactions and users, a relevant set of large patterns having a non-trivial number of components can be recognized. In particular, some real Facebook groups show very complex communication patterns which engage up to 170 members of the group. Out of the five Incremental Communication Patterns defined in this paper, we also see that only some are prone to show a complex structure which was, up to now, undetected because of the limitations of the motifs proposed in the literature. As future work, we plan to propose other specific Incremental Communication Patterns and to study them in the same context using more Facebook groups, taking their popularity into account. Moreover, we plan to investigate a more accurate mathematical model which can be applied to slice time more consistently, which takes into account the density of the interactions, and which provides a variable-length time window.