Concept-based event identification from social streams using evolving social graph sequences

  • Yi-Shin Chen
  • Yi-Cheng Peng
  • Jheng-He Liang
  • Elvis Saravia
  • Fernando Calderon
  • Chung-Hao Chang
  • Ya-Ting Chuang
  • Tzu-Lung Chen
  • Elizabeth Kwan
Original Article
Part of the following topical collections:
  1. Social Network Analysis and Information Systems

Abstract

Social networks, which have become extremely popular in the twenty-first century, contain a tremendous amount of user-generated content about real-world events. This user-generated content relays real-world events as they happen, and sometimes even ahead of the newswire. The goal of this work is to identify events from social streams. The proposed model utilizes sliding window-based statistical techniques to extract event candidates from social streams. Subsequently, the “Concept-based evolving graph sequences” approach is employed to verify the information propagation trends of event candidates and to identify those events. The experimental results show the usefulness of our approach in identifying real-world events in social streams.

Keywords

Event identification · Concept-based evolving graph sequences · Social networks

1 Introduction

In recent years, social networking has become a fast and accessible communication tool. Sites such as Twitter and Facebook have transformed the way we access information. In the past, information was filtered down to users from mass media, largely television, radio, and the print press. Nowadays, with the support of high-speed Internet and the growth of mobile devices, users not only act as content consumers, but also as content producers.

Social network data, which are often referred to as social streams, offer a constant flow of information, ranging from status updates and opinions to newsworthy events. Information available from social streams can typically reflect real-world events as they happen, sometimes even ahead of the newswire (Kwak et al. 2010). Take the news of Osama bin Laden’s death as an example: the first person to post about the bin Laden raid was a neighbor who complained on Twitter about the noise next door (Bell 2011). Although there is speculation concerning where the news first appeared, it cannot be denied that social media played a huge role in spreading the news. By the time official news sources, such as CNN or The New York Times, confirmed that US Navy SEALs had killed bin Laden, millions had already taken to their Twitter and Facebook pages to spread the information; in this sense, the news had gone viral before official news sources had published the story.

This example illustrates that the world can benefit from social stream event identification from various perspectives.
  • Emergency control: Given an emergency incident, event detection from social networks could provide information faster than media reporters. This could enable governments or responsible entities to respond more quickly and prioritize resource distribution.

  • Crowd opinion analysis: How the crowd feels about a topic has long been of central interest to businesses. Whether it is a product, a candidate, or a subject, analyzing collective opinions can assist organizations in revising their strategies.

  • Unreported events detection: It is quite possible that the mainstream media could overlook certain events, intentionally or unintentionally. Taking the Jasmine Revolution and the Sunflower Student Movement as examples, updates on these two activities usually spread in social networks ahead of mainstream media.

Social scientists have long recognized the importance of social networks in the spread of information (Granovetter 1973; Bakshy et al. 2012). With the current widespread availability of technology and massive online social network data, properly mined events in social streams will not only help us identify important events, but also greatly facilitate various studies in the field of social network analysis. However, as the popularity of social networking grows, the amount of information available has swollen into a roaring deluge. Take Twitter for example: with over 140 million active users as of 2012 (Twitter 2012), Twitter generates over 340 million tweets daily (Wasserman 2012), which contain not only useful information, but also a large amount of uninformative messages (Naaman et al. 2010). This introduces a lot of noise into the system.

Numerous studies have attempted to define events (Allan 2002; Zacks and Tversky 2001; Becker et al. 2011). For example, Sayyadi et al. (2009) tried to capture keyword communities to represent events from blog documents and utilized these communities to identify events from Twitter data. Becker et al. (2011) defined an event, specifically for event identification on Twitter, as a real-world occurrence e with (1) an associated time period \(t_{\mathrm{e}}\) and (2) a time-ordered stream of Twitter messages \(M_{\mathrm{e}},\) of substantial volume, discussing the occurrence and published during time \(t_{\mathrm{e}}.\)

Social relationships are another important factor to consider. Social networks allow users to propagate information via the Internet. It is reasonable to think that an important piece of information is identified as an event if it is delivered by many users and gains massive attention. In our previous work (Kwan et al. 2013), we utilized the characteristics of information propagation in social networks to identify two types of events from social streams. Yet, in the context of that study, an event could only be represented by a single keyword. For example, the event “Japan Tsunami” contains two keywords, which would have to be processed separately as two single-keyword events. Having to repeat the process for each keyword increases the time and space complexity.

The main contributions of our work are summarized below:
  • We propose a framework to identify events according to three dimensions: message, time, and social structure.

  • We represent events by event concepts, i.e., the most essential keywords related to the event. Our method identifies event concepts using a sliding window-based keyword filtering mechanism that extracts important words and generates event candidates based on the message and time dimensions.

  • We use the social structure dimension and adapt a sequence of graphs called “concept-based evolving graph sequences” (cEGS) to represent information propagation for one message. To extract valuable events out of cEGS, several measurements are introduced to analyze the graphs of cEGS. These measurements classify cEGS into three different types of events: one-shot events, long-run events, and non-events. The cEGS analysis phase also helps to reduce the influence of biased information on the obtained events. This is explained in detail in Sect. 4.2.3.

  • We perform an experimental evaluation using a dataset of 280 million tweets to demonstrate that our proposed technique performs more efficiently than previously proposed methods. The extracted events are evaluated by 7 participants and achieve 86 % precision. Additionally, we provide an extended evaluation of event correctness using Amazon Mechanical Turk (AMT).

The remainder of this paper is organized as follows. Section 2 is the literature review covering the related work. Section 3 explains the methodology used in this work. Section 4 presents the major experimental results. Finally, Sect. 5 is the conclusion.

2 Related work

Studies of event identification have been conducted for many years. Some statistics-based methods, such as the topic model (Blei and Lafferty 2006), text classification (Kumaran and Allan 2004), and analysis of temporal distribution (Fung et al. 2005), have been developed to observe topic trends in historical news and journal data.

As social network sites become more crucial in terms of reflecting public attention toward events, numerous studies (Sayyadi et al. 2009; Popescu and Pennacchiotti 2010; Sankaranarayanan et al. 2009; Weng and Lee 2011; Valkanas and Gunopulos 2013; Dou et al. 2012) have demonstrated that careful mining of tweets can be used to identify real-world events. Petrović et al. (2013) also examined the difference between Twitter-reported news and traditional newswire, indicating that Twitter has better coverage of minor events ignored by mainstream media.

Accordingly, there are three popular dimensions that are used to identify events from Twitter data. They are as follows:
  • Content: Some studies have classified content by detecting co-occurrence relationships between keywords (Sayyadi et al. 2009) or by training classifiers for specific events (Popescu and Pennacchiotti 2010; Sankaranarayanan et al. 2009). In order to achieve the best results, these proposed methods require large-scale datasets for relationship construction or model training. Other studies have monitored unusual frequency peaks for keywords in a specific time period (Weng and Lee 2011; Mathioudakis and Koudas 2010), but have ignored the semantic relationships between individual keywords. Sakaki et al. (2010) devised a classifier to detect earthquake events and extract events by detecting peaks in the number of tweets that mention target events. Their work utilizes both co-occurrence relationships and unusual frequency peaks to identify events, but only focuses on specific types of events, making it difficult to generalize. In this work, we aim to implement both approaches and detect general events as well as target events. We achieve this by not only using the content found in individual users’ feeds but also considering the content in their social circles.

  • Time: Recently, many studies have employed temporal features to identify bursty events on social networks. Alvanaki et al. (2011, 2012) employed time-sliding windows to compute statistics between tags and detected emergent events based on unusual shifts in tag correlations. Moreover, Diao et al. (2012) considered both the temporal information of microblog posts and user interests in a Poisson-based state machine model to identify bursty periods. In addition to utilizing a statistics-based method, Guzman and Poblete (2013) proposed a windowing variation technique to rank bursty keywords according to their importance. Like these previous works, our approach also adopts time as an important dimension to identify events. Compared to our approach, these works can accurately identify bursty or hot topics but have a low recall when identifying more general topics, as can be observed in the evaluation section. Being able to detect general topics is a desirable characteristic in event detection because some topics might be discussed more often than others (e.g., sports). This does not mean that these types of topics are more significant than less-discussed topics; it merely indicates that a bigger community is discussing them.

  • Social structure: Further studies have taken other social network features into account. Specifically, Becker et al. (2011) employed user interactions (e.g., retweets, replies) during event detection. In addition, Cataldi et al. (2010) utilized the number of followers to measure the importance of each user. Du et al. (2011) took user relationships into consideration and measured the influence of each event according to mutual attention between users. Hong et al. (2012) incorporated geographical locations in event discovery to see how messages differ among users regarding spatial and linguistic features. Gottron et al. (2013) employed the retweeting feature as an indication of information propagation to find the time influence on Twitter. Seo et al. (2012) identified rumors on Twitter using social structures; in this type of study, only very specific events are considered, and the method requires a sufficient number of monitor nodes. Ma et al. (2012) used conversational tweets and repeated tweets as training data to build an Adaptive Link-IPLSAL, a topic model, for event analysis. However, the event topics were limited and extracted from Google Trends keywords. Meanwhile, Kwan et al. (2013) monitored the social structure and time dimensions to grasp events. Nevertheless, the efficiency of processing the huge Twitter graph was low, and keywords referring to one event were extracted separately. Beyond focusing on social structure, information propagation analysis is another practical approach to examine how messages diffuse through the social network. Broecheler et al. (2010) proposed an algorithm for automatically modeling competitive diffusion from historical data, and this work is scalable enough to be applied to large social networks. Moreover, Pratt et al. (2013) utilized visual analytics and concept mapping to explore conflicts in social data streams to detect political crises. Finally, Shakarian et al. (2013) provided a diffusion model to represent user behaviors and how they diffuse in the social network. As part of our main contribution, we also consider social structure and adapt evolving social graphs in our framework for analyzing and detecting different categories of events.

Our focus is therefore to automatically identify events and evaluate event importance by considering content, time, and social structure simultaneously. We identify event types based on the topological analysis of the social graph.
Fig. 1

Framework

3 System design

3.1 Framework

Figure 1 illustrates the framework of this work. The objective of this work is to identify an event set E which occurred during the time period \(t_i\) to \(t_{j}\) from the Twitter dataset \(Tw = \{Tw_1,\ldots ,Tw_n\},\) where each \(Tw_i\) is a tweet posted during the time period \(t_i\) to \(t_{j}.\)

The framework consists of five major components: data preprocessing, keyword selection, concept-based event candidate recognition, evolving social graph analysis, and event identification. Our main contribution is the construction of a cEGS for every concept event candidate in the dataset. As reflected by its name, a cEGS is a sequence of graphs for each event candidate. It is computationally infeasible to construct an evolving graph sequence for every keyword in the dataset, so the data preprocessing and keyword selection components are applied to select candidate keywords that have a high possibility of representing an event. Subsequently, we use a concept-based event candidate recognition component to collect keywords that belong to the same event. Finally, by utilizing social network relationships and considering the information propagation of each event, we apply a set of criteria to identify and rank events from the given candidates.

3.2 Data preprocessing

Although Twitter supports multi-language environments for different users around the world, our work is limited to English. Hence, the first process is to remove non-English tweets using an open source language tool (Shuyo and Nakatani 2010). Next, tweets go through part-of-speech (POS) tagging. The purpose of POS tagging is to support the next step: lemmatization. Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analyzed as a single item. As an example, verbs in third-person singular form will be transformed back into their base forms. Because lemmatization requires understanding the context of a sentence, POS tagging first classifies words into different categories based on sentence structure. Finally, we apply filtering. Any words not containing alphabetic characters or consisting of only punctuation marks are removed. Some special characters or strings, such as “RT” (retweeting), “@” (tagging users), and “#” (hashtag header), are also removed from the dataset.
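As a concrete illustration, the final filtering step can be sketched in Python. The language-detection, POS tagging, and lemmatization stages are omitted here, since they rely on external tools (e.g., NLTK or spaCy); the function name and tokenization are our own illustrative assumptions, not part of the original system.

```python
import re

def preprocess(tweets):
    """Token-level filtering sketch: drop retweet markers, user mentions,
    hashtag headers, and tokens without alphabetic characters."""
    cleaned = []
    for tweet in tweets:
        tokens = []
        for tok in tweet.split():
            # Drop "RT" (retweeting), "@..." (tagging users), "#..." (hashtags).
            if tok == "RT" or tok.startswith("@") or tok.startswith("#"):
                continue
            # Drop tokens containing no alphabetic characters (numbers, punctuation).
            if not re.search(r"[A-Za-z]", tok):
                continue
            tokens.append(tok.lower())
        cleaned.append(tokens)
    return cleaned
```

For instance, `preprocess(["RT @bob Did anyone feel that earthquake #quake 123"])` keeps only the plain English words of the message.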

3.3 Keyword selection

Keywords are the best representation to summarize a tweet. For example, the message “Did anyone feel that earthquake?!” can be represented by the keyword “earthquake” as the main idea. Therefore, a keyword extraction algorithm is applied to each tweet to retrieve a set of keywords \(K_i\) that best represents tweet \(Tw_i.\)

Definition 1

(Keyword) Let K denote the keyword set extracted from tweets, where \(K_i=\{k_{i1}, \ldots , k_{ij} \}\) is a set of keywords that best represent the tweet \(Tw_i.\) The most representative keywords should satisfy the following criterion: being well-noticed.

3.3.1 Well-noticed criterion

Assuming that an important event appears during the time period from \(t_i\) to \(t_{j},\) this event should have received adequate attention during this period. Hence, the keywords’ occurrence counts should be noticeably high. Moreover, there will be a sudden increasing trend for these keywords. In our previous study (Kwan et al. 2013), we utilized the difference of peak frequency to select monitoring keyword candidates. However, with such an approach it is difficult to adjust the related parameters for different event scales, e.g., local events, world-class events, or life activities. To address this, we propose a sliding window-based statistical approach at this stage to extract different types of event candidates.

Definition 2

(Word Occurrence and Co-occurrence) Let i denote a time frame, which is a unit of time period, from \(t_i\) to the beginning of \(t_{i+1}.\) \(\kappa _{i}(w)\) denotes the number of occurrences of word w within the time frame i. \(c\kappa _{i}(w_{\mathrm{a}}, w_{\mathrm{b}})\) denotes the number of co-occurrences of words \(w_{\mathrm{a}}\) and \(w_{\mathrm{b}}\) within the time frame i.

Definition 3

(Sliding Window-based Statistics) The sliding window size s represents the number of time frames under consideration. By changing the sliding window size, the extracted event candidates can differ. Let \(AVG_{j,s}(w_i)\) denote the average count of word \(w_i\) in the past sliding window, consisting of s time frames from time frame \(j-s\) to \(j-1.\) \(SD_{j,s}(w_i)\) denotes the standard deviation of word occurrences for word \(w_i\) in the past sliding window from time frame \(j-s\) to \(j-1.\)

Definition 4

(Keyword Category) The keywords can be categorized based on their corresponding statistics. Let \(Category =\{category_{1}, \ldots , category_{n} \}\) be a set of categories. Each category consists of keywords that satisfy the corresponding statistical criterion.

For example, assume we have a category named Q3 that consists of high-frequency keywords, where the summation of consecutive word occurrences in the past sliding window must be higher than that of 75 % of other keywords. Through the design of keyword categories, the proposed framework can extract different types of keywords based on their statistics.

Definition 5

(Significant Probability) Adapting the concept of standard normal probabilities, let \(Z_{j,s}(w_i)\) denote the Z-score of word \(w_i\) at time frame j over the past sliding window of size s, as formalized in Eq. 1.
$$\begin{aligned} Z_{j,\,s}(w_i) = \frac{\kappa _{j}(w_i)-AVG_{j,\,s}(w_i) }{ SD_{j,\,s}(w_i)} \end{aligned}$$
(1)

By checking the Z cumulative table, we can obtain the significant probability value \(Prob (Z \le Z_{j,s}(w_i) )\) that the word occurrence in the past sliding window is less than or equal to the occurrence of current time frame.

If the significant probability of word \(w_i\) is large enough, the word exhibits a bursting trend.

Lemma 1

If \(Prob (Z \le Z_{j,s}(w_i) ) > \delta _{\mathrm{category}}\), then word \(w_i\) will be classified as a keyword in time frame j. \(\delta _{\mathrm{category}}\) is the threshold for determining keywords for the corresponding category.
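The well-noticed criterion above can be sketched as follows. The normal-CDF approximation via `math.erf` and the threshold value δ = 0.95 are our own illustrative assumptions; the paper only specifies the Z-score of Eq. 1 and a per-category threshold.

```python
import math
from statistics import mean, stdev

def z_score(counts, j, s):
    """Z-score of word occurrences at time frame j against the previous
    s frames (Eq. 1). `counts` maps frame index -> occurrence count."""
    window = [counts.get(t, 0) for t in range(j - s, j)]
    avg, sd = mean(window), stdev(window)
    if sd == 0:
        return 0.0
    return (counts.get(j, 0) - avg) / sd

def is_bursting(counts, j, s, delta=0.95):
    """Classify the word as a keyword in frame j when
    Prob(Z <= Z_{j,s}) exceeds the category threshold delta."""
    z = z_score(counts, j, s)
    prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return prob > delta
```

A word whose count jumps from a stable baseline (e.g., 2–3 mentions per frame) to 20 mentions yields a large Z-score and is flagged as bursting, while a flat series is not.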

3.4 Event candidate recognition

In the previous section, keywords were selected from the tweet set. In this section, these keywords are grouped into event candidates. Ideally, keywords are correlated based on their co-occurrence in each tweet. By utilizing co-occurrence relationships, the related keywords for each event should be easily extracted and identified. However, such a naive approach cannot be adopted in reality, since keywords are likely linked together through other intermediate keywords. This would result in a huge set of keywords being identified as a single event.

To address this issue, Sayyadi et al. (2009) detected keyword communities as events based on KeyGraphs (Ohsawa et al. 1998). In their study, keywords are considered as vertices, and edges between keyword pairs are constructed once the co-occurrence probabilities of keyword pairs in the same documents are higher than a predefined threshold. Betweenness centrality scores, i.e., the number of shortest paths between all pairs of vertices that pass through a given edge, are then computed, and the highest-ranking edges are gradually removed from the graph. By iteratively removing edges with high betweenness centrality scores, the remaining connected keywords are considered the summary of the event.

In this keyword community detection technique, the edges are undirected and unweighted. Once the edges are constructed, the strengths between keyword pairs are not considered during the community detection process. However, we believe the strengths between keyword pairs are too important to be disregarded while clustering the keywords. To take edge weights into account, we propose a TextRank-based approach to cluster keywords. This approach utilizes the co-occurrence and frequency of keywords to generate hierarchical concept graphs, which are similar to the Conceptual Maps proposed by Pratt et al. (2013) but easier to implement.

Definition 6

(Event Graph) An event graph G(t) is a directed graph for the time frame t, where the vertex set V represents keywords and the edge set \(E = \{ e(w_i, w_j) | w_i , w_j \in K \}\) denotes the co-occurrence relationship between keyword pairs in the same tweet. Let \(\omega _{e(w_i,w_j)}\) denote the weight of directed edge \(e(w_i,w_j),\) where \(\omega _{e(w_i,w_j)} = \frac{c\kappa (w_i,w_j)}{\kappa _{t}(w_i)}\) for time frame t. The weight of a directed edge represents the fraction of occurrences of the source word \(w_i\) that appear together with the target word \(w_j.\)

Before grouping keywords into event candidates, keywords are ranked for significance. Since the keywords in the event graph are all meaningful and well-noticed, the significance degrees of the keywords indicate the importance of the event candidates. In this paper, we adapt the TextRank technique (Mihalcea and Tarau 2004) to rank these keywords. The vertex weights of the event graph G are defined as follows:

Definition 7

(Vertex Weights) Let \(w (\upsilon _i)\) denote the vertex weights in the event graph G.
$$\begin{aligned} w (\upsilon _i) = (1-d) + d* \sum \limits _{w_j \in in(w_i)}\frac{\omega _{e(w_j,w_i)}}{\sum \limits _{w_k \in out(w_j)} \omega _{e(w_j,w_k)}} w (\upsilon _j), \end{aligned}$$
where \(in(w_i)\) is the set of keywords that point to \(w_i,\) \(out(w_j)\) is the set of keywords to which \(w_j\) points, and d is a damping factor between 0 and 1.

Similar to the damping factor in PageRank, a lower damping factor weakens the long-range relationships between words and increases the relative contribution of a word’s immediate ancestors.
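A minimal sketch of the weighted vertex-ranking iteration of Definition 7 is shown below. The damping factor d = 0.85 follows the PageRank convention and the fixed iteration count is an implementation choice of ours; the paper does not prescribe either value, and all names are illustrative.

```python
def textrank(edges, d=0.85, iters=50):
    """Weighted TextRank over an event graph.

    `edges` maps a directed pair (source, target) -> edge weight ω.
    Returns a dict of vertex weights w(v) per Definition 7."""
    nodes = {w for pair in edges for w in pair}
    out_weight = {n: 0.0 for n in nodes}   # Σ_k ω(w_j, w_k) per source
    incoming = {n: [] for n in nodes}      # (source, ω) pairs per target
    for (src, dst), w in edges.items():
        out_weight[src] += w
        incoming[dst].append((src, w))
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Each in-neighbor contributes its score, scaled by the
            # fraction of its outgoing weight that points to n.
            s = sum(w / out_weight[src] * score[src]
                    for src, w in incoming[n] if out_weight[src] > 0)
            new[n] = (1 - d) + d * s
        score = new
    return score
```

A vertex with no incoming edges converges to 1 − d, while heavily referenced keywords accumulate larger weights.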

After ranking the keywords, Algorithm 1 is employed to identify event candidates.

3.5 Evolving social graph analysis

In this section, cEGS is first introduced and described for event identification. Figure 2 illustrates a cEGS.
Fig. 2

Illustration of concept-based evolving graph sequences (cEGS)

Several studies (Shakarian et al. 2013; Broecheler et al. 2010) have proposed methods to model how information propagates in social networks. In these studies, information propagation behaviors are classified into different types based on user behaviors. For example, the Weighted Diffusion Model (Broecheler et al. 2010) classifies user relationships into six edge types with corresponding weights. This approach is useful for representing multi-relational networks when examining how information diffuses among users. As another example, the Multi-Attribute Networks and Cascades (MANCaLog) method (Shakarian et al. 2013) provides a logical language that can efficiently express user behaviors in a multi-attribute way, such as the gender of users, or weak or strong ties with other users.

These frameworks can be considered generalized frameworks for information propagation; however, it is challenging to adapt them in this paper because classifying user behaviors is not our focus. Our cEGS can be considered a specialized framework emphasizing concept propagation, under the assumption that all (direct) user relationships are equivalent.

3.5.1 cEGS definitions

Due to the nature of Twitter relationships, which are directed and asymmetric, directed graphs are adopted for cEGS in our work.

Definition 8

(Concept-Based Evolving Graph Sequences) A cEGS is a sequence of directed graphs that demonstrates the information (event) propagation within social streams, where \(cEGS_i\) denotes the cEGS for event candidate \(E_i.\)

Given a cEGS = \(G_1,\ldots ,G_n,\) where each graph \(G_i\) is a social relation graph dedicated to a specified time period τ, a larger τ results in fewer graphs in the cEGS, which gives a faster computation time but might lose important details. A smaller τ preserves more details but requires a larger computational effort. In this work, τ = 1 day, since events should be monitored daily for better accuracy.

Furthermore, given a graph G, where V (G) and E(G) denote the set of vertices and the set of edges within it, respectively, we have the following definitions.

Definition 9

(cEGS Vertex) A vertex v represents a user that mentioned one or more keywords in candidate event E.

Definition 10

(cEGS Edge) An edge e in the graph represents the following relationship between users that mentioned keywords in candidate event E.

It should be pointed out that the graph is constructed incrementally. This means that vertices and edges inserted one day will be copied to the graph the next day. The incremental nature of this type of graph can model the information cascade, since the event leaves an impact not only today but also tomorrow and even into the following days.

The cEGS model can capture the information cascade, but it cannot represent information decay. Therefore, weights are used to model the fact that information decays over time. For instance, information mentioned in a previous week may no longer be relevant to the current week, and there is no point in keeping irrelevant information in the model; such irrelevant vertices and edges should be removed. Accordingly, a weight representing how important the information is in the cEGS is assigned to each vertex and edge.

Definition 11

(cEGS Weight) Let \(\omega _v\) represent the weight for vertex v and \(\omega _e\) represent the weight for edge e.

For each new vertex added to the graph, its weight is set to the maximum value (i.e., \(\omega _v = 1\)); likewise, for each new edge, \(\omega _e = 1.\) The weight is multiplied by ϖ (for simplicity, the value 0.5 is used in this study) for every new day. If the vertex or the edge has a new entry, its weight is reset to the maximum value.

A vertex or an edge will be removed from the graph once its weight falls below a predefined cut-off threshold. The cut-off threshold is defined as:

Definition 12

(Decay Threshold) Let \(\lambda\) represent the predefined decay threshold for all \(\omega _v\) and \(\omega _e.\)

The above discussion leads to the following two lemmas:

Lemma 2

If \(\omega _v \le \lambda ,\) remove vertex v from graph G.

Lemma 3

If \(\omega _e \le \lambda ,\) remove edge e from graph G.
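The daily weight update and the pruning of Lemmas 2 and 3 can be sketched as one step. The decay factor ϖ = 0.5 follows the paper; the threshold λ = 0.1 and all names are illustrative assumptions.

```python
def decay_step(weights, active_today, varpi=0.5, lam=0.1):
    """One daily cEGS weight update.

    `weights` maps a vertex or edge identifier -> current weight ω.
    Items seen today are (re)set to the maximum weight 1.0; all others
    are multiplied by the decay factor ϖ. Items whose weight falls to or
    below λ are removed (Lemmas 2 and 3)."""
    updated = {}
    for item, w in weights.items():
        w = 1.0 if item in active_today else w * varpi
        if w > lam:
            updated[item] = w
    for item in active_today:
        updated.setdefault(item, 1.0)  # newly inserted vertices/edges
    return updated
```

Under these parameters, an item that stays inactive is halved each day and disappears after its weight drops to λ or below.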

The following relationship is a direct relationship, meaning that one user directly follows the other. However, such a cEGS only preserves direct social structures and ignores most of the social information (i.e., the indirect social structures).

Indirect social structures also contain valuable information, especially since most systems can only access sampled data. Twitter data are usually incomplete and can only be accessed in a random fashion. If only direct relationships are utilized, most users appear not to follow each other. To address this issue, we add weak-tie relationships between users.
Fig. 3

Illustration of the idea of a weak tie

The weak tie is inspired by triadic closure: if two people in a social network have a friend in common, then there is an increased likelihood that they will become friends themselves at some point in the future (Rapoport 1953). To fit the current social networking situation, we slightly modify the friend relationship into a following relationship. For instance, in Fig. 3, if a user is a fan of Jeremy Lin, he will follow Jeremy online to acquire his latest news. In addition, Jeremy is a good friend of Chandler Parsons, and the two naturally follow each other. Information from Chandler can thus be passed to this user via Jeremy. In this example, there exists a weak tie between this user and Chandler, and we connect it with an explicit edge. The weak tie not only introduces more edges to the graph but also shows that even though these two users are not connected directly, they belong to the same community through another directly connected user.

Definition 13

(Weak Tie) Let A, B, and C represent three vertices in a directed graph. A has a weak tie to C if and only if the edge from A to B and the edge from B to C exist, while the edge from A to C does not exist.
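Definition 13 can be sketched as a direct enumeration over the follow edges; the function name and the set-based representation are our own assumptions.

```python
def weak_ties(edges):
    """All weak ties per Definition 13: A has a weak tie to C iff the
    edges A->B and B->C exist while A->C does not.

    `edges` is a set of directed (follower, followee) pairs; returns the
    set of inferred (A, C) weak-tie pairs."""
    following = {}
    for a, b in edges:
        following.setdefault(a, set()).add(b)
    ties = set()
    for a, outs in following.items():
        for b in outs:
            for c in following.get(b, set()):
                # Exclude self-loops and pairs already directly connected.
                if c != a and c not in outs:
                    ties.add((a, c))
    return ties
```

In the Fig. 3 scenario (fan → Lin, Lin ↔ Parsons), the only inferred weak tie is fan → Parsons.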

3.5.2 Analysis of cEGS

After constructing the cEGS, information trends are preserved in the corresponding graphs. Hence, the cEGS can be directly analyzed for event identification by utilizing the following measurements:
  1. Number of vertices (\(V\)): the number of vertices for each graph in the cEGS is calculated. This represents the number of unique users who mentioned the corresponding event on a given day. Further use of this measurement can eliminate ads or spam left in the remaining dataset. Formally,
    $$V= \{ v_1, \ldots , v_n \},$$
    where \(v_i\) represents the number of vertices in graph \(G_i\) on day i.
  2. Number of edges (\(E\)): the number of edges for each graph in the cEGS is calculated to represent the number of unique relationships between users who mentioned the corresponding keyword on a given day. Formally,
    $$E = \{ e_1, \ldots , e_n \},$$
    where \(e_i\) represents the number of edges in graph \(G_i\) on day i.
  3. Number of connected components (\(C\)): the number of connected components for each graph in the cEGS is calculated. This represents the number of communities or groups of people who talked about the corresponding event. Formally,
    $$C = \{ c_1, \ldots , c_n \},$$
    where \(c_i\) represents the number of connected components in graph \(G_i\) on day i.
  4. Reciprocity (\(R\)): the last measurement, reciprocity, indicates the degree of mutuality in the network. Specifically, the purpose of this measurement is to detect the interaction between users and to perceive how the users engage in the topic.

Definition 14

(Reciprocity) Reciprocity is the ratio of all edges in a social relation graph over the possible number of edges in the graph. Two vertices are said to be related if there is at least one edge between them.

Reciprocity is a useful indicator of the degree of mutuality and reciprocal exchange in a network, which in turn relates to social cohesion. Higher reciprocity means greater interaction between users.
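The four per-graph measurements can be sketched as follows. Connected components are computed on the undirected version of the graph, and reciprocity is implemented under our reading of Definition 14, i.e., the number of directed edges over the number of possible directed edges; both of these readings, and all names, are assumptions rather than fixed by the paper.

```python
def graph_measures(vertices, edges):
    """Compute V, E, C, and R for one graph G_i in a cEGS.

    `edges` is a set of directed (u, v) pairs over `vertices`."""
    n = len(vertices)
    # Weakly connected components via union-find with path halving.
    parent = {v: v for v in vertices}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    components = len({find(v) for v in vertices})
    possible = n * (n - 1)  # possible directed edges among n vertices
    reciprocity = len(edges) / possible if possible else 0.0
    return {"V": n, "E": len(edges), "C": components, "R": reciprocity}
```

For a four-user graph with a mutual pair and one one-way follow, this yields two components and a reciprocity of 0.25.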

3.6 Event identification

Finally, the measurements explained in previous sections are employed to identify events on cEGS. Three types of events are identified:

3.6.1 Type-1 event: one-shot event

A type-1 event is an event that receives popularity in a short period of time. Usually, this type of event occurs suddenly without any warning or notification, such as the Japan tsunami in 2011 or a celebrity’s sudden death.

The characteristics of a type-1 event are
  1.
    A significant increase in the number of unique users mentioning the event: to measure this, \(V\) on a given day is compared with that of the previous day. A significant increase is determined by the predefined threshold \(\tau_{V}\), which represents the minimum ratio required for an event to be categorized as a type-1 event. A larger threshold value results in fewer detected type-1 events, as it is stricter. Consider the following condition:
    $$\begin{aligned} {\mathbf {Type{-}1\;Condition\;1{:}}}\quad \frac{v_i}{v_{i-1}} > \tau _{V} \end{aligned}$$
    If type-1 condition 1 holds, there is a sudden increase in terms of the number of unique users.
     
  2.
    Furthermore, to check whether information is propagated between users, the threshold \(\tau _{E}\) is defined to indicate the required ratio of increase in relationships in the graph. This leads to the second condition:
    $$\begin{aligned} {\mathbf {Type{-}1\;Condition\;2{:}}}\quad \frac{e_i}{e_{i-1}} > \tau _{E} \end{aligned}$$
     
  3.

    Attracting the general public, rather than a specific group of people: in order to measure this, C, which represents the number of groups of people described in the methodology section, is employed. This calculates how many connected components (i.e., groups of people or communities) mention the event to make sure that this event is important to people in general and is not limited to one community.

    As an example, consider a big sale in a local area. While this event might interest people living in that area, it will not impact the general public. To capture impact on the general public, the threshold \(\tau_{C}\) is defined, which represents the minimum required ratio of increase in connected components.
    $$\begin{aligned} {\mathbf {Type{-}1\;Condition\;3{:}}}\quad \frac{c_i}{c_{i-1}} > \tau _{C} \end{aligned}$$
     
An event is categorized as type-1 if and only if that event has at least one time frame for which all three conditions are true.
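A minimal sketch of this type-1 check, assuming the per-day series \(V\), \(E\), and \(C\) are given as lists; the default threshold values are illustrative, not the calibrated ones:

```python
def is_type1_event(V, E, C, tau_v=2.0, tau_e=2.0, tau_c=2.0):
    """Return True if some day i satisfies all three ratio conditions
    against day i-1 (one-shot event). Thresholds are illustrative."""
    for i in range(1, len(V)):
        if V[i-1] and E[i-1] and C[i-1]:        # avoid division by zero
            if (V[i] / V[i-1] > tau_v and
                E[i] / E[i-1] > tau_e and
                C[i] / C[i-1] > tau_c):
                return True                      # all conditions hold on day i
    return False
```

A sudden jump in users, edges, and communities on one day suffices; gradual growth on every day does not.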

3.6.2 Type-2 event: long-run event

This type of event is considered a long-run event because it requires time to propagate within a social network. A type-2 event does not show a significant increase in the number of users, but attracts substantial discussion among them. It might belong to a specific group of people in the beginning; people from this group engage in discussion about the event and then attract other users. Such an event has the potential to become an important issue to a larger group beyond the initial one.

In order to identify a type-2 event, the reciprocity described in the methodology section is employed as the measurement. Formally,
$$R = \{ r_1, \ldots , r_n \},$$
where ri represents the reciprocity value for graph Gi on day i.

Rather than thresholding the absolute reciprocity required for an event to be categorized as type-2, the reciprocity of one day is compared with that of the previous day to detect a significant relative increase. A larger ratio threshold results in fewer detected events.

Type-2 Condition 1: \(r_i > r_{i-1} \cdot \tau _{\mathrm{r}}\), where \(\tau _{\mathrm{r}}\) represents the required ratio.

An event is categorized as type-2 event if and only if that event has at least one day wherein the condition above is fulfilled.
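The type-2 check can be sketched similarly (the threshold value is illustrative, not the one calibrated in the paper):

```python
def is_type2_event(R, tau_r=1.5):
    """Return True if reciprocity on some day exceeds the previous
    day's value scaled by tau_r (long-run event). Threshold illustrative."""
    return any(R[i] > R[i-1] * tau_r for i in range(1, len(R)))
```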

3.6.3 Non-event

The last category is the non-event. An event candidate is categorized as a non-event if it falls into neither the type-1 nor the type-2 category.

4 Experiments

4.1 Experimental setup

Table 1 lists the parameters needed to operate the proposed system. Twitter was employed as the main experimental data source, and 280 million tweets from March, April, October, and November 2013 were crawled. In addition, 2.5 million users who posted tweets, along with their following relationships, were collected. The dataset was divided into two parts: training data (tweets from March to April 2013) and testing data (tweets from October to November 2013). The training data was used to adjust the thresholds needed in the system, and the determined thresholds were then applied to the testing data. As a protocol, candidate threshold values were compared in terms of accuracy and precision to determine the final values used in our work. Because the time period of the collected tweets was short, the sliding window size was fixed at 30 days. The implementation of the proposed system can be downloaded from https://osf.io/fhk9a/.
Table 1

Parameter values of the system

| Parameter | Symbol | Value | Fixed |
| --- | --- | --- | --- |
| Decay value | \(\varpi\) | 0.5 | Yes |
| Decay threshold | \(\lambda\) | 0 | Yes |
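The threshold calibration on the training data can be sketched as a simple grid search over candidate values. The grid values, input format, and function names below are illustrative assumptions; the paper performs this tuning manually.

```python
from itertools import product

def calibrate_thresholds(training_series, labels, grid=(1.5, 2.0, 3.0)):
    """Pick (tau_v, tau_e, tau_c) maximizing precision on labeled training
    candidates. `training_series` maps a candidate event to its daily
    (V, E, C) lists; `labels` maps it to True (real event) or False."""
    def detects_type1(V, E, C, tv, te, tc):
        # Type-1: all three ratio conditions hold on some day i.
        return any(V[i-1] and E[i-1] and C[i-1]
                   and V[i] / V[i-1] > tv
                   and E[i] / E[i-1] > te
                   and C[i] / C[i-1] > tc
                   for i in range(1, len(V)))

    best, best_prec = None, -1.0
    for tv, te, tc in product(grid, repeat=3):
        detected = [k for k, (V, E, C) in training_series.items()
                    if detects_type1(V, E, C, tv, te, tc)]
        if detected:
            prec = sum(labels[k] for k in detected) / len(detected)
            if prec > best_prec:
                best_prec, best = prec, (tv, te, tc)
    return best, best_prec
```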

The performance measurement employed in the experiment was the precision metric. To obtain this metric, seven participants were presented with lists of extracted events for each day in November 2013. Each event on the list was represented by its core keyword set. The participants not only indicated whether they considered each event an actual one, by specifying 0 for no and 1 for yes, but also provided explanations for their evaluations. These binary results were used to calculate the precision value for each day in November 2013.

In order to fairly evaluate the performance of the proposed technique, we implemented the TimeUserLDA method proposed by Diao et al. (2012). TimeUserLDA can be considered an enhancement of the latent Dirichlet allocation (LDA) model (Wang et al. 2007) and can significantly increase the precision of identified events. To demonstrate the performance of our cEGS technique, we compared cEGS with different aspects of TimeUserLDA.

4.2 Experimental results

4.2.1 Damping factor

The damping factor used in TextRank (Mihalcea and Tarau 2004) is an important parameter that influences the performance of keyword ranking.

In order to find the best damping factor value for our event detection approach, we performed an evaluation of keyword ranking quality for different damping factors. We used TextRank to rank the daily top ten keywords of Twitter data from November 1 to November 30, 2013. Each keyword is assigned a score according to its rank: the first keyword receives a score of ten, the second a score of nine, and so on down to a score of one for the last keyword.

We then score each keyword's corresponding event detected by our method, where the event score is the number of tweets mentioning that event. Multiplying each keyword score by its corresponding event score and summing yields the ranking score for a given damping factor; a higher ranking score implies better ranking quality. The experimental result is shown in Fig. 4, which indicates that the ranking quality of TextRank is best when the damping factor is 0.85.
Fig. 4

Damping factor performance
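The ranking-score computation described above can be sketched as follows. Names are ours; the score generalizes the 10-down-to-1 scheme to keyword lists of any length, and keywords with no detected event are assumed to contribute zero.

```python
def damping_ranking_score(daily_keyword_lists, event_tweet_counts):
    """Score one damping-factor setting.

    Each daily list holds the top keywords in ranked order; with ten
    keywords the first gets 10 points and the last 1. `event_tweet_counts[k]`
    is the tweet count of keyword k's detected event (0 if none).
    """
    total = 0
    for keywords in daily_keyword_lists:
        for pos, kw in enumerate(keywords):
            keyword_score = len(keywords) - pos          # rank-based score
            total += keyword_score * event_tweet_counts.get(kw, 0)
    return total
```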

4.2.2 Predefined event

To begin, the experiments are conducted for a predefined event, "Boston Marathon Explosion." The Boston Marathon explosion occurred on April 15, 2013; hence, 30 days (one month) of data are selected, from April 1, 2013 to April 30, 2013, covering up to 70 million tweets. The system successfully identifies this event by returning a type-1 event on April 15, with the core keyword set "library jfk blast bostonmarathon prayforboston pd possible area boston police social explosion bomb bombing marathon confirm incident."
Fig. 5

April 15th Boston marathon explosion

Figure 5 shows the corresponding statistics for the event "Boston Marathon Explosion" within the time frame of April 2013. As illustrated, the number of unique users, the number of following edges in \(cEGS,\) and the number of communities created for the event suddenly increased on April 15, 2013. Thus, the event is, unsurprisingly, categorized as a type-1 event. Noticeably, the reciprocity scores based on weak ties convey more information. They show that before the Boston Marathon explosion occurred, some subset of core keywords (such as Boston Marathon) was already receiving considerable attention. This observation is not surprising, since the Boston Marathon is an annual activity and ranks among the world's best-known racing events.

Moreover, the reciprocity scores indicate that people remained concerned with this event after it occurred. Discussions between users increase intensively from April 19 to 22, and from April 27 to April 30. Accordingly, two major developments become apparent when checking the timeline records from Wikipedia. On April 18 at 5 pm, the FBI released pictures of suspects in connection with the Boston Marathon bombings, and the suspects were found on April 19. Immediately following the release of the pictures, law enforcement agents began collecting evidence on the suspects. On April 26, the backpack of the leading suspect was recovered. On May 1, three friends of the leading suspect were charged. The trend of reciprocity scores of cEGS aligns closely with those developments, demonstrating that the reciprocity scores of cEGS can accurately measure the attention an event received in terms of interaction.

To further verify the type-1 event, a second experiment was conducted on the predefined event "Margaret Thatcher's Death." Margaret Thatcher died on April 8, 2013. The system detects that on April 8, 2013, there was a type-1 event correlated with Margaret Thatcher's death, with the core keyword set "former prime minister rip uk margaret thatcher spokeswoman news death cameron cnn iron lady stroke coverage female leader pm briton 87 britain."
Fig. 6

April 8th Margaret Thatcher death

Figure 6 illustrates the number of users and the corresponding reciprocity scores from April 6 to April 18, 2013. The figure shows the number of users rising sharply on April 8, because news of Thatcher's death broke that day. Following the news, the number of users decreases gradually. Meanwhile, the reciprocity score on April 8 is close to that of the previous day. This clearly indicates a type-1 event.

Interestingly, the reciprocity scores gradually rise again after April 12, 2013. This can be explained as users not only mourning the event of Thatcher’s death, but also engaging in longer-term discussion about her and related political issues. The discussion continued until April 17, which was the date of Thatcher’s ceremonial funeral.
Fig. 7

Comparison of the precision of our method with and without using cEGS

4.2.3 Comparison with/without cEGS analysis

Next, we compared performance with and without cEGS analysis, as shown in Fig. 7. Before applying the cEGS analysis, the system extracted a total of 85,935 event candidates out of 23,763,138 tweets from April 2013. After applying cEGS analysis, only 7,395 events remained.

Since the quantity of extracted event candidates and events was too large to be evaluated manually, we only evaluated a sampled dataset: from April 2013, we randomly sampled 100 events from five randomly selected days and manually evaluated the corresponding precision. As revealed by Fig. 7, the precision of extracted events improved significantly after applying the cEGS analysis (the improvement rate was usually above 200 %). This result demonstrates that cEGS analysis was essential in identifying the events, since the cEGS information captures the information flows in social networks.

In practice, Twitter cannot release a complete dataset (i.e., only sampled data can be obtained). Under this circumstance, the obtained data might be biased (e.g., affected by bots, abnormal behaviors, or spam). Furthermore, the number of obtained tweets varied from day to day. For example, our crawler obtained 900 K tweets on April 23rd, but only 500 K tweets the next day. The accuracy of the obtained events might be seriously affected by those biases if propagation behaviors are not considered through cEGS analysis. With cEGS, however, our proposed technique was able to effectively reduce the influence of bias.

Next, the performance of the system is measured as applied to the automatic identification of events. Figure 8 shows the precision value for each day of November 2013.
Fig. 8

Precision (participants)

Fig. 9

Precision (Amazon Mechanical Turk: Evaluation 1)

Fig. 10

Precision (Amazon Mechanical Turk: Evaluation 2)

In Fig. 8, we can see that most events were identified as authentic by our participants, and precision on most days is over 80 %. In fact, the average precision for the entire month of November 2013 was 86.64 %. Days such as November 2, 4, and 27 showed low precision, but some of these results can be explained as follows:

For instance, on November 2, terms such as "thunder heat" and "wade heat" were not identified as events by our participants because of their strict standards. These keywords represent American basketball teams and players from those teams. Both are top-ranked teams, making them popular on the national stage, and both played games on November 1, 2013. However, our participants could not recognize them as events for two reasons: (1) the teams had no game on November 2; and (2) the extracted core keyword sets were not specific enough to link to the games of November 1.

Other days with low precision, such as November 4, legitimately lacked actual events. Keywords such as "exam," "monday," "study," "econ," "paper," "chem," and "math" were identified as one event by the system but are too general to be classified as an event by our participants. Table 2 lists other inaccurate events and the reasons provided by our participants.

Furthermore, we performed an additional experiment using Amazon Mechanical Turk (AMT) with two hundred participants. The participants were asked to verify whether each set of keywords represents an event and to annotate whether they agree. Each event was checked by at least three experts. A small portion of the annotations was lost because some participants did not complete their tasks. We utilized the results from AMT to perform two different types of evaluations, as follows:

Evaluation 1: In this evaluation, the participants annotated whether each event is true. Here, an event is true if and only if all participants annotated it as a true event; likewise, an event is false if and only if all participants verified it as a non-event. An event's correctness cannot be determined if the annotations are not unanimous. Figure 9 shows the result of this evaluation; the results slightly increase compared to the previous evaluation shown in Fig. 8.

Evaluation 2: In this evaluation, an event is true if and only if more participants annotated it as true than as false; likewise, an event is false if and only if fewer participants annotated it as true than as false. An event's correctness cannot be determined when the numbers of true and false annotations are equal. Figure 10 shows the result of this evaluation.
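Both aggregation rules can be sketched in one helper. This is a hypothetical implementation of Evaluations 1 and 2, where 1 denotes a "true event" annotation and `None` means correctness cannot be determined.

```python
def aggregate(annotations, rule):
    """Aggregate per-event binary annotations (1 = true event, 0 = not).

    rule='unanimous' implements Evaluation 1, rule='majority' Evaluation 2;
    returns True, False, or None when correctness cannot be determined.
    """
    yes = sum(annotations)
    no = len(annotations) - yes
    if rule == 'unanimous':
        if no == 0:
            return True        # everyone said true event
        if yes == 0:
            return False       # everyone said non-event
        return None            # not unanimous: undetermined
    # majority rule
    if yes > no:
        return True
    if yes < no:
        return False
    return None                # tie: undetermined
```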
Table 2

Reasons for incorrect events

| Date | Event | Reason for incorrectness |
| --- | --- | --- |
| Nov 4 | Biology chem monday study econ paper math test chemistry homework | Usually appears on Mondays; considered personal activities |
| Nov 4 | Study motivation monday homework | Usually appears on Mondays; considered personal activities |
| Nov 14 | Throwback tbt wcw | "Throwback Thursday, Woman Crush Wednesday"; usually appears on Thursdays and Wednesdays |
| Nov 16 | 2-0 England France | England and France did not have a game that day |
| Nov 18 | Steelersnation steeler | Football game between the Lions and the Steelers; see http://www.sbnation.com/fantasy/2013/11/17/5107994/fantasy-football-week-11-lions-steelers-matthew-stafford-calvin-johnson-ben-roethlisberger |
| Nov 22 | Rebecca achieve | A person; cannot be related to news on this date |
| Nov 27 | Photo photography | According to Google Trends, many celebrities posted Thanksgiving photos on this day; the event is not specific enough |
| Nov 27 | Ft youtube prod feat ft. | According to Google Trends, Sharkeisha gained attention for a closed-fisted sneak attack on this day; the identified event is not specific enough to relate to it |

For the candidate events which were positively corroborated by our participants, we assigned one of seven generalized categories to each event. Our results can be seen in Fig. 11 below:
Fig. 11

Event category analysis

Here we see that over 60 % of the corroborated events are sporting events. Based on our experiments, sports fans thus appear to be the most active community members on Twitter: they tweet and discuss their ideas more often, and in much greater numbers, than any other group. Entertainment/showbiz is the second-largest category, but it is still very small compared to the volume of sports discussion; in fact, all other categories combined still fall short of the attention given to sporting events. Therefore, we believe it is essential to identify events using different standards for different categories.
Fig. 12

Comparison of the precision of our method and TimeUserLDA

4.2.4 Comparison with TimeUserLDA

Next, we compared our proposed method with the TimeUserLDA method (Diao et al. 2012). TimeUserLDA identified 45 bursty topics in total from 23,763,138 tweets during April 2013, whereas our method identified 7,395 events from the same dataset. In order to fairly compare the recall rates, we ranked the 7,395 identified events by their rising rates relative to the previous month and selected the same number of events (i.e., 45) for comparison.

Figure 12 reveals the precision of the two methods. The precision rates of our method and TimeUserLDA in the top 45 events were 91 and 98 %, respectively. Our method's precision appears lower than that of TimeUserLDA. However, as mentioned above, TimeUserLDA retrieved only 45 events, while our method identified far more: the precision of our top 100 events was above 86 %. This demonstrates that the diversity of events in our method is higher than in TimeUserLDA.

Besides our subjective evaluation derived from the results obtained on AMT, we included an objective evaluation leveraging top rising queries from Google Trends.

We use the top rising queries extracted from Google Trends to further compare the diversity of the events and the recall rates. The top rising queries from Google Trends can be classified into three types, namely breakout events, 500 % rising events, and all rising events in April 2013. This comparison was conducted automatically using edit distances: an identified event was considered correct as long as the edit distance was less than 30 % of the word length.
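A sketch of this matching rule, using a standard Levenshtein distance; treating the whole query string as the "word length" is our reading of the criterion, and the function names are ours.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matches_query(event, query, ratio=0.3):
    """An identified event counts as matching a Google Trends query
    when the edit distance is below 30 % of the query length."""
    return edit_distance(event.lower(), query.lower()) < ratio * len(query)
```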
Fig. 13

Recall rates of the breakout events

Figure 13 compares the recall rates for the breakout events. As revealed, both techniques exhibit recall rates of 100 % for the hot-topic events; in other words, both techniques identify bursty, hot-topic events well. For overall events, however, our method achieves a recall rate of 47 %, exceeding the 37 % obtained by TimeUserLDA.
Fig. 14

Recall rate comparison

Figure 14 demonstrates that our method significantly outperformed the TimeUserLDA method for different circumstances. Among 75 Google top 500 % rising queries (where different queries might match to a single event), our method was able to match 34 queries in the top 45 while TimeUserLDA could only match 22 queries. Among a total of 307 Google trend queries, our method was able to match 90 queries in the top 45, while TimeUserLDA could only match 57 queries.

These experimental results demonstrate that our proposed cEGS technique outperforms the TimeUserLDA method by at least 27 % for the top 30 breakout events and by up to 58 % for overall events. They also show that our cEGS method can not only identify the same bursty events as TimeUserLDA, but also identify more types of events.
Fig. 15

Computation complexity of our method and TimeUserLDA

Next, we compared scalability with TimeUserLDA across different data sizes, as shown in Fig. 15. Our method required about 7,000 seconds to complete the overall processing of a dataset containing 23,763,138 tweets from 770,000 users, whereas TimeUserLDA required about 60,000 seconds on the same tweets covering 150,000 users. Note that the number of users handled by our cEGS method was five times that of TimeUserLDA, while our method required only 11.6 % of the time. This result demonstrates that our proposed method is practical to adopt.

Finally, we assess the ability of the proposed method to identify "long-term" and "one-shot" events and gauge the effects of weak-tie relationships on reciprocity. A one-shot event identified by our method was daylight saving time, which took place on November 3, 2013. As indicated by Fig. 16, the weak-tie relationships provide much more insight into reciprocity than the direct relationships, and thus aided greatly in identifying this event.
Fig. 16

November 3rd daylight saving

Figure 17 demonstrates that we are able to identify the long-term event of a British memorial period for war veterans, which commences at the start of November each year and lasts through the month. Some of the more popular keywords related to this long-term event were "sacrifice soldier sunday church remembrance silence." Using only direct relationships we were able to adequately identify this event, but when weak-tie relationships were taken into account in the reciprocity calculation, the identification became overwhelmingly obvious, as illustrated in Fig. 17. This leads us to believe that weak-tie relationships are a very important feature: Fig. 17 shows that they amplify reciprocity and thus provide even more support for our method. Moreover, the figure also indicates that the performance of our proposed technique could be affected by biased samples, such as the trends from November 13-14 and 19-21. Since we can only collect randomly sampled tweets from Twitter, the sampled tweets might not contain certain events, and data obtained from social networks might not be complete. Weak-tie relationships can lessen concerns related to such incompleteness and bias.
Fig. 17

November 11th remembrance day
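As a rough sketch of how weak ties can amplify reciprocity, the following augments the direct relationship graph with friend-of-friend (two-hop) ties before applying Definition 14. This particular augmentation rule is our assumption, since the exact weak-tie construction is not spelled out in this section.

```python
from itertools import combinations
from collections import defaultdict

def reciprocity(users, edges):
    """Edge ratio from Definition 14: edges over possible edges."""
    n = len(users)
    e = len({frozenset(p) for p in edges if p[0] != p[1]})
    return e / (n * (n - 1) / 2) if n > 1 else 0.0

def weak_tie_reciprocity(users, edges):
    """Reciprocity after adding two-hop (friend-of-friend) ties.

    Hypothetical sketch: every pair of users sharing a common neighbor
    is treated as weakly tied before the edge ratio is computed.
    """
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    augmented = {frozenset(p) for p in edges if p[0] != p[1]}
    for mid in users:
        for a, b in combinations(adj[mid], 2):
            augmented.add(frozenset((a, b)))   # weak tie through `mid`
    return len(augmented) / (len(users) * (len(users) - 1) / 2)
```

On a three-user chain a-b-c, direct reciprocity is 2/3, while the weak-tie version reaches 1.0, illustrating the amplification seen in Figs. 16 and 17.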

5 Conclusion and future work

In this work, we introduced cEGS for event identification and demonstrated how social structures can be utilized to identify events in social streams. Our method can not only identify events, but also categorize them into one-shot and long-term events. Indeed, over 80 % of the extracted events were evaluated as accurate by the participants.

Currently, the parameters utilized in the framework are manually calibrated based on the training dataset. In future work, we would like to apply machine learning techniques to automatically tune the parameter values. We would also like to introduce different types of events based on social community detection. Moreover, as social networks become ever more significant in our daily lives, there will be growing opportunities to apply cEGS to other social network problems, such as sentiment analysis.

References

  1. Allan J (2002) Topic detection and tracking: event-based information organization. The information retrieval series. Springer, Berlin
  2. Alvanaki F, Sebastian M, Ramamritham K, Weikum G (2011) Enblogue: emergent topic detection in web 2.0 streams. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, p 1271–1274
  3. Alvanaki F, Michel S, Ramamritham K, Weikum G (2012) See what’s enblogue: real-time emergent topic identification in social media. In: Proceedings of the 15th international conference on extending database technology, ACM, p 336–347
  4. Bakshy E, Rosenn I, Marlow C, Adamic LA (2012) The role of social networks in information diffusion. In: Proceedings of World Wide Web, p 519–528
  5. Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on twitter. In: Proceedings of international AAAI conference on weblogs and social media, p 438–441
  6. Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on machine learning, ACM, p 113–120
  7. Broecheler M, Shakarian P, Subrahmanian V (2010) A scalable framework for modeling competitive diffusion in social networks. In: Proceedings of the IEEE second international conference on social computing (SocialCom), IEEE, p 295–302
  8. Cataldi M, Di Caro L, Schifanella C (2010) Emerging topic detection on twitter based on temporal and social terms evaluation. In: Proceedings of the tenth international workshop on multimedia data mining, ACM, MDMKDD ’10, p 4:1–4:10
  9. Diao Q, Jiang J, Zhu F, Lim EP (2012) Finding bursty topics from microblogs. In: Proceedings of the 50th annual meeting of the association for computational linguistics: long papers, volume 1, Association for Computational Linguistics, p 536–544
  10. Dou W, Wang X, Skau D, Ribarsky W, Zhou MX (2012) Leadline: interactive visual analysis of text data through event identification and exploration. In: Proceedings of IEEE conference on visual analytics science and technology (VAST), IEEE, p 93–102
  11. Du Y, He Y, Tian Y, Chen Q, Lin L (2011) Microblog bursty topic detection based on user relationship. In: Proceedings of the 6th IEEE joint international conference on information technology and artificial intelligence (ITAIC), IEEE, vol 1, p 260–263
  12. Fung GPC, Yu JX, Yu PS, Lu H (2005) Parameter free bursty events detection in text streams. In: Proceedings of the 31st international conference on very large data bases, VLDB Endowment, p 181–192
  13. Gottron T, Radcke O, Pickhardt R (2013) On the temporal dynamics of influence on the social semantic web. In: Springer proceedings in complexity on semantic web and web science, Springer, p 75–87
  14. Granovetter M (1973) The strength of weak ties. Am J Sociol 78(6):1360–1380
  15. Guzman J, Poblete B (2013) On-line relevant anomaly detection in the twitter stream: an efficient bursty keyword detection model. In: Proceedings of the ACM SIGKDD workshop on outlier detection and description, ACM, p 31–39
  16. Hong L, Ahmed A, Gurumurthy S, Smola AJ, Tsioutsiouliklis K (2012) Discovering geographical topics in the twitter stream. In: Proceedings of the 21st international conference on World Wide Web, ACM, p 769–778
  17. Kumaran G, Allan J (2004) Text classification and named entities for new event detection. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, ACM, p 297–304
  18. Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of World Wide Web
  19. Kwan E, Hsu PL, Liang JH, Chen YS (2013) Event identification for social streams using keyword-based evolving graph sequences. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining, ACM, ASONAM ’13, p 450–457
  20. Ma H, Wang B, Li N (2012) A novel online event analysis framework for micro-blog based on incremental topic modeling. In: Proceedings of the 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel & distributed computing (SNPD), p 73–76
  21. Mathioudakis M, Koudas N (2010) Twittermonitor: trend detection over the twitter stream. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, p 1155–1158
  22. Mihalcea R, Tarau P (2004) Textrank: bringing order into texts. In: Proceedings of EMNLP 2004, Association for Computational Linguistics, p 404–411
  23. Naaman M, Boase J, Lai CH (2010) Is it really about me?: message content in social awareness streams. In: Proceedings of the 2010 ACM conference on computer supported cooperative work, CSCW ’10, p 189–192
  24. Ohsawa Y, Benson NE, Yachida M (1998) Keygraph: automatic indexing by co-occurrence graph based on building construction metaphor. In: Proceedings of the IEEE international forum on research and technology advances in digital libraries (ADL), IEEE, p 12–18
  25. Petrovic S, Osborne M, McCreadie R, Macdonald C, Ounis I, Shrimpton L (2013) Can twitter replace newswire for breaking news? In: Proceedings of the seventh international AAAI conference on weblogs and social media, The AAAI Press
  26. Popescu AM, Pennacchiotti M (2010) Detecting controversial events from twitter. In: Proceedings of the 19th ACM international conference on information and knowledge management, p 1873–1876
  27. Pratt SF, Giabbanelli PJ, Mercier JS (2013) Detecting unfolding crises with visual analytics and conceptual maps emerging phenomena and big data. In: Proceedings of the IEEE international conference on intelligence and security informatics (ISI), IEEE, p 200–205
  28. Rapoport A (1953) Spread of information through a population with socio-structural bias: I. Assumption of transitivity. Bull Math Biophys 15(4):523–533
  29. Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World Wide Web, p 851–860
  30. Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) Twitterstand: news in tweets. In: Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems, p 42–51
  31. Sayyadi H, Hurst M, Maykov A (2009) Event detection and tracking in social streams. In: Proceedings of international AAAI conference on weblogs and social media
  32. Seo E, Mohapatra P, Abdelzaher T (2012) Identifying rumors and their sources in social networks. In: Proceedings of the SPIE conference on defense, security, and sensing, p 83891I
  33. Shakarian P, Simari GI, Callahan D (2013) 29th international conference on logic programming (ICLP-13) (tech. communication), Istanbul, Turkey, 24–28 Aug 2013
  34. Shuyo N (2010) Language detection library for java. http://code.google.com/p/language-detection/. Accessed 10 Dec 2013
  35. Twitter (2012) Twitter turns six. http://blog.twitter.com/2012/03/twitter-turns-six.html. Accessed 10 Dec 2013
  36. Valkanas G, Gunopulos D (2013) How the live web feels about events. In: Proceedings of the 22nd ACM international conference on information & knowledge management, ACM, p 639–648
  37. Wang X, Zhai C, Hu X, Sproat R (2007) Mining correlated bursty topic patterns from coordinated text streams. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, p 784–793
  38. Wasserman T (2012) Twitter says it has 140 million users. http://mashable.com/2012/03/21/twitter-has-140-million-users/. Accessed 10 Dec 2013
  39. Weng J, Lee BS (2011) Event detection in twitter. In: Proceedings of the international conference on weblogs and social media
  40. Zacks JM, Tversky B (2001) Event structure in perception and conception. Psychol Bull 127:3

Copyright information

© Springer-Verlag Wien 2015

Authors and Affiliations

  • Yi-Shin Chen, Yi-Cheng Peng, Jheng-He Liang, Elvis Saravia, Fernando Calderon, Chung-Hao Chang, Ya-Ting Chuang, Tzu-Lung Chen, Elizabeth Kwan
  1. Department of Computer Science, Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan
