
1 Introduction

Social media serve as important social sensors that capture the zeitgeist of society. With real-time, online, asynchronous message sharing that supports text, audio, video, and images, these platforms offer valuable opportunities to detect events such as natural disasters and terrorist activity in a timely manner. Twitter is a leader among microblogging platforms. With over 1.3 billion users, about 500 million messages (called tweets) posted per day, and over 15 billion API calls per day, Twitter provides a massive source for detecting events as they happen. Tweets are restricted to a maximum of 280 characters, and it is this design choice that makes information sharing extremely fast and real-time. Owing to this scale and speed of information updates, Twitter is a natural focus for new event detection. In addition, Twitter's social network features support user interactions, which help in gaining insight into an event's reception by analysing the sentiments and opinions people express in their tweets.

While Twitter analytics in general may entail a comprehensive multi-modal approach, for example, to harness relevant information from media (images, videos, audio), we argue that since the textual part of a message is authored by its creator, it is a more reliable, authentic, and credible personal source for insights into the information shared. However, while analyzing natural language is challenging in itself because of the inherent lack of structure in its expression, Twitter exacerbates this with short message text, slang and emojis, user-created meta-information tags (hashtags), and URLs. Traditional methods in natural language processing (NLP) were designed for long discourses and perform poorly on social media text, for the following reasons:

  • Tweets are short mainly because of the restrictions imposed by the underlying platform (i.e. Twitter), but sometimes also under the influence of the cultural norms of cyberspace;

  • Tweets are often ungrammatical. Under the constraint of typing short messages on a mobile phone, grammar often takes a back seat; moreover, because social media have grown to a global scale, a significant number of users are non-native speakers of English;

  • Tweets are highly contextual, i.e., the message cannot be understood without the context – geopolitical, cultural, topical (current affairs), and even conversational context (e.g. replies to previous messages);

  • Tweets use a lot of slang and emojis; exacerbating this, new slang words and emojis are invented and popularized every day and their meanings are volatile.

Indeed, tweets contain polluted content [18] and rumors [6], which negatively affect the performance of event detection algorithms. Moreover, only a few tweets actually carry a message about a newsworthy event [13]. These limitations motivate the current work, which aims to develop an enhanced Twitter event detection system that integrates external sources of information such as DBPedia [4] and WordNet [7], exploits the social network features of Twitter, and integrates knowledge obtained from annotated resources (such as URLs cited in tweets pointing to external web pages).

The Topic Detection and Tracking (TDT) project [2] defines an event as “some unique thing that happens at some point in time”. [5] defines an event as “a real-world occurrence e with an associated time period \(T_e\) and a time-ordered stream of Twitter messages \(M_e\) of substantial volume, discussing the occurrence and published during time \(T_e\)”. We use these definitions as a guideline in developing our event detection system for Twitter. We exploit specific features of Twitter that allow users to share others’ tweets by retweeting, quoting, or liking them, and to embed hashtags and URLs.

Inspired by the existing event detection methods for detecting and tracking event-related segments [10, 19, 31], our system is based on identifying important phrases from individual tweets and creating a temporal profile of these phrases to identify if they are bursty enough to signify an occurrence of an event. However, our system differs from the state-of-the-art in the following:

  • We utilize DBPedia to efficiently identify important named entities.

  • We use WordNet to assist in identifying event-specific words and phrases.

  • We expand the URLs embedded in tweets to find important phrases from the titles of the web pages that these URLs point to.

  • We harness the meta-data from Twitter such as ‘quoted status’, ‘retweet’, ‘liked’, ‘in reply to’, and also, the social network of the Twitter user.

The rest of the paper is organized as follows. In Sect. 2, we discuss the work closely related to ours. Section 3 describes our system for Twitter event detection. Section 4 explains our experimental setup and results followed by discussion. Finally, in Sect. 5, we present our conclusions.

2 Related Work

Event detection from Twitter has been extensively studied in the past, as evidenced by a rich body of work [3, 14, 25, 30], among others. Techniques for event detection from Twitter can be classified according to the event type (specified or unspecified events), detection method (supervised or unsupervised learning), and detection task (new event detection or retrospective event detection). However, most of the techniques described in the aforementioned surveys lack rigorous evaluation. A major acknowledged obstacle in measuring the accuracy and performance of event detection methods is the lack of large-scale, benchmarked corpora. Some authors have created their own manually annotated datasets and made them publicly available [23, 26].

Our work focuses on unsupervised, unspecified event detection performed retrospectively over a large body of tweets. Among the many approaches to event detection from Twitter, such as keyword-volume approaches, topic modeling, and sentiment-analysis-based methods, our work is based on keyphrase/segment detection and tracking, which aims to identify keyphrases/segments whose occurrences grow unusually within the corpus [8, 9, 11, 12, 16, 17, 29]. The works most closely related to ours are [1, 19, 27, 31].

EDCoW [31] proposed a three-step approach. First, a wavelet transform and auto-correlation are applied to measure the bursty energy of each word, and words with high energies are retained as event features. Then, the similarity between each pair of event features is measured using cross-correlation. Finally, modularity-based graph partitioning is used to detect the events, each of which contains a set of words with high cross-correlation.

[19] presented a system called Twevent that analyzes the tweets by breaking them into non-overlapping segments and subsequently identifying bursty segments. These bursty segments are then clustered to obtain event-related segments.

[23] contributed a large manually labeled corpus of 120 million tweets containing 506 events in 8 categories. They used Locality Sensitive Hashing (LSH) technique followed by cluster summarization and employed Wikipedia as an external knowledge source.

[1] employed a statistical analysis of the historical usage of words to find bursty words: those with a burstiness degree more than two standard deviations above the mean are selected and clustered. However, their method was used to find localized events only.

[10] proposed a mention-anomaly based approach that incorporates the social aspect of tweets by leveraging the creation frequency of the mentions users insert in tweets to engage discussion. [22] advocated the importance of named entities in Twitter event detection; they partition tweets into clusters based upon the entities they contain, then apply burst detection and cluster selection to extract clusters related to ongoing real-world events.

Recently, [28] extracted a structured representation from tweet text using NLP, which is then integrated with DBpedia and WordNet in an RDF knowledge graph. Their system enables security analysts to describe events of interest precisely and declaratively using SPARQL queries over the graph.

3 Our System

Our system, Metadata-assisted Twitter Event Detection (MaTED), is an extension of previous work, Twevent [19]; however, it makes use of several features of Twitter ignored in prior research.

Figure 1 shows the architecture of MaTED, which consists of four components: i) detecting important phrases in tweets; ii) creating temporal profiles of these phrases to identify bursty phrases; iii) clustering bursty phrases to group related phrases about an event; and iv) characterizing an event from the clusters obtained. We parse the tweet JSON object after receiving it from a stream, and the first component identifies important segments/phrases not just from the tweet text but also from the titles of the webpages that URL links in the tweet point to. Since event-related phrases are mostly named entities, we harness DBPedia to extract such phrases. The resulting phrases, along with tweet timestamps, are then fed to the next component, which estimates their burstiness using statistical modeling of their occurrence frequency. Subsequently, we group the event-related phrases with a graph-based clustering algorithm. In the rest of this section, we describe each component in detail in the order of its use in our framework.

Fig. 1.

MaTED system architecture

3.1 Identifying Important Phrases from Tweets

In this component, we parse the tweet JSON object to obtain tweet text, hashtags, URLs, user mentions and other available metadata. We then create a set of phrases/keywords to be monitored consisting of the following items:

  • List of named entities obtained by submitting the tweet text (after pre-processing and cleaning) to the DBPedia Spotlight [24] web service.

  • List of words related to an action or activity present in the original tweet message. These are identified as words that are direct or indirect hyponyms of the ‘event.n.01’ synset of WordNet.

  • List of hashtags included in the tweet.

  • For each URL cited in the tweet, we obtain the title text of the web page that the URL points to. We submit this title text to our locally running DBPedia Spotlight web service to obtain named entities, which we add to the list of items to be monitored.

  • For each user mention, we include the ‘name’ associated with the mentioned handle.

Fig. 2.

An example of phrase extraction from tweets.

Figure 2 illustrates the overall process. The tweet text “Overall fatalities caused by the disease rose to 9053 from 8189 on Tuesday. The daily death toll reached a record 864 in #Spain.” does not mention coronavirus, which is an important entity for event detection, tracking, and monitoring. Our system fetches and processes the title of the webpage linked from the URL cited in the tweet (http://bit.ly/39xOhjZ). Concurrently, searching the tweet text for direct or indirect hyponyms of the event.n.01 synset of WordNet surfaces important words to track and monitor (‘cause’, ‘reach’, and ‘record’).
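The hyponym test used above to pick out event-related words can be sketched as follows. In the actual system the hyponym closure would come from WordNet (e.g. via NLTK's wordnet interface); the tiny hand-built taxonomy below is a stand-in so the example is self-contained:

```python
# Stand-in taxonomy: maps a synset name to its direct hyponyms.
# In practice this structure would be supplied by WordNet.
EVENT_TAXONOMY = {
    "event.n.01": {"happening.n.01", "social_event.n.01"},
    "happening.n.01": {"beginning.n.01", "ending.n.01"},
    "ending.n.01": {"death.n.01"},
}

def hyponyms_of(root, taxonomy):
    """Return the set of all direct and indirect hyponyms of `root`."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in taxonomy.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Words whose synset falls in this closure are kept for monitoring.
EVENT_SYNSETS = hyponyms_of("event.n.01", EVENT_TAXONOMY)

def is_event_word(synset_name):
    return synset_name in EVENT_SYNSETS
```

With the full WordNet taxonomy, this closure contains the 7,878 event-related words the system tracks.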

3.2 Extracting Bursty Phrases

After creating a set of phrases from the dataset as described above, we use a model proposed by Twevent [19] to find bursty phrases potentially indicative of an event. However, our model includes several factors in the burstiness score of a phrase beyond those considered in [19]. Below we outline our method.

Let \(N_t\) denote the number of tweets published within the current time window t and \(n_{i,t}\) be the number of tweets containing phrase i in t. The probability of observing i with a frequency \(n_{i,t}\) can be modeled by a Binomial distribution \(B(N_t , p_i)\), where \(p_i\) is the expected probability of observing phrase i in a random time window. Since \(N_t\) is very large for a Twitter stream, this distribution can be approximated by a Normal distribution with mean \(E[i|t] = N_t \times p_i\) and standard deviation \(\sigma (i|t) = \sqrt{ N_t \times p_i \times (1 - p_i)}\).

We consider a phrase i as bursty if \(n_{i,t} \ge E[i|t]\). The burstiness probability \(P_b(i, t)\) for phrase i in time window t is defined by [19] as:

$$\begin{aligned} P_b(i,t) = S\left( 10\times \frac{n_{i,t} - (E[i|t] + \sigma [i|t])}{\sigma [i|t]}\right) \end{aligned}$$
(1)

where \(S(\cdot )\) is the sigmoid function; the constant 10 is introduced because S(x) discriminates well for x in the range [−10, 10].
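A minimal sketch of Eq. (1), assuming S is the standard logistic sigmoid:

```python
import math

def sigmoid(x):
    # Numerically safe logistic function S(x).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def burstiness_prob(n_it, N_t, p_i):
    """P_b(i, t) of Eq. (1): how far the observed frequency n_it exceeds
    one standard deviation above the expected count, squashed by S."""
    expected = N_t * p_i                          # E[i|t]
    sigma = math.sqrt(N_t * p_i * (1.0 - p_i))    # sigma(i|t)
    return sigmoid(10.0 * (n_it - (expected + sigma)) / sigma)
```

Note that a phrase occurring exactly at its expected frequency gets an argument of exactly −10, i.e. a burstiness probability near zero; only frequencies well above one standard deviation over the mean approach 1.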

In addition to measuring the importance of a phrase by how many times it is used in the given time window, we assign further weights based on who authored the phrase and how many times it was retweeted, quoted, liked, and replied to. More formally, let \(u_{i,t}\) denote the number of distinct users authoring phrase i in time window t. Let the retweet count \(rt_{i,t}\) of a phrase i in t be the sum of the retweet counts of all tweets containing i in t. Similarly, let \(l_{i,t}\) be the like count, \(q_{i,t}\) the quote count, and \(rp_{i,t}\) the reply count. Also, to give greater importance to phrases used by Twitter users with a significant following, we assign the weight \(fc_{i,t}\), the follower count, defined as the sum of the follower counts of all users using phrase i in t. Incorporating all of the above, the burstiness weight \(w_b(i, t)\) for a phrase i in t is defined as:

$$\begin{aligned} w_b(i, t) = P_b(i,t) \cdot log(u_{i,t}) \cdot log(rt_{i,t}) \cdot log(l_{i,t}) \cdot log(q_{i,t}) \cdot log(rp_{i,t}) \cdot log(log(fc_{i,t})) \end{aligned}$$
(2)

After finding the burstiness weight for all phrases, the top K are selected in decreasing order of weight. Empirically, we find that decreasing K results in low recall, while increasing K brings in significant noise. [19] suggest setting K to \(\sqrt{N_t}\).
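Equation (2) and the top-K selection can be sketched as follows. The floors inside the logarithms are our own guard, since the paper does not say how zero or near-zero counts are handled:

```python
import math

def burstiness_weight(P_b, u, rt, l, q, rp, fc):
    """w_b(i, t) of Eq. (2). Counts are floored (our assumption) so that
    every log factor stays positive even for rarely shared phrases."""
    safe = lambda x, lo: math.log(max(x, lo))
    return (P_b * safe(u, 2) * safe(rt, 2) * safe(l, 2)
            * safe(q, 2) * safe(rp, 2)
            * math.log(safe(fc, 16)))   # double log of follower count

def select_bursty(phrase_weights, N_t):
    """Keep the top K phrases by weight, with K = sqrt(N_t) as in [19]."""
    K = int(math.sqrt(N_t))
    ranked = sorted(phrase_weights.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:K]]
```

For example, with \(N_t = 9\) only the three highest-weighted phrases survive the cut.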

3.3 Clustering Bursty Phrases

We adopt the approach by [19] without any modification to group bursty phrases to derive event-related clusters. Each time window is evenly split into M subwindows \(t={<}t_1, t_2, \ldots , t_M{>}\). Let \(n_t(i, m)\) be the tweet frequency of phrase i in the subwindow \(t_m\) and \(T_t(i, m)\) be the concatenation of all the tweets in the subwindow \(t_m\) that contain phrase i. The similarity \(sim_t(i_a,i_b)\) between phrases \(i_a\) and \(i_b\) in time window t is calculated as follows:

$$\begin{aligned} sim_t(i_a, i_b) = \sum _{m=1}^{M} w_t(i_a,m)w_t(i_b,m)\times sim(T_t(i_a,m),T_t(i_b,m)) \end{aligned}$$
(3)

where \(sim(T_t(i_a,m),T_t(i_b,m))\) is the TF-IDF similarity between \(T_t(i_a,m)\) and \(T_t(i_b,m)\), and \(w_t(i_a,m)\) is the fraction of the frequency of phrase \(i_a\) falling in subwindow \(t_m\), calculated as follows:

$$\begin{aligned} w_t(i,m) = \frac{n_t(i,m)}{n_{i,t}} \end{aligned}$$
(4)
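Equations (3) and (4) can be sketched as follows. Plain cosine similarity over term counts stands in for the TF-IDF similarity here, purely to keep the example self-contained:

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    # Stand-in for the TF-IDF similarity of Eq. (3): cosine over raw
    # term counts, so the sketch needs no corpus statistics.
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def phrase_similarity(freq_a, texts_a, freq_b, texts_b):
    """sim_t(i_a, i_b) of Eq. (3): freq_x[m] plays n_t(i, m) and
    texts_x[m] plays T_t(i, m), the concatenated tweets of phrase i
    in subwindow t_m."""
    total_a, total_b = sum(freq_a), sum(freq_b)
    sim = 0.0
    for m in range(len(freq_a)):
        w_a = freq_a[m] / total_a          # Eq. (4)
        w_b = freq_b[m] / total_b
        sim += w_a * w_b * cosine(texts_a[m], texts_b[m])
    return sim
```

Two phrases whose tweets share vocabulary in the same subwindows score high; phrases that burst in different subwindows or with disjoint vocabulary score near zero.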

Using the above similarity measure, the bursty phrases are clustered with a graph-based clustering algorithm [15]. In this method, all bursty phrases are considered as nodes, and initially all nodes are disconnected. An edge is added between phrases \(i_a\) and \(i_b\) if the k-nearest neighbors of \(i_a\) contain \(i_b\) and vice versa. Each connected component of the resulting graph is a candidate event cluster, i.e., a set of phrases related to a single event. Disconnected nodes (phrases) are discarded as non-significant.
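The mutual k-nearest-neighbour clustering just described can be sketched as:

```python
from itertools import combinations

def cluster_phrases(nodes, sim, k):
    """Mutual k-NN graph over bursty phrases; connected components are
    candidate events, singletons are discarded. `sim(a, b)` returns the
    phrase similarity of Eq. (3)."""
    def knn(a):
        ranked = sorted((o for o in nodes if o != a),
                        key=lambda o: sim(a, o), reverse=True)
        return set(ranked[:k])

    neighbours = {a: knn(a) for a in nodes}
    # Edge iff each phrase is among the other's k nearest neighbours.
    adj = {a: set() for a in nodes}
    for a, b in combinations(nodes, 2):
        if b in neighbours[a] and a in neighbours[b]:
            adj[a].add(b)
            adj[b].add(a)
    # Connected components via depth-first search.
    seen, clusters = set(), []
    for a in nodes:
        if a in seen:
            continue
        comp, stack = set(), [a]
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend(adj[x] - seen)
        if len(comp) > 1:     # drop disconnected phrases
            clusters.append(comp)
    return clusters
```

The mutuality requirement makes the edge criterion symmetric, so weakly related phrases that happen to be someone's nearest neighbour do not get pulled into a cluster.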

3.4 Event Characterization

We characterize an event as the group of phrases associated with it. To visualize this, we adopt the approach of MABED [10]: an interface lets us view the list of relevant tweets defining an event by clicking on the event name.

4 Experiments and Results

We use the Events2012 corpus collected by [23], which contains 120 million tweets and 506 labeled events. These tweets were collected from Oct 10 to Nov 7, 2012, and tweets containing more than 3 hashtags, 3 user mentions, or 2 URLs were discarded as spam [6]. However, not all tweets were still available, owing to account deactivation, privacy setting changes, etc. Our final dataset contains \(\sim \)38 million tweets, of which 127,356 are related to events. Note that these tweets are limited to a maximum of 140 characters, since the increased length (up to 280 characters) was only introduced in late 2017. In Table 1 we show some of the important events in the Events2012 [23] dataset that have more than 1,500 tweets associated with them.

Table 1. Top events in the Events2012 dataset, each with more than 1,500 tweets associated with it.

4.1 Pre-processing

We perform the following steps sequentially as part of preprocessing the tweet text and the titles of the webpages linked by the cited URLs in the tweet message:

  1. We use the Stanford tokenizer to tokenize the tweets.

  2. Words like cooooolll and awesommmme are sometimes used in tweets to emphasize emotion. We use a simple trick to normalize such occurrences. Let n denote the number of letters that occur three or more consecutive times in a given word. We first replace each run of three or more consecutive occurrences of the same character with two occurrences. We then generate \(\binom{n}{2}\) prototypes at edit distance 1 (delete operations only, deleting only repeated characters) and look each prototype up in the dictionary to find the intended word. For example, coooooolllll \(\rightarrow \) cooll \(\rightarrow \) cool.

  3. We use an acronym dictionary from an online resource to find expansions of tokens such as gr8, lol, rotfl, etc.
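Step 2 of the pre-processing above can be sketched as follows; the dictionary here is a stand-in set, and trying deletion combinations beyond distance 1 is our generalization of the lookup:

```python
import re
from itertools import combinations

def normalize(word, dictionary):
    """Normalize elongated words like 'coooooolllll' -> 'cool'."""
    # Step 1: collapse every run of 3+ identical characters to 2.
    collapsed = re.sub(r'(.)\1{2,}', r'\1\1', word)
    if collapsed in dictionary:
        return collapsed
    # Step 2: generate prototypes by deleting one copy of each doubled
    # character, in every combination, and look them up.
    runs = [m.start() for m in re.finditer(r'(.)\1', collapsed)]
    for k in range(1, len(runs) + 1):
        for combo in combinations(runs, k):
            proto = ''.join(c for i, c in enumerate(collapsed)
                            if i not in combo)
            if proto in dictionary:
                return proto
    return collapsed
```

On the paper's example, 'coooooolllll' first collapses to 'cooll', then the prototype that deletes one 'l' matches 'cool' in the dictionary.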

After pre-processing, we submit the text to the DBPedia Spotlight [24] web service to obtain a list of named entities. We chose DBPedia Spotlight over other named-entity recognition tools (such as Stanford NER [21], OpenNLP [32], NLTK [20], etc.) because, in our experience, those tools yield phrases that introduce noise into the resulting system.
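A call to a DBpedia Spotlight `/annotate` endpoint can be sketched as follows. The local service URL, the confidence threshold, and the response fields (`Resources`, `@surfaceForm`) follow the public Spotlight REST interface, but should be treated as assumptions rather than the paper's exact configuration:

```python
import json
import urllib.parse
import urllib.request

# Illustrative URL of a locally running Spotlight instance.
SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"

def build_annotate_request(text, confidence=0.5):
    """Build a GET request for the Spotlight /annotate endpoint."""
    query = urllib.parse.urlencode({"text": text,
                                    "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"})

def entities(text):
    """Surface forms of the entities Spotlight finds in `text`.
    Performs a network call, so it needs a running Spotlight service."""
    with urllib.request.urlopen(build_annotate_request(text)) as resp:
        body = json.load(resp)
    return [r["@surfaceForm"] for r in body.get("Resources", [])]
```

The surface forms returned for a tweet (or a web page title) become the named-entity portion of the monitored phrase list.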

4.2 Evaluation Metrics

As evaluation metrics, we use precision, recall, and DERate (duplicate event rate, proposed by [19]). Precision is the fraction of detected events that correspond to a realistic event. Recall is the fraction of the manually labeled ground-truth events that are detected. If two detected events relate to the same realistic event within the same time window, both count as correct for precision, but only one realistic event counts toward recall. [19] therefore defined DERate as the percentage of detected events that are duplicates of an already detected event.
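Under one reading of these definitions (our interpretation; the paper gives no formulas), the three metrics can be computed as:

```python
def metrics(detected_to_real, ground_truth):
    """detected_to_real maps each detected event to the ground-truth
    event it matches (None if spurious). Duplicate detections count
    toward precision but only once toward recall; DERate is the
    percentage of detections that duplicate an earlier one."""
    matched = [r for r in detected_to_real if r is not None]
    precision = len(matched) / len(detected_to_real)
    recall = len(set(matched)) / len(ground_truth)
    derate = 100.0 * (len(matched) - len(set(matched))) / len(detected_to_real)
    return precision, recall, derate
```

For instance, four detections matching real events A, A, B, and nothing give precision 0.75, recall 2/3 against three ground-truth events, and a DERate of 25%.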

4.3 Baseline Methods

To evaluate our proposal, we compare our approach with the most closely related works: EDCoW [31], Twevent [19], NEED [22], and MABED [10]. Table 2 shows the performance of our system relative to these state-of-the-art approaches. For MABED, we modified their publicly available code to measure anomaly using hashtags instead of user mentions (referred to as MABED+ht). Also, to observe the effect of the WordNet event-related words, we conducted two sets of experiments, where MaTED-WN denotes our system without the WordNet words. We share our source code and the dataset used online.

4.4 Results and Discussion

Table 2 shows the results we obtained compared to the baseline methods.

Table 2. Results on Events2012 dataset

Table 3 shows some of the events detected by MaTED that were not detected by any of the above systems.

Table 3. A sample of events that only our system could extract.

Several parameters impact the performance of the resulting system, and the results shown in Table 2 are obtained with an optimal combination of them. It is evident from Table 2 that the performance of existing bursty-segment-detection systems is enhanced by the social and Twitter-specific features incorporated in our system. In particular, we notice a significant improvement in recall from including the title texts of the web pages pointed to by the URLs in tweets. A tweet is often a comment on the web page being shared, so including the title text gives the system better context for the tweet. Further, because of misspellings, DBPedia Spotlight sometimes fails to find a named entity; in such cases, tracking event-specific words from WordNet (7,878 words in total) helps identify an event. For example, in Table 3, the event on 16.10.2012 about Ford would have been missed had the word ‘fault’ not been included in the list of key phrases to consider. Better results are observed for MABED+ht than for the original MABED, owing to the fact that hashtags are better indicators of events than user mentions. We attribute our system’s lower precision relative to [19] to the inclusion of many more event-specific phrases from web page titles, hashtags, and WordNet event words, which yields higher recall at a slight loss of precision (0.79 as opposed to 0.81 for [19]). Finally, we also noticed that many events were not reported in the crowd-sourced ground truth of the Events2012 corpus; the event on 15.10.2012 about the Amanda Todd suicide is one of many examples we found that were not included.

5 Conclusion

The phenomenal growth of online social network services generates massive amounts of data, posing many challenges owing to the volume, variety, velocity, and veracity of the data. Concurrently, methods to detect events from social streams in an efficient, accurate, and timely manner are also evolving. In this paper, we built on an existing system, Twevent [19], by incorporating the external knowledge bases DBPedia and WordNet, together with exploiting the user mentions and hashtags contained in Twitter messages, for efficient event detection. In addition, harnessing the fact that a tweet is often just a remark on the news or information shared via a cited URL, we improve event detection performance by detecting and tracking important event-related phrases in the titles of the linked web pages. We examined the effect of adding our novel features incrementally and conclude that our model outperforms the state-of-the-art on the benchmark Events2012 [23] corpus. Future research includes investigating the use of distributed semantics (e.g., word embeddings) within a larger deep learning framework to achieve higher accuracy in event detection from massive-scale collections of social media messages.