# Topic detection and tracking on heterogeneous information

## Abstract

Given the proliferation of social media and the abundance of news feeds, a substantial amount of real-time content is distributed through disparate sources, which makes it increasingly difficult to glean and distill useful information. Although combining heterogeneous sources for topic detection has gained attention from several research communities, most existing approaches fail to consider the interaction among different sources and their intertwined temporal dynamics. To address this concern, we studied the dynamics of topics from heterogeneous sources by exploiting both their individual properties (including temporal features) and their inter-relationships. We first implemented a heterogeneous topic model that enables topic–topic correspondence between the sources by iteratively updating its topic–word distribution. To capture temporal dynamics, the topics are then correlated with a time-dependent function that can characterise their social response and popularity over time. We extensively evaluate the proposed approach and compare it with state-of-the-art techniques on heterogeneous collections. Experimental results demonstrate that our approach can significantly outperform the existing ones.

## Keywords

Topic detection, Heterogeneous sources, Temporal dynamics, Social response, Topic importance

## 1 Introduction

Social media, such as Twitter and Facebook, have been widely used for communicating breaking news, sharing eyewitness accounts, and even organising flash mobs. Users of these websites have become accustomed to receiving timely updates on important events. For example, Twitter was heavily used in numerous international events, such as the Ukrainian crisis (2014) and the Malaysia Airlines Flight 370 crisis (2015). From a news consumer’s perspective, it makes sense to combine social media with traditional news outlets, e.g., BBC news, for timely and effective news consumption. However, the latter have different temporal dynamics from the former, which calls for a deep understanding of the interaction between the new and old sources of news.

Combining heterogeneous sources of news has been investigated by several research communities (cf. Section 2). However, existing works mostly merge documents from all sources into a single collection, and then apply topic modelling techniques to it for detecting common topics. This may bias the result in favour of the source with a high publication frequency. Furthermore, the heterogeneity that characterises each source may not be maintained. For example, the Twitter data stream is distinctively biased towards current topics and the temporal activity of users. This means that effective topic modelling also needs a computational treatment of users’ social behaviours (such as author information and the number of retweets). Alternatively, running the existing topic models on each source separately can preserve the characteristics of each source, but makes it difficult to capture a common topic distribution and the interaction among different sources. To solve the aforementioned problems, we present a heterogeneous topic model that can combine multiple disparate sources in a complementary manner by assuming a variable of common-topic distribution for both Twitter and news collections.

Furthermore, people would like to know not only the types of topic that can be found in these disparate data sources but also their temporal dynamics and their importance. However, the dynamics of most topic streams are intertwined with each other across sources such that their impact is not easily recognisable. To determine the impact of a topic, it is critical to consider the evolution of the aggregated social response from social media (i.e., Twitter), in addition to the temporal dynamics of news media. However, news media can follow a news cycle of their own (Tsytsarau et al. 2014) while keeping a growth shape, so that the burst shape of publication is not always consistent with the beginning of the topic. Therefore, we used a deconvolution approach (a well-known technique in audio and signal processing (Kirkeby et al. 1998; Mallat 1999)) that can address these concerns by using a special compound function that considers both topic importance and its social response in social media.

In this study, we addressed the problem of *Topic Detection and Tracking* (TDT) from heterogeneous sources. Our study is based on recent advances in both topic modelling and information cascading in social media. In particular, we designed a heterogeneous topic model that allows information from disparate sources to communicate with each other while maintaining the properties of each source in a unified framework. For temporal modelling, we proposed a compound function through convolution, which optimally balances the topic importance and its social response. By combining these two models, we effectively modelled temporal dynamics from disparate sources in a principled manner.

The rest of the paper is organised as follows. Section 2 describes background and related work. In Sections 3 and 4, we discuss our model in detail. Section 5 provides experimental results on two real-world datasets. We present our conclusion and future work in Section 6.

## 2 Related work

There are approaches that tackle sub-tasks of our problem in various domains; however, they cannot be combined to solve the problem that our model solves. To the best of our knowledge, there is no existing work that is capable of automatically detecting and tracking topics from heterogeneous sources while simultaneously preserving the properties of each source. The task of this paper can be loosely organised into two independent sub-tasks: (1) topic detection from heterogeneous sources and (2) characterising their temporal dynamics. In this section, we review these two lines of related work.

### 2.1 Topic detection from heterogeneous sources

One of the tasks included in Topic Detection and Tracking (TDT) (Fiscus & Doddington 2002) is topic detection, where systems cluster streamed stories into bins depending on the topics being discussed. Topic modelling techniques, such as PLSA (Hofmann 2001) and LDA (Blei et al. 2003), have been shown to be effective for topic detection. In the seminal work on the Online-LDA model (Alsumait et al. 2008), the authors update the LDA incrementally with information inferred from the new stream of data. Lau et al. (2012) subsequently demonstrated the effectiveness of Online-LDA for Twitter data streams.

However, there has not been extensive research on topic detection from heterogeneous sources. Zhai et al. (2004) proposed a cross-collection mixture model to detect common topics and local ones, respectively. The state-of-the-art approach (Hong et al. 2011) utilises the collection model for mining from multiple sources, together with a meme-tracking model that iteratively updates the hyper-parameter controlling the document-topic distribution to capture the temporal dynamics of the topic. In the collection model, a word belongs to either the local topic or the common topic, with the choice drawn from a Bernoulli distribution. Common topics are obtained by merging all documents from all sources into a single collection and then applying LDA to it. Local topics are computed by applying LDA to each source individually. However, this approach assumes that there is no correspondence between local topics across different sources, and that no information is exchanged between the local topics and the common ones, whereas in real-world scenarios information from multiple sources constantly interacts as the topic evolves. To address this problem, Ghosh & Asur (2013) proposed a Source-LDA model to detect topics from multiple sources with the aim of incorporating source interactions, which is somewhat similar to our idea. However, there are stark differences between their work and ours. First, their model assumes that there is no order for the documents in the collection, hence the temporal dynamics of each source are completely ignored. Second, since the same LDA model is applied over different sources, they did not exploit the properties that characterise each data source, whereas we use the Author Topic Model and LDA for the Twitter and News sources, respectively (cf. Section 3).

### 2.2 Characterizing temporal dynamics and social response

In the seminal work on dynamic topic modelling (Blei & Lafferty 2006), each topic defines a multinomial distribution over a set of terms. Therefore, for each word of each document, a topic is drawn from the mixture and a term is subsequently drawn from the multinomial distribution corresponding to that topic. This has led to the recent development of incorporating temporal dynamics into topic models (e.g., Wang et al. 2012; Hong et al. 2011; Masada et al. 2009; Wang and McCallum 2006; Dubey et al. 2013). These models enable us to gain insight into datasets with temporal changes in a convenient way and open future directions for utilising these models in a more general fashion. However, their analysis was conducted on an academic dataset. To investigate the effectiveness of the dynamic topic model, Leskovec et al. (2009) proposed a framework to capture the textual variants over different phrases. It is based on two assumptions for the interaction of meme sources: imitation and recency. The imitation hypothesis assumes that news sources are more likely to publish on events that have already seen a large volume of publications. The recency hypothesis marks the tendency to publish more on recent events. While effective, their work only focused on the temporal dynamics of one source, namely the news outlets, whereas in this paper two different types of sources are considered simultaneously.

In addition, while previous work has investigated temporal dynamics in a heterogeneous context (Hong et al. 2011), a deeper understanding of topic dynamics entails the extraction of burst shapes and the modelling of social response (Tsytsarau et al. 2014). This requirement is important as publication volume often contains background information, which may mask individual patterns of topics. Another important factor is topic importance, which was first proposed in Cha et al. (2010). The authors studied influence within Twitter and compared three different measures of influence: indegree, retweets, and user mentions. They discovered that retweets and user mentions are better measures of topic influence than indegree. Unlike the previous studies, in this paper we aim to incorporate both topic importance and social response into a unified topic modelling framework.

## 3 Heterogeneous topic model

In this section, we will explain the heterogeneous topic model, which can detect topics in heterogeneous sources while preserving the properties of each source. Throughout the paper, we have used the general term “document” or “feed” to cover the basic text collections. For example, in the context of news media, a feed is a news article; whereas for Twitter, a feed represents a tweet message.

### 3.1 Model description

In the heterogeneous topic model (HTM), topic *i* of any source *j* corresponds to topic *i* of another source *k*, where topic *i* conforms to the properties of the local source. With the notation given in Table 1, Fig. 1 illustrates the graphical model for HTM, which blends two topic modelling techniques, namely, the Author Topic Model (ATM) (Rosen-Zvi et al. 2004) and Latent Dirichlet Allocation (LDA).

Table 1 Notations used in HTM

Symbol | Description |
---|---|
| Number of words in the collection (Twitter and News) |
| Number of common words that appear in both collections |
| Vocabulary size |
| Number of topics |
| Number of authors in Twitter |
| Number of words assigned to the topic of a word |
| Number of words assigned to the topic of an author |
\(n_{d}^{(t)}\) | Frequency that topic *t* is assigned to a word in document *d* |
\(n_{t}^{(w)}\) | Frequency that word *w* is assigned to topic *t* |
\(w^{(k)}_{di}\) | Words in Twitter document |
\(w^{(n)}_{di}\) | Words in News document |
| Topic assignment |
| Topic assignment for word |
| Author assignments |
| Author assignment for word |
\(\mathcal {A}\) | Authors of the corpus in Twitter |
| Dirichlet prior for Twitter |
| Dirichlet prior of News document |
| Dirichlet prior for topic |
| Dirichlet prior |
| Dirichlet prior of word |
| Probabilities of words given on topics |

The various probability distributions we can learn from the HTM model characterise the different factors of each source that can affect the topics. For the generation of content, each word \(w^{(n)}\) in news articles is only associated with a topic *z*, and each word \(w^{(k)}\) in tweets is related to two latent variables, namely, an author *a* and a topic *z*. The communication between the topics of different sources is governed by a parameter, *η*, which represents the common topic distribution. *β* is the prior distribution of *η*. The observed variables include the author names of tweets, the words in the Twitter dataset, and the words in the News dataset; the rest are all unobserved variables. Notice that the *D* on the left side of the figure represents the Twitter documents and the *D* on the right side of the figure represents the News documents.

Conditioned on the set of authors from Twitter and its distribution over topics, the generative process of HTM is summarised in Algorithm 1, where \(variable^{(k)}\) represents a variable of Twitter and \(variable^{(n)}\) denotes a variable of news. The local words are the ones that appear in only a single source, whereas the common words are the ones that occur in all the sources. Under the generative process, each common topic *z* in News and Twitter is drawn independently when conditioned on Θ and Φ, respectively.
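The following is a minimal sketch of such a generative process, not a reproduction of Algorithm 1. It assumes the standard LDA process for news (one topic mixture per document), the standard Author Topic Model process for tweets (one topic mixture per author, with the author drawn uniformly), and a single shared set of topic-word distributions standing in for *η*. All function names, parameter names, and default values are hypothetical.

```python
import random

def dirichlet(rng, concentration, size):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def draw(rng, probs):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(num_topics=3, vocab_size=8, num_authors=2, doc_len=5, seed=0,
             alpha_news=0.5, alpha_twitter=0.5, beta=0.5):
    rng = random.Random(seed)
    # Shared topic-word distributions (assumed role of eta), with prior beta.
    eta = [dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]

    # News side (LDA): each word has a topic drawn from the document mixture.
    theta_news = dirichlet(rng, alpha_news, num_topics)
    news_doc = []
    for _ in range(doc_len):
        z = draw(rng, theta_news)
        news_doc.append(draw(rng, eta[z]))

    # Twitter side (ATM): each word has an author and a topic drawn from
    # that author's mixture.
    theta_author = [dirichlet(rng, alpha_twitter, num_topics)
                    for _ in range(num_authors)]
    tweet = []
    for _ in range(doc_len):
        a = rng.randrange(num_authors)  # author chosen uniformly (assumption)
        z = draw(rng, theta_author[a])
        tweet.append((a, draw(rng, eta[z])))
    return news_doc, tweet

news_doc, tweet = generate()
```

Sharing `eta` between the two loops is what lets topic *i* mean the same thing in both sources, while the separate mixtures keep the source-specific structure.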

In the sampling equations, \(z_{-di}\) and \(w_{-di}\) stand for the vectors of topic assignments and word observations excluding the *i*th word of news document *d*, and \(z_{-dj}\) and \(w_{-dj}\) stand for the vectors of topic assignments and word observations excluding the *j*th word of tweet *d*.

Note that the sampled words from one collection may not be observed in the other collection. In such cases, the prior probability of the topic over the word is set to one.

### 3.2 Model fitting via Gibbs sampling

*η* is a joint distribution of two mixtures. As a result, we need to compute the joint distribution of \(w^{(k)}\) and \(w^{(n)}\) in the Gibbs sampling process. In the posterior distributions for Gibbs sampling in HTM, \((C^{WT})^{(n)}\) and \((C^{WT})^{(k)}\) denote the samples of the topic-term matrix in which each term is observed only in News and only in Twitter, respectively; \((C^{WT})\) is the sample of the topic-term matrix in which each word can be observed in the collections; and \(\beta _{t}^{(t)}\) denotes the local topics that belong to Twitter. Since the Dirichlet distribution is conjugate to the multinomial distribution, the posterior means of \(\mathcal {A}\), Θ and Φ given **x**, **z**, \(\mathcal {\textbf {w}}\), \(\mathcal {D}^{\text {train}}\), \(\alpha^{(n)}\), \(\alpha^{(k)}\) and *β* can be obtained from a sample *s* drawn by the Gibbs sampler over the full collection. The posteriors \(\mathcal {A}\), Θ and Φ correspond to the author distribution on topics, the topic distribution on words, and the document distribution on topics, respectively.
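As an illustration of how Dirichlet-multinomial conjugacy turns sampled counts into posterior means, the sketch below applies the standard smoothed estimate with a symmetric prior. It is not the paper's exact estimator: the function name, the count matrix, and the prior value are hypothetical.

```python
def posterior_mean(counts, prior):
    """Row-wise Dirichlet posterior mean under a symmetric prior.

    Each row is normalised as (count + prior) / (row total + K * prior),
    where K is the number of columns.
    """
    result = []
    for row in counts:
        denom = sum(row) + len(row) * prior
        result.append([(c + prior) / denom for c in row])
    return result

# Hypothetical counts from one Gibbs sample: rows are topics, columns are words.
topic_word_counts = [[3, 1, 0], [0, 2, 2]]
topic_word = posterior_mean(topic_word_counts, prior=0.01)
```

The same helper applies to author-topic or document-topic counts by passing the corresponding count matrix and prior.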

## 4 Modelling temporal dynamics

In this section, we review a temporal model for topics, introduced in Tsytsarau et al. (2014) and present an alternate derivation. We start with the description of the basic social response function and our representation of topic importance. Then we introduce a deconvolution approach over the time series of news and Twitter data, in order to extract important properties of topics.

### 4.1 Modelling impacting topics

In this model, *srf*(*t*) is the social response function, and *e*(*t*, *r*), which reflects the importance of the topic during time *t*, is the joint function of the actual topic sequence *t* and the number of retweets *r*:

\[ e(t,r) = \begin{cases} \ln(r)\,a\,t, & 0 < t < t_{0}, \\ \ln(r)\,(a+b)\,t_{0} - b\,t, & t > t_{0}, \end{cases} \]

where *a* is the buildup rate and *b* is the decay rate. The intuition behind *e*(*t*, *r*) is that certain topics should have a better chance of being selected, since they have a higher popularity and a larger volume of social response. For example, during the “heat wave action”, the hottest day in the UK in twelve years, news articles and Twitter messages were more inclined to talk about weather than politics. Furthermore, we observe that hot topics usually correspond to a higher retweet number. The form of (13) has been demonstrated to be effective for capturing spikes of news articles and Tweets (Hong et al. 2011).

To model the shape of *srf*(*t*), we propose to employ a family of normalized decaying functions (Asur et al. 2011).
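To make the convolution framing concrete, the sketch below convolves a toy importance sequence with a normalised decaying response kernel. It assumes that observed volume is the discrete convolution of topic importance with the social response function. The exponential kernel is chosen only for illustration and is not claimed to be one of the paper's decay functions. All names and values are hypothetical.

```python
import math

def normalised_exponential_kernel(length, rate):
    """Decaying weights exp(-rate * k), scaled so that they sum to one."""
    weights = [math.exp(-rate * k) for k in range(length)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of the signal."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += signal[t - k] * w
        out.append(acc)
    return out

# Toy importance sequence: linear build-up followed by linear decay.
importance = [0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
kernel = normalised_exponential_kernel(length=4, rate=1.0)
observed = convolve(importance, kernel)
```

Because the kernel is normalised, the response spreads the importance over later time steps without changing its total mass, which is what makes recovering the importance by deconvolution meaningful.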

### 4.2 Topic deconvolution

Let each \(\beta_{w}\) in the Dirichlet parameter *β* be associated with a time-dependent function \(f_{k}(t)\), the deconvolution model described in Section 4.2. However, if we naively associate \(\beta_{k}\) with \(f_{k}\), the model may assume that the starting point of time *t* for all topics is timestamp 0. In fact, different topics have different starting points \(t_{0}\). We therefore modify the function so that \(t_{0}\) is the starting timestamp of the topic, \(q_{k}\) indicates how quickly the topic rises to its peak, and \(N_{k}\) is the noise level of the topic. The absolute value function guarantees that the time-dependent part is only active when *t* is larger than \(t_{0}\). \(\mu (t-{t_{0}^{k}})\) is a boolean function that is 1 for \(t > t_{0}\) and 0 otherwise. The intuition behind this equation is that the prior knowledge of each topic is fixed over time (by the “noise” level \(N_{k}\)). The crux of the problem is to estimate the values of these three hyper-parameters from the data.

The procedure for parameter estimation is summarised in Algorithm 2. Generally, we integrate the deconvolution function with Gibbs sampling into an EM framework (e.g., similar to Doyle & Elkan 2009). In the E-step, we gather topic assignments and useful counts by Gibbs sampling using (3). In the M-step, we optimise the proposed deconvolution functions to obtain the updated hyper-parameters for the next iteration. More specifically, the first step is to calculate the Dirichlet parameters *β* from the word frequencies observed by the Gibbs sampler. This can be done in several ways (Minka 2000). We use Newton’s method in this step, where the step parameter *ξ* can be interpreted as a controlling factor for smoothing the topic distribution among the common topics. The second step is to use these *β* values to fit the deconvolution function (19) and then use the parameters from the fitted deconvolution function as initial values to fit our temporal dynamic function (19).

## 5 Experiment and results

1. How can the performance of topic modelling techniques be evaluated in the intertwined heterogeneous context?
2. How effective is our proposed heterogeneous topic model compared to other topic models in the above context?
3. Can the temporal dynamics of each source be exploited effectively to further enhance the performance of topic modelling?

### 5.1 Datasets and metrics

To construct two parallel datasets, we crawled 30 million tweets from 786,823 users in the region of Scotland over one month, from hour 1 (the first hour of April 1, 2015) to hour 720 (the last hour of April 30, 2015). Each of these tweets matches at least one of a set of rules that we call Scotland-related information centres, which consist of person names, places, and organisations that share information related to Scotland.

In addition, over the entire month, we crawled 224,272 unique resolved URLs. The News dataset is obtained through the Boilerpipe program.^{1} The total dataset size is 4GB and essentially includes complete online coverage: we have all mainstream media sites plus 100,000 million blogs, forums, and other media sites. All our experiments are based on these two datasets.

For each web page, we collected (1) The title of the web page. (2) Text content of the web page (after removing tags, js, etc.) (3) Tweets linking to the page. For both the Twitter and the news media datasets, we first removed all the stop words. Next, the words with a document frequency of less than 10 and words that appeared in more than 70% of the tweets (news feeds) were also removed. Finally, for Twitter data, we further removed tweets with fewer than three words and all the users with fewer than 8 tweets.
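The filtering steps above can be sketched as follows. This is a minimal illustration that assumes already-tokenised input. The function names are hypothetical, and counting tweets per user after the short tweets have been dropped is an assumption, since the text does not state the order.

```python
from collections import Counter

def filter_vocabulary(docs, stop_words, min_df=10, max_df_ratio=0.7):
    """Remove stop words, then words that are too rare or too frequent.

    docs is a list of token lists. A word is kept if its document
    frequency is at least min_df and at most max_df_ratio of all documents.
    """
    docs = [[w for w in doc if w not in stop_words] for doc in docs]
    doc_freq = Counter(w for doc in docs for w in set(doc))
    limit = max_df_ratio * len(docs)
    keep = {w for w, n in doc_freq.items() if min_df <= n <= limit}
    return [[w for w in doc if w in keep] for doc in docs]

def filter_tweets(tweets, min_words=3, min_tweets_per_user=8):
    """Drop short tweets, then drop users with too few remaining tweets.

    tweets is a list of (user, tokens) pairs.
    """
    tweets = [(u, toks) for u, toks in tweets if len(toks) >= min_words]
    per_user = Counter(u for u, _ in tweets)
    return [(u, toks) for u, toks in tweets
            if per_user[u] >= min_tweets_per_user]
```

For example, `filter_vocabulary(docs, stop_words={"the"})` applies the thresholds stated in the text, and smaller thresholds can be passed when experimenting on a toy corpus.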

In the perplexity measure, \(w_{d,i}\) represents the *i*th word in feed *d*. Note that the perplexity is defined by summing over the feeds.

The topic perplexity is defined in terms of \(\eta_{tw}\), and *T* is the total number of topics. The proposed topic perplexity can explain how well each topic model can predict an unseen topic. A low value of perplexity is an indicator of good performance for the topic model that is being evaluated.
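For orientation, the sketch below computes the standard held-out perplexity of a topic model, summing log-likelihoods over feeds. It is the textbook definition rather than the paper's topic perplexity variant, and the function name and argument layout are hypothetical.

```python
import math

def perplexity(test_docs, theta, phi):
    """Standard held-out perplexity.

    test_docs[d] is a list of word ids, theta[d][t] = p(t | d),
    and phi[t][w] = p(w | t).
    """
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # Marginalise over topics to get p(w | d).
            p_w = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_likelihood += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_likelihood / n_words)
```

A model that spreads probability uniformly over a vocabulary of size *V* has perplexity *V*, so lower values indicate that the model concentrates probability on the words that actually occur.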

Again, the smaller the value of the entropy measure, the better the topics, since a small value indicates better discriminative power.

1. Source-LDA, which identifies the latent topics by leveraging the distribution of each of the individual sources. Notice that we use Twitter as the main source, as it has been reported in Ghosh & Asur (2013) that choosing Twitter as the main source achieves superior performance over other options.
2. Heterogeneous Topic Model (HTM), the model described in Section 3, which applies the Author Topic Model (ATM) to the Twitter dataset and LDA to the News dataset, so that the properties of each source can be preserved.
3. Deconvolution Model (DM), which simply integrates the deconvolution model described in Section 4.2 into the Author Topic Model (ATM) of the Twitter dataset.
4. Heterogeneous Topic Model with Temporal Dynamics (HTMT) (cf. Section 4.2), which incorporates the deconvolution model into HTM.

### 5.2 Parameter setting

We randomly sample 80% of the data as the training data and use the remaining 20% as the test data. All models are trained on the same training set and evaluated using the same test set. In the training phase we obtain the topic-term distribution, the number of topics, and all other hyper-parameters. In the testing phase we fix them and run 200 Gibbs-sampling iterations for each feed in the test set, obtaining \(\theta_{dt}\) and \(\phi_{ta}\). It is well known that, in general, different datasets require different numbers of topics to achieve the best topic modelling effect. Hence, we tried the topic models with different values of *T*. The source code is made open to the research community as online supplementary material.^{2}

For DM, we only use Twitter messages for clustering with no additional news information. For HTM, we use symmetric Dirichlet priors in the LDA estimation with *α* = 50/*T* and *β* = 0.01, which are common settings in the literature. For HTMT, both HTM and the Deconvolution Model need to be tuned. Usually, deconvolution with small decay parameters is the best setting. However, only a high level of deconvolution helps to restore the original topic importance, as well as to depict its temporal dynamics using the desired model. Higher than necessary deconvolution may lead to smaller topics (which are caused by the prior larger neighbours), so we need to apply the lowest deconvolution possible while at the same time maintaining an adequate level of deconvolution, using the parameter estimation method described in the next section. The parameter settings of the Deconvolution Model were set to be identical to those in Tsytsarau et al. (2014).

To find the optimal value for the number of topics *T*, we ran the different topic models on the training dataset. In Hong et al. (2011), the best performance is achieved when *K* = 50. As shown in Figs. 2, 3 and 4, however, the optimal performance in our setting is clearly achieved when the number of topics is equal to 300, for both metrics. One possible explanation is that our dataset is substantially larger than theirs and therefore requires a larger number of topics to model it.

From the training stage, one can also see that the Heterogeneous Topic Model (HTM) brings a substantial performance gain compared to the state-of-the-art approach, Source-LDA. However, the proposed Deconvolution Model (DM) can only achieve performance comparable to Source-LDA when applied alone. We conjecture that this is because DM is largely dominated by the local topics of Twitter (due to its high volume). More importantly, when combining DM with HTM, we see that our proposed approach HTMT, which considers both the heterogeneous properties of each source and the intertwined temporal dynamics, outperforms all the other approaches, irrespective of the topic number and the evaluation metrics.

### 5.3 Topic model analysis and case study

Table 2 The performance of different methods on (a) Twitter and (b) News datasets (-*-* and -* indicate the statistical significance of performance decrease from that of SGTM with p-value < 0.01 and p-value < 0.05, respectively)

| Source-LDA | HTM | DM | HTMT |
---|---|---|---|---|
(a) Twitter | ||||
| 11.375-*-* | 10.596-* | 9.642-* | 8.798 |
| 7269.529-*-* | 5924.371-*-* | 4943.582-* | 4136.335 |
(b) News | ||||
| 9.374-*-* | 8.647-*-* | 8.551-* | 8.295 |
| 6654.247-*-* | 5753.293-*-* | 4439.986-* | 3782.961 |

We first select a set of topics according to *p*(*z*|*d*), which represents the average of the expected value of topic *k* appearing in a feed on both collections. For each topic in the set, we then select the terms with the highest *p*(*w*|*z*) value. Note that all these models are trained in an unsupervised fashion with *k* = 300, and all the other settings are the same as in the above experiments; the decay function is set to hyperbolic since it exhibits the best performance on the test dataset (cf. Figs. 2, 3 and 4). The text has been preprocessed by case-folding and stop-word removal (using a standard list of 418 common words). Table 3 shows the most representative words of topics generated by Source-LDA, HTM, DM, and HTMT, respectively. For topic 1, although different models select slightly different terms, all these terms can describe the corresponding topic to some extent. For topic 2 (Glasgow), however, the words “commercial” and “jobs” of HTMT are more telling than “food” derived by Source-LDA, and “beautiful” and “smart” derived by HTM. Similar subtle differences can be found for topics 3 and 4 as well. Intuitively, HTMT selects more related terms for each topic than the other methods, which shows the better performance of HTMT obtained by considering both the heterogeneous structure and the temporal dynamics information. This observation answers the last research question: our HTMT framework supersedes HTM by providing more valuable and reinforced information.
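Selecting the representative terms of a topic by *p*(*w*|*z*) amounts to a simple ranking, sketched below. The function name and the toy vocabulary are hypothetical, and the probabilities are made up for illustration.

```python
def top_terms(word_probs, vocabulary, n=5):
    """Return the n vocabulary terms with the highest p(w | z) for one topic."""
    ranked = sorted(range(len(word_probs)),
                    key=lambda w: word_probs[w], reverse=True)
    return [vocabulary[w] for w in ranked[:n]]

# Toy example: one topic over a four-word vocabulary.
vocabulary = ["rail", "train", "loch", "vote"]
word_probs = [0.5, 0.3, 0.15, 0.05]
representative = top_terms(word_probs, vocabulary, n=2)
```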

Table 3 The representative terms generated by Source-LDA, HTM, DM, and HTMT models

Topic 1 (VoteSNP) | Topic 2 (Glasgow) | Topic 3 (Scotrail) | Topic 4 (Loch Ness) | |||||
---|---|---|---|---|---|---|---|---|

HTM Source-LDA | VoteSNP | wish | Glasgow | Sccotish | Scotrail | transport | loch | |

Sccotish | support | city | | Sccotish | franchise | nessie | | |

voting | UK | UK | Scotland | route | | monster | deep | |

party | | Scotland | people | train | disruption | inverness | | |

Scotland | member | council | project | rail | lines | visit | life | |

VoteSNP | | Glasgow | merchant | Scotrail | transport | loch | water | |

Scotland | support | Edinburgh | investment | Sccotish | rail | lochness | beautiful | |

Scottish | UK | Scotland | Scotland | station | fares | monster | tour | |

party | delivering | Scotish | | route | | inverness | | |

voting | | council | | train | Edinburgh | | lodge | |

DM | VoteSNP | wish | Glasgow | Sccotish | Scotrail | transport | loch | lodge |

Scotland | | city | London | Sccotish | franchise | lochness | tour | |

Scottish | support | UK | council | route | company | nessie | highland | |

voting | everyone | Scotland | project | train | Edinburgh | monster | | |

UK | party | Edinburgh | merchant | | fare | inverness | | |

HTMT | VoteSNP | | Glasgow | Sccotish | Scotrail | rail | loch | |

Scotland | support | Scotland | | Sccotish | | nessie | highland | |

Scottish | everyone | Edinburgh | Scotland | transportation | | monster | | |

Glasgow | Edinburgh | city | | train | disruption | inverness | | |

party | | council | investment | | franchise | visit | |

### 5.4 Analysis on temporal dynamics

Hashtags, a type of community convention that starts with a “#” sign, have been extensively used as annotations to represent events and topics on Twitter. We select several hashtags that can act as indicators for certain events, where each hashtag is clearly associated with some events in April 2015. More specifically, we choose #Scotrail for “Scottish railway”, and wish to see whether the events can be discovered by different models and how well these models can represent them. We believe these hashtags represent a large range of social events and are therefore representative. A natural question is whether the model can identify topics that reflect the events behind the hashtags. We map hashtags onto the topics obtained by the models, and the top-ranked terms in these topics are examined to see whether the corresponding terms have any relationship with the underlying events.

We map a hashtag term *w* to topics through Bayes’ rule, \(p(z|w) \propto p(w|z)\,p(z)\), where *p*(*w*|*z*) is provided by the trained models and *p*(*z*) can be easily estimated from the counts. Intuitively, this probability tells us how likely a topic is to be selected, given the term. We can then compare the time series of topics and hashtags to determine whether they are similar. Our hypothesis is that if they look similar on the time series, the topic may be a good choice for explaining the events behind the hashtag. Notice that we are not seeking an exact match here, since a topic has many more terms than a single hashtag and may explain multiple events. Moreover, we transform the volumes into probabilities. We plot the time series of hashtags and the time series of the selected topics in Figs. 5, 6 and 7.
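The step of scoring topics for a given hashtag term from *p*(*w*|*z*) and *p*(*z*) can be sketched with Bayes' rule as follows. The function name and the toy numbers are hypothetical, and the sketch assumes the term has non-zero probability under at least one topic.

```python
def topic_given_term(term_id, word_given_topic, topic_prior):
    """Compute p(z | w) for one term from p(w | z) and p(z) via Bayes' rule."""
    joint = [word_given_topic[z][term_id] * topic_prior[z]
             for z in range(len(word_given_topic))]
    total = sum(joint)  # p(w), the normalising constant
    return [j / total for j in joint]

# Toy model with two topics and two terms; term 0 plays the hashtag.
word_given_topic = [[0.8, 0.2], [0.2, 0.8]]
topic_prior = [0.5, 0.5]
scores = topic_given_term(0, word_given_topic, topic_prior)
best_topic = max(range(len(scores)), key=lambda z: scores[z])
```

The topic with the highest score is the one whose time series would then be compared against the hashtag's time series.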

From the results, HTMT and HTM (red and blue curves) both smooth out the local fluctuations of the topics (gray curves) from the hashtags shown, while preserving the sharp peaks that may indicate a significant change of content in Twitter. Moreover, HTM tends to “overfit” the multiple spikes in the occurrence of “Scotrail” between hours 300 and 400. In addition, HTMT can better match the peaks of hashtags, indicating that the method can better reflect real events. This may be because the hyper-parameters *β* in HTMT are governed by the time-dependent functions of both sources, where the rise and fall of these values may give good hints for the model to assign topics to words, leading to improved performance on temporal dynamics. This provides additional answers to the last research question posed at the beginning of this section.

## 6 Conclusion

Mining topics from heterogeneous sources is still a challenge, especially when the temporal dynamics of the sources are intertwined. In this paper, we aim to automatically analyse multiple correlated sources with their corresponding temporal behaviour. The new model goes beyond the existing Source-LDA models because (i) it blends several topic models while preserving the characteristics of each source; and (ii) it associates each topic with a deconvolution function that characterises its topic importance and social response over time, which enables our topic model to capture some of the hidden contextual information of feeds.

There are several interesting directions for future work. First, it will be interesting to investigate more complex aspects related to the temporal dynamics of data streams, e.g., sentiment shift. Additionally, in order to model and gain insight from real events, topics can be linked with a group of named entities such that each topic can be largely explained by these entities and their relations in knowledge graphs, e.g., Freebase.^{3}

## Notes

### Acknowledgements

We thank the anonymous reviewer for their helpful comments. We acknowledge support from the EPSRC funded project named A Situation Aware Information Infrastructure Project (EP/L026015) and from the Economic and Social Research Council [grant number ES/L011921/1]. This work was also partly supported by NSF grant #61572223. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the sponsor.

## References

- Alsumait, L., Barbará, D., & Domeniconi, C. (2008). Online lda: Adaptive topic model for mining text streams with application on topic detection and. ICDM’08.
- Asur, S., Huberman, B.A., Szabo, G., & Wang, C. (2011). Trends in social media: Persistence and decay. Available at SSRN 1755748.
- Blei, D.M., & Lafferty, J.D. (2006). Dynamic topic models. In *ICML ’06* (pp. 113–120).
- Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent dirichlet allocation. *Journal of Machine Learning Research*, *3*, 993–1022.
- Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P.K. (2010). Measuring user influence in twitter: the million follower fallacy. *ICWSM ’10*, *10*(10–17), 30.
- Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the response function of a social system. *Proceedings of the National Academy of Sciences*, *105*(41), 15649–15653.
- Doyle, G., & Elkan, C. (2009). Accounting for burstiness in topic models. ICML ’09.
- Dubey, A., Hefny, A., Williamson, S., & Xing, E.P. (2013). A nonparametric mixture model for topic modeling over time.
- Fiscus, J.G., & Doddington, G.R. (2002). Topic detection and tracking. chapter Topic Detection and Tracking Evaluation Overview, pp. 17–31.
- Gaikovich, K.P. (2004). Inverse problems in physical diagnostics. Nova Publishers.
- Ghosh, R., & Asur, S. (2013). Mining information from heterogeneous sources: A topic modeling approach. In *Proc. of the MDS Workshop at the 19th ACM SIGKDD (MDS-SIGKDD’13)*.
- Heinrich, G. (2005). Parameter estimation for text analysis. Technical report.
- Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. *Machine Learning*, *45*, 256–269.
- Hong, L., Dom, B., Gurumurthy, S., & Tsioutsiouliklis, K. (2011). A time-dependent topic model for multiple text streams.
- Hong, L., Yin, D., Guo, J., & Davison, B.D. (2011). Tracking trends: incorporating term volume into temporal topic models.
- Kirkeby, O., Nelson, P., Hamada, H., Orduna-Bustamante, F., et al. (1998). Fast deconvolution of multichannel systems using regularization. *IEEE Transactions on Speech and Audio Processing*, *6*(2), 189–194.
- Lau, J.H., Collier, N., & Baldwin, T. (2012). On-line trend analysis with topic models: #twitter trends detection topic model online. In *COLING* (pp. 1519–1534).
- Leskovec, J., Backstrom, L., & Kleinberg, J. (2009). Meme-tracking and the dynamics of the news cycle.
- Li, T., Ma, S., & Ogihara, M. (2004). Entropy-based criterion in categorical clustering.
- Mallat, S. (1999). A wavelet tour of signal processing. Academic press.
- Masada, T., Fukagawa, D., Takasu, A., Hamada, T., Shibata, Y., & Oguri, K. (2009). Dynamic hyperparameter optimization for bayesian topical trend analysis. CIKM ’09.
- Miller, J.W., & Alleva, F. (1996). Evaluation of a language model using a clustered model backoff. volume 1 of ICSLP’ 96, pages 390–393.
- Minka, T. (2000). Estimating a dirichlet distribution.
- Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents.
- Tsytsarau, M., Palpanas, T., & Castellanos, M. (2014). Dynamics of news events and social media reaction.
- Wang, C., Blei, D., & Heckerman, D. (2012). Continuous time dynamic topic models. arXiv:1206.3298.
- Wang, X., & McCallum, A. (2006). Topics over time: a non-markov continuous-time model of topical trends.
- Zhai, C.X., Velivelli, A., & Yu, B. (2004). A cross-collection mixture model for comparative text mining. In *KDD ’04* (pp. 743–748).
- Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., & Li, X. (2011). Comparing twitter and traditional media using topic models. In *ECIR’11* (pp. 338–349).

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.