
Journal of Intelligent Information Systems, Volume 51, Issue 1, pp 115–137

Topic detection and tracking on heterogeneous information

  • Long Chen
  • Huaizhi Zhang
  • Joemon M Jose
  • Haitao Yu
  • Yashar Moshfeghi
  • Peter Triantafillou
Open Access
Article

Abstract

Given the proliferation of social media and the abundance of news feeds, a substantial amount of real-time content is distributed through disparate sources, which makes it increasingly difficult to glean and distill useful information. Although combining heterogeneous sources for topic detection has gained attention from several research communities, most existing approaches fail to consider the interaction among different sources and their intertwined temporal dynamics. To address this concern, we study the dynamics of topics from heterogeneous sources by exploiting both their individual properties (including temporal features) and their inter-relationships. We first implement a heterogeneous topic model that enables topic–topic correspondence between the sources by iteratively updating its topic–word distribution. To capture temporal dynamics, the topics are then correlated with a time-dependent function that characterises their social response and popularity over time. We extensively evaluate the proposed approach and compare it to state-of-the-art techniques on a heterogeneous collection. Experimental results demonstrate that our approach significantly outperforms the existing ones.

Keywords

Topic detection · Heterogeneous sources · Temporal dynamics · Social response · Topic importance

1 Introduction

Social media, such as Twitter and Facebook, have been used widely for communicating breaking news, eyewitness accounts, and even organising flash mobs. Users of these websites have become accustomed to receiving timely updates on important events. For example, Twitter was heavily used during numerous international events, such as the Ukrainian crisis (2014) and the Malaysia Airlines Flight 370 crisis (2014). From a news consumer’s perspective, it makes sense to combine social media with traditional news outlets, e.g., BBC News, for timely and effective news consumption. However, the latter have different temporal dynamics than the former, which entails a deep understanding of the interaction between the new and old sources of news.

Combining heterogeneous sources of news has been investigated by several research communities (cf. Section 2). However, existing works mostly merge documents from all sources into a single collection and then apply topic modelling techniques to it to detect common topics. This may bias the results in favour of the source with the highest frequency of publication. Furthermore, the heterogeneity that characterises each source may not be maintained. For example, the Twitter data stream is distinctively biased towards current topics and the temporal activity of users; this means that, for effective topic modelling, computational treatment of users’ social behaviours (such as author information and the number of retweets) is also needed. Alternatively, running existing topic models on each source separately can preserve the characteristics of each source, but makes it difficult to capture a common topic distribution and the interactions among different sources. To solve the aforementioned problems, we present a heterogeneous topic model that combines multiple disparate sources in a complementary manner by assuming a common-topic distribution variable for both the Twitter and news collections.

Furthermore, people not only want to know which topics can be found in these disparate data sources, but also to understand their temporal dynamics, as well as their importance. However, the dynamics of most topic streams are so intertwined across sources that their impact is not easily recognisable. To determine the impact of a topic, it is critical to consider the evolution of the aggregated social response from social media (i.e., Twitter), in addition to the temporal dynamics of the news media. However, news media can sustain a news cycle (Tsytsarau et al. 2014) all by themselves while keeping a growth shape, such that the burst shape of publication is not always consistent with the beginning of the topic. Therefore, we use a deconvolution approach (a well-known technique in audio and signal processing (Kirkeby et al. 1998; Mallat 1999)) that addresses these concerns through a special compound function considering both topic importance and its social response in social media.

In this study, we addressed the problem of Topic Detection and Tracking (TDT) from heterogeneous sources. Our study is based on recent advances in both topic modelling and information cascading in social media. In particular, we designed a heterogeneous topic model that allows information from disparate sources to communicate with each other while maintaining the properties of each source in a unified framework. For temporal modelling, we proposed a compound function through convolution, which optimally balances the topic importance and its social response. By combining these two models, we effectively modelled temporal dynamics from disparate sources in a principled manner.

The rest of the paper is organised as follows. Section 2 describes background and related work. In Sections 3 and 4, we discuss our model in detail. Section 5 provides experimental results on two real-world datasets. We present our conclusions and future work in Section 6.

2 Related work

There are approaches that tackle sub-tasks of our problem in various domains; however, they cannot be combined to solve the problem that our model solves. To the best of our knowledge, there is no existing work capable of automatically detecting and tracking topics from heterogeneous sources while simultaneously preserving the properties of each source. The task of this paper can be loosely organised into two independent sub-tasks: (1) topic detection from heterogeneous sources, and (2) characterising their temporal dynamics. In this section, we review these two lines of related work.

2.1 Topic detection from heterogeneous sources

One of the track tasks included in Topic Detection and Tracking (TDT) (Fiscus & Doddington 2002) is topic detection, where systems cluster streamed stories into bins depending on the topics being discussed. Topic modelling techniques, such as PLSA (Hofmann 2001) and LDA (Blei et al. 2003), have been shown to be effective for topic detection. In the seminal online-LDA model (Alsumait et al. 2008), the authors update LDA incrementally with information inferred from the new stream of data. Lau et al. (2012) subsequently demonstrated the effectiveness of online-LDA for Twitter data streams.

However, there has not been extensive research on topic detection from heterogeneous sources. Zhai et al. (2004) proposed a cross-collection mixture model to detect common topics and local ones, respectively. The state-of-the-art approach (Hong et al. 2011) utilises the collection model for mining from multiple sources, together with a meme-tracking model that iteratively updates the hyper-parameter controlling the document-topic distribution in order to capture the temporal dynamics of the topic. In the Collection Model, a word belongs to either a local topic or a common topic, the probability of which is drawn from a Bernoulli distribution. Common topics are obtained by merging all documents from all sources into one single collection and then applying LDA to it; local topics are computed by applying LDA to each source individually. However, this assumes that there is no correspondence between local topics across different sources, nor is information exchanged between the local topics and the common ones, whereas in real-world scenarios information from multiple sources constantly interacts as a topic evolves. To address this problem, Ghosh & Asur (2013) proposed a Source-LDA model to detect topics from multiple sources with the aim of incorporating source interactions, which is somewhat similar to our idea. However, there are stark differences between their work and ours. First, their model assumes that there is no order among the documents in the collection, so the temporal dynamics of each source are completely ignored. Second, since the same LDA model is applied to the different sources, they do not exploit the properties that characterise each data source, whereas we use the Author Topic model and LDA for the Tweets and News sources respectively (cf. Section 3).

2.2 Characterizing temporal dynamics and social response

In the seminal work on dynamic topic modelling (Blei & Lafferty 2006), each topic defines a multinomial distribution over a set of terms; for each word of each document, a topic is drawn from the mixture and a term is subsequently drawn from the multinomial distribution corresponding to that topic. This has led to the recent development of incorporating temporal dynamics into topic models (e.g., Wang et al. 2012; Hong et al. 2011; Masada et al. 2009; Wang and McCallum 2006; Dubey et al. 2013). These models let us gain insight into datasets with temporal changes in a convenient way and open future directions for utilising such models in a more general fashion. However, their analysis was conducted on academic datasets. To investigate the effectiveness of dynamic topic models, Leskovec et al. (2009) proposed a framework to capture textual variants of phrases. It is based on two assumptions about the interaction of meme sources: imitation and recency. The imitation hypothesis assumes that news sources are more likely to publish on events that have already seen a large volume of publications; the recency hypothesis marks the tendency to publish more on recent events. While effective, their work focused only on the temporal dynamics of one source, namely the news outlets, whereas in this paper two different types of sources are considered simultaneously.

In addition, while there is previous work investigating temporal dynamics in a heterogeneous context (Hong et al. 2011), a deeper understanding of topic dynamics entails the extraction of burst shapes and the modelling of social response (Tsytsarau et al. 2014). This requirement is important because publication volume often contains background information, which may mask the individual patterns of topics. Another important factor is topic importance, first proposed by Cha et al. (2010). The authors studied influence within Twitter and performed a comparison of three different measures of influence: indegree, retweets, and user mentions. They discovered that retweets and user mentions are better measures of topic influence than indegree. Unlike these previous studies, in this paper we aim to incorporate both topic importance and social response into a unified topic modelling framework.

3 Heterogeneous topic model

In this section, we explain the heterogeneous topic model, which can detect topics in heterogeneous sources while preserving the properties of each source. Throughout the paper, we use the general terms “document” and “feed” to refer to the basic text units. For example, in the context of news media, a feed is a news article, whereas for Twitter, a feed is a tweet message.

3.1 Model description

To correctly model the topics and the distinct characteristics of both Twitter and Newswire, we propose the Heterogeneous Topic Model, abbreviated as HTM. Our model can identify topics across disparate sources while preserving the properties of each source: any local topic i of a source j corresponds to topic i of another source k, while topic i conforms to the properties of the local source. With the notation given in Table 1, Fig. 1 illustrates the graphical model for HTM, which blends two topic modelling techniques, namely the Author Topic Model (ATM) (Rosen-Zvi et al. 2004) and Latent Dirichlet Allocation (LDA).
Table 1

Notations used in HTM

N: Number of words in the collection (Twitter and News)
\(N_d\): Number of common words appearing in both collections
W: Vocabulary size
T: Number of topics
A: Number of authors in Twitter
\(C^{WT}\): Number of words assigned to the topic of a word
\(C^{TA}\): Number of words assigned to the topic of an author
\(n_{d}^{(t)}\): Frequency with which topic t is assigned to a word in document d
\(n_{t}^{(w)}\): Frequency with which word w is assigned to topic t
\(w^{(k)}_{di}\): Words in Twitter document d
\(w^{(n)}_{di}\): Words in News document d
z: Topic assignment
\(z_{di}\): Topic assignment for word \(w_{di}\)
x: Author assignments
\(x_{di}\): Author assignment for word \(w_{di}\)
\(\mathcal{A}\): Authors of the corpus in Twitter
\(\alpha^{(k)}\): Dirichlet prior for Twitter
\(\alpha^{(n)}\): Dirichlet prior for News documents
\(\alpha_t\): Dirichlet prior for topic t in News documents
\(\beta\): Dirichlet prior
\(\beta_w\): Dirichlet prior for word w
\(\eta\): Probabilities of words given topics

Fig. 1

The framework of the heterogeneous topic model (cf. Table 1)

The various probability distributions we learn from the HTM model characterise the different factors of each source that can affect the topics. For content generation, each word w^(n) in a news article is associated only with a topic z, while each word w^(k) in a tweet is related to two latent variables, namely an author a and a topic z. The communication between the topics of different sources is governed by a parameter η, which represents the common topic distribution; β is the prior distribution of η. The observed variables are the author names of tweets, the words in the Twitter dataset, and the words in the News dataset; the rest are all unobserved. Note that the D on the left side of the figure represents the Twitter documents and the D on the right side represents the News documents.

Conditioned on the set of authors from Twitter and its distribution over topics, the generative process of HTM is summarised in Algorithm 1, where variable^(k) denotes a variable of Twitter and variable^(n) denotes a variable of News. The local words are the ones that appear in only a single source, whereas the common words are the ones that occur in all sources. Under the generative process, each common topic z in News and Twitter is drawn independently when conditioned on Θ and Φ, respectively.
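Since Algorithm 1 is rendered as a figure in the original, the following runnable sketch illustrates a generative process consistent with the description above (ATM for tweets, LDA for news, a shared topic–word distribution η); all sizes and hyper-parameter values are toy choices, and the paper's exact algorithm may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T topics, W vocabulary terms, A Twitter authors.
T, W, A = 5, 100, 10
beta, alpha_n, alpha_k = 0.01, 0.1, 0.1

eta = rng.dirichlet([beta] * W, size=T)      # common topic-word distributions
phi = rng.dirichlet([alpha_k] * T, size=A)   # author-topic distributions (Twitter side)

def generate_news_doc(n_words=50):
    """LDA-style generation for a News feed."""
    theta = rng.dirichlet([alpha_n] * T)     # document-topic distribution
    topics = rng.choice(T, size=n_words, p=theta)
    return [rng.choice(W, p=eta[z]) for z in topics]

def generate_tweet(authors, n_words=10):
    """ATM-style generation for a tweet: each word picks an author, then a topic."""
    xs = rng.choice(authors, size=n_words)
    topics = [rng.choice(T, p=phi[x]) for x in xs]
    return [rng.choice(W, p=eta[z]) for z in topics]
```

Both generators draw words from the same η, which is what lets topics correspond across the two sources.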

Holding the conditional independence property, we have the following basic equation for Gibbs sampler:
$$\begin{array}{@{}rcl@{}} && p(z_{di, dj} = t | w^{(k)}_{di} = w_{i}, z_{-di}, x_{-di}, w^{(k)}_{-di}, \\ &&\qquad w^{(n)}_{dj} = w_{j}, z_{-dj}, w^{(n)}_{-dj}, \mathcal{A}, \alpha^{(k)}, \alpha^{(n)}, \beta)\\ &&\propto p(z_{di, dj} = t, w^{(k)}_{di} = w_{i}, w^{(n)}_{dj} = w_{j} | z_{-di}, x_{-di}, \\ &&\qquad w^{(k)}_{-di}, z_{-dj}, w^{(n)}_{-dj}, \mathcal{A}, \alpha^{(k)}, \alpha^{(n)}, \beta)\\ &&= \frac{p(z, w^{(k)}, w^{(n)}| \mathcal{A}, \alpha^{(k)}, \alpha^{(n)}, \beta)} {p(z_{-di}, z_{-dj}, w^{(k)}_{-di}, w^{(n)}_{-dj}| \mathcal{A}, \alpha^{(k)}, \alpha^{(n)}, \beta)}\\ &&= \frac{p(z, w^{(k)} | \mathcal{A}, \alpha^{(k)}, \beta)} {p(z_{-di}, w^{(k)}_{-di} | \mathcal{A}, \alpha^{(k)}, \beta)} \cdot \frac{p(z, w^{(n)} | \alpha^{(n)}, \beta)}{p(z_{-dj}, w^{(n)}_{-dj} | \alpha^{(n)}, \beta)} \end{array} $$
(1)
where \(z_{-di}\), \(w^{(k)}_{-di}\) stand for the vectors of topic assignments and word observations excluding the i-th word of tweet d, and \(z_{-dj}\), \(w^{(n)}_{-dj}\) stand for the vectors of topic assignments and word observations excluding the j-th word of news document d.
After integrating the joint distribution of the variables (cf. Appendix), we get the following equation for Gibbs sampler:
$$\begin{array}{@{}rcl@{}} & p(z_{di, dj} = t | w^{(k)}_{di} = w_{i}, z_{-di}, x_{-di}, w^{(k)}_{-di}, \\ &\qquad w^{(n)}_{dj} = w_{j}, z_{-dj}, w^{(n)}_{-dj}, \mathcal{A}, \alpha^{(k)}, \alpha^{(n)}, \beta)\\ &= \frac{C_{wt, -di}^{WT} + \beta_{w}}{{\sum}_{w^{\prime}}C_{w^{\prime}t,-di}^{WT} + W\beta} \cdot \frac{C_{ta, -di}^{TA} + \alpha^{(k)}}{{\sum}_{t^{\prime}}C_{t^{\prime}a,-di}^{TA} + T\alpha^{(k)}} \\ &\qquad \cdot \frac{n_{t, -i}^{(w)} + \beta_{w}} {{\sum}_{w=1}^{W} n_{t, -i}^{(w)} + \beta_{w}} \cdot \frac{n_{d, -i}^{(t)} + \alpha_{t}} {[{\sum}_{t=1}^{T} n_{d}^{(t)} + \alpha_{t}] - 1} \end{array} $$
(2)

Note that the sampled words from one collection may not be observed in the other collection. In such cases, the prior probability of the topic over the word is set to one.
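To make the update in (2) concrete, here is a minimal sketch of one collapsed Gibbs draw, assuming the count matrices have already been decremented for the word pair being resampled; all names are illustrative, not the authors' released code.

```python
import numpy as np

def sample_topic(w_i, w_j, a, d, C_WT, C_TA, n_tw, n_dt,
                 alpha_k, alpha_n, beta, W, T):
    """Draw a common topic for Twitter word w_i (author a) and News word w_j (doc d).

    C_WT: (W, T) word-topic counts (Twitter side); C_TA: (T, A) topic-author counts;
    n_tw: (T, W) topic-word counts (News side); n_dt: (D, T) document-topic counts.
    """
    # Twitter side: the first two factors of (2)
    p_twitter = ((C_WT[w_i, :] + beta) / (C_WT.sum(axis=0) + W * beta)
                 * (C_TA[:, a] + alpha_k) / (C_TA[:, a].sum() + T * alpha_k))
    # News side: the last two factors of (2)
    p_news = ((n_tw[:, w_j] + beta) / (n_tw.sum(axis=1) + W * beta)
              * (n_dt[d, :] + alpha_n) / (n_dt[d, :].sum() + T * alpha_n - 1))
    p = p_twitter * p_news
    return int(np.random.choice(T, p=p / p.sum()))
```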

3.2 Model fitting via Gibbs sampling

In order to estimate the hidden variables of HTM, we use collapsed Gibbs Sampling. However, the derivation of posterior distributions for Gibbs sampling in HTM is complicated by the fact that common distribution η is a joint distribution of two mixtures. As a result, we need to compute the joint distribution of w (k) and w (n) in the Gibbs sampling process. The posterior distributions for Gibbs sampling in HTM are
$$\beta_{t}^{(n)} \,|\, \mathbf{z}, \mathcal{D}^{\text{train}}, \beta \sim \text{Dirichlet}\left( C_{t}^{WT}+(C_{t}^{WT})^{(n)}+\beta\right) $$
(3)
$$\beta_{t}^{(k)} \,|\, \mathbf{z}, \mathcal{D}^{\text{train}}, \beta \sim \text{Dirichlet}\left( C_{t}^{WT}+(C_{t}^{WT})^{(k)}+\beta\right) $$
(4)
$$\beta_{t} \,|\, \mathbf{z}, \mathcal{D}^{\text{train}}, \beta \sim \text{Dirichlet}\left( C_{t}^{WT}+(C_{t}^{WT})^{(n)}+(C_{t}^{WT})^{(k)}+\beta\right) $$
(5)
$$\phi_{a} \,|\, \mathbf{x}, \mathbf{z}, \mathcal{D}^{\text{train}}, \alpha^{(k)} \sim \text{Dirichlet}\left(C_{a}^{TA}+\alpha^{(k)}\right) $$
(6)
$$\theta_{d} \,|\, \mathbf{w}, \mathbf{z}, \mathcal{D}^{\text{train}}, \alpha^{(n)} \sim \text{Dirichlet}\left(n_{d}+\alpha^{(n)}\right) $$
(7)
where \(\beta _{t}^{(n)}\) represents the local topics that belong to News and \(\beta _{t}^{(k)}\) the local topics that belong to Twitter; \((C^{WT})^{(n)}\) and \((C^{WT})^{(k)}\) denote the samples of the topic-term matrix in which each term is observed only in News and only in Twitter, respectively; \(C^{WT}\) is the sample of the topic-term matrix in which each word can be observed in both collections. Since the Dirichlet distribution is conjugate to the multinomial distribution, the posterior means of \(\mathcal {A}\), Θ and Φ given x, z, \(\mathcal {\textbf {w}}\), \(\mathcal {D}^{\text {train}}\), α (n), α (k) and β can be obtained as follows:
$$\begin{array}{@{}rcl@{}} &&E[\beta_{wt}^{(n)} | \mathrm{\textbf{z}}^{s}, \mathcal{D}^{\text{train}}, \beta] = \frac{\left( {C_{wt}^{WT} + (C_{wt}^{WT})^{(n)}}\right)^{s} + \beta_{w}} {{\sum}_{w^{\prime}} \left( C_{w^{\prime}t}^{WT} + (C_{w^{\prime}t}^{WT})^{(n)}\right)^{s} + W\beta} \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} &&E[\beta_{wt}^{(k)} | \mathrm{\textbf{z}}^{s}, \mathcal{D}^{\text{train}}, \beta] = \frac{\left( {C_{wt}^{WT} + (C_{wt}^{WT})^{(k)}}\right)^{s} + \beta_{w}} {{\sum}_{w^{\prime}} \left( C_{w^{\prime}t}^{WT} + (C_{w^{\prime}t}^{WT})^{(k)}\right)^{s} + W\beta} \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} &&E[\beta_{wt} | \mathrm{\textbf{z}}^{s}, \mathcal{D}^{\text{train}}, \beta] = \frac{\left( {C_{wt}^{WT} + (C_{wt}^{WT})^{(n)}} + (C_{wt}^{WT})^{(k)}\right)^{s} + \beta_{w}} {{\sum}_{w^{\prime}} \left( C_{w^{\prime}t}^{WT} + (C_{w^{\prime}t}^{WT})^{(n)} + (C_{w^{\prime}t}^{WT})^{(k)}\right)^{s} + W\beta} \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} &&E[\phi_{ta} | \mathrm{\textbf{z}}^{s}, \mathrm{\textbf{x}}^{s}, \mathcal{D}^{\text{train}}, \alpha^{(k)}] = \frac{(C_{ta}^{TA})^{s} + \alpha^{(k)}}{{\sum}_{t^{\prime}} (C_{t^{\prime}a}^{TA})^{s} + T\alpha^{(k)}} \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &&E[\theta_{dt}|\mathrm{\textbf{w}}^{s}, \mathrm{\textbf{z}}^{s}, \mathcal{D}^{\text{train}}, \alpha^{(n)}] = \frac{\left( n_{d}^{(t)}\right)^{s} + \alpha_{t}} {{\sum}_{t=1}^{T} \left( n_{d}^{(t)}\right)^{s} + \alpha_{t}} \end{array} $$
(12)
where s refers to a sample from the Gibbs sampler over the full collection. The posteriors \(\mathcal {A}\), Θ and Φ correspond to the author distribution over topics, the topic distribution over words, and the document distribution over topics, respectively.
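Because each posterior mean in (8)–(12) is a simple ratio of smoothed counts, they can be read directly off a single Gibbs sample. A sketch, assuming the count matrices follow the shapes noted in the comments (illustrative names, not the authors' code):

```python
import numpy as np

def posterior_means(C_WT, C_WT_n, C_WT_k, C_TA, n_dt, alpha_k, alpha_n, beta):
    """C_WT, C_WT_n, C_WT_k: (W, T) counts for common, News-only and Twitter-only
    terms; C_TA: (T, A) topic-author counts; n_dt: (D, T) document-topic counts."""
    W, T = C_WT.shape
    # (8)-(10): topic-word distributions for News, Twitter and the full collection
    beta_news = (C_WT + C_WT_n + beta) / ((C_WT + C_WT_n).sum(axis=0) + W * beta)
    beta_twit = (C_WT + C_WT_k + beta) / ((C_WT + C_WT_k).sum(axis=0) + W * beta)
    both = C_WT + C_WT_n + C_WT_k
    beta_all = (both + beta) / (both.sum(axis=0) + W * beta)
    # (11): author-topic distribution Phi
    phi = (C_TA + alpha_k) / (C_TA.sum(axis=0) + T * alpha_k)
    # (12): document-topic distribution Theta
    theta = (n_dt + alpha_n) / (n_dt.sum(axis=1, keepdims=True) + T * alpha_n)
    return beta_news, beta_twit, beta_all, phi, theta
```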

4 Modelling temporal dynamics

In this section, we review a temporal model for topics, introduced in Tsytsarau et al. (2014) and present an alternate derivation. We start with the description of the basic social response function and our representation of topic importance. Then we introduce a deconvolution approach over the time series of news and Twitter data, in order to extract important properties of topics.

4.1 Modelling impacting topics

As one may notice, not every publication outburst is driven by external stimuli. For example, there are two different types of dynamics on Twitter: daily activity and trending activity. The former is mostly driven by work schedules across time zones, while the latter is caused by a clearer pattern of topic interest and is the subject of our study. We start with the following setting, which treats the observed topic dynamics (the volume of publications) as the response of social media to topic importance. The result is decomposed into two functions: the topic importance function and the social media response function:
$$ n(t) = {\int}_{-\infty}^{+\infty}srf(\tau) e(t-\tau,r) d\tau $$
(13)
where srf(t) is the social response function, and e(t,r), which reflects the importance of the topic at time t, is a joint function of the topic's time course t and the number of retweets r:
$$e(t,r) = \begin{cases} \ln(r)\, a\, t, & 0 < t \le t_{0} \\ \ln(r)\, \left[(a+b)\, t_{0} - b\, t\right], & t > t_{0} \end{cases}$$
where a is the buildup rate and b is the decay rate. The intuition behind e(t,r) is that certain topics should have a better chance of being selected, since they have a higher popularity and a larger volume of social response. For example, during the “heat wave” event, the UK's hottest day in twelve years, news articles and Twitter messages were more inclined to talk about the weather rather than politics. Furthermore, we observe that hot topics usually correspond to a higher retweet count. The form of (13) has been demonstrated to be effective for capturing spikes of news articles and tweets (Hong et al. 2011).
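As an illustration, the following sketch evaluates this piecewise importance function; the buildup rate a, decay rate b, and onset t0 are illustrative values, not estimates from the paper:

```python
import numpy as np

def topic_importance(t, r, a=1.0, b=0.5, t0=24.0):
    """Piecewise importance e(t, r): linear buildup until t0, linear decay after."""
    t = np.asarray(t, dtype=float)
    rise = np.log(r) * a * t                      # buildup phase, 0 < t <= t0
    fall = np.log(r) * ((a + b) * t0 - b * t)     # decay phase, t > t0
    return np.clip(np.where(t <= t0, rise, fall), 0.0, None)
```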
However, in order to restore the original topic sequence, it is important to know the exact shape of srf(t). To model this shape, we employ a family of normalised decaying functions (Asur et al. 2011), given in the following equations:
$$\begin{array}{@{}rcl@{}} && linear \quad srf(t) = \frac{2}{\tau_{0}}\left(1-\frac{t}{\tau_{0}}\right) h(t)\, h(\tau_{0}-t) \end{array} $$
(14)
$$\begin{array}{@{}rcl@{}} && hyperbolic \quad srf(t) = h(t)\,\frac{\alpha-1}{\tau_{0}}\left(\frac{t+\tau_{0}}{\tau_{0}}\right)^{-\alpha} \end{array} $$
(15)
$$\begin{array}{@{}rcl@{}} && exponential \quad srf(t) = \frac{1}{\tau_{0}}\,e^{-t/\tau_{0}}\,h(t) \end{array} $$
(16)
where h(t) is the Heaviside step function; the linear response has the shortest effect on the time series and the hyperbolic response the longest. We employ decaying response functions for two reasons. First, topics often become obsolete and cease being published within a short time period. Second, the shapes of response functions often carry additional information regarding the impact and expectation of a topic.
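The three response families can be sketched as follows; τ0 and α are illustrative values, and h(t) is realised by masking negative times:

```python
import numpy as np

def srf_linear(t, tau0=24.0):
    t = np.asarray(t, dtype=float)
    inside = (t >= 0) & (t <= tau0)              # h(t) h(tau0 - t)
    return np.where(inside, (2.0 / tau0) * (1.0 - t / tau0), 0.0)

def srf_hyperbolic(t, tau0=24.0, alpha=2.5):
    t = np.asarray(t, dtype=float)
    base = np.maximum((t + tau0) / tau0, 1e-12)  # guard against negative bases
    return np.where(t >= 0, (alpha - 1) / tau0 * base ** (-alpha), 0.0)

def srf_exponential(t, tau0=24.0):
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, np.exp(-np.maximum(t, 0) / tau0) / tau0, 0.0)
```

Each function integrates to one over t ≥ 0, matching the normalisation of (14)–(16).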

4.2 Topic deconvolution

Deconvolution is the inverse of convolution (Gaikovich 2004); here it aims to recover the original topic importance sequence. The convolution theorem states that the Fourier transform of a time-domain convolution of two series equals the product of their Fourier transforms in the frequency domain:
$$\begin{array}{@{}rcl@{}} \mathbb{F}\{n(t)\} = \mathbb{F}\{e(t,r)*srf(t)\} = \mathbb{F}\{e(t,r)\} \cdot \mathbb{F}\{srf(t)\} \end{array} $$
(17)
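In practice, (17) suggests recovering e(t, r) by spectral division. A minimal sketch with NumPy's FFT; the eps regulariser (in the spirit of Kirkeby et al. 1998) is our own addition to stabilise near-zero frequencies, not a detail taken from the paper:

```python
import numpy as np

def deconvolve(n_t, srf_t, eps=1e-3):
    """Recover the importance series from the observed volume n(t) and srf(t)."""
    N = len(n_t)
    F_n = np.fft.rfft(n_t, n=N)
    F_srf = np.fft.rfft(srf_t, n=N)
    # Regularised spectral inverse: F{e} = F{n} / F{srf}
    F_e = F_n * np.conj(F_srf) / (np.abs(F_srf) ** 2 + eps)
    return np.fft.irfft(F_e, n=N)
```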
The problem now lies in how to integrate the temporal dynamics described above into our heterogeneous topic model, and how to estimate its parameters through a fitting process. We encode the social response and topic importance by associating the Dirichlet parameter of each topic with a time-dependent function, which controls the popularity of the associated topic and its level of social response. Specifically, we let each dimension \(\beta_k\) of the Dirichlet parameter β be associated with the following time-dependent function:
$$ \beta_k(t) = f_k(t) = \mathbb{F}\{e(t,r)\} \cdot \mathbb{F}\{srf(t)\} $$
(18)
where f_k(t) is the deconvolution model described above. However, if we naively associate β_k with f_k, the model would assume that every topic starts at timestamp 0. In fact, different topics have different starting points \(t_0^k\). Thus we modify it into the following form:
$$ f_k(t) = N_k+\mu(t-t_0^k)\,|t-t_0^k|^{q_{k}} * \mathbb{F}\{e(|t-t_0^k|,r)\} \cdot \mathbb{F}\{srf(|t-t_0^k|)\} $$
(19)
where \(t_0^k\) is the starting timestamp of the topic, \(q_k\) indicates how quickly the topic rises to its peak, and \(N_k\) is the noise level of the topic. The absolute value guarantees that the time-dependent part is only active when t is larger than \(t_0^k\); \(\mu(t-{t_{0}^{k}})\) is a Boolean function that is 1 for \(t > t_0^k\) and 0 otherwise. The intuition behind this equation is that the prior knowledge of each topic is fixed over time (by the “noise” level \(N_k\)). The crux of the problem is to estimate the values of these three hyper-parameters from the data.

The procedure for parameter estimation is summarised in Algorithm 2. In general, we integrate the deconvolution function with Gibbs sampling into an EM framework (similar to Doyle & Elkan 2009). In the E-step, we gather topic assignments and the associated counts by Gibbs sampling using (3). In the M-step, we optimise the proposed deconvolution functions to obtain the updated hyper-parameters for the next iteration. More specifically, the first step is to estimate the Dirichlet parameters β from the word frequencies observed in the Gibbs sampler. This can be done in several ways (Minka 2000); we use Newton's method, where the step parameter ξ can be interpreted as a factor controlling how strongly the topic distribution is smoothed among the common topics. The second step is to use these β values to fit the deconvolution function (19), and then to use the parameters of the fitted deconvolution function as initial values to fit our temporal dynamics function (19).
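For the Dirichlet-estimation step, the paper uses Newton's method (Minka 2000); the simpler fixed-point iteration from the same report is sketched below as a stand-in, estimating a Dirichlet prior from per-document topic counts:

```python
import numpy as np
from scipy.special import digamma

def fit_dirichlet(counts, n_iters=100, tol=1e-6):
    """counts: (D, K) matrix of topic counts per document; returns alpha (K,)."""
    D, K = counts.shape
    alpha = np.ones(K)
    for _ in range(n_iters):
        alpha0 = alpha.sum()
        # Minka's fixed-point update: alpha_k <- alpha_k * num_k / den
        num = digamma(counts + alpha).sum(axis=0) - D * digamma(alpha)
        den = (digamma(counts.sum(axis=1) + alpha0) - digamma(alpha0)).sum()
        new_alpha = alpha * num / den
        if np.abs(new_alpha - alpha).max() < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```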

5 Experiment and results

In particular, this paper aims to investigate three main research questions:
  1. How can the performance of topic modelling techniques be evaluated in an intertwined, heterogeneous context?

  2. How effective is our proposed heterogeneous topic model compared to other topic models in the above context?

  3. Can the temporal dynamics of each source be exploited effectively to further enhance the performance of topic modelling?

5.1 Datasets and metrics

To construct two parallel datasets, we crawled 30 million tweets from 786,823 users in the region of Scotland, from the first hour of April 1, 2015 to hour 720, the last hour of April 30, 2015, i.e., over one month. All these tweets match at least one of a set of what we call Scotland-related information centres: person names, places, and organisations that share information related to Scotland.

In addition, over the entire month, we crawled 224,272 unique resolved URLs. The News dataset is obtained through the Boilerpipe program.1 The total dataset size is 4GB and essentially includes complete online coverage: we have all mainstream media sites plus 100,000 blogs, forums, and other media sites. All our experiments are based on these two datasets.

For each web page, we collected (1) the title of the web page, (2) the text content of the web page (after removing tags, JavaScript, etc.), and (3) the tweets linking to the page. For both the Twitter and the news media datasets, we first removed all stop words. Next, words with a document frequency of less than 10 and words that appeared in more than 70% of the tweets (news feeds) were also removed. Finally, for the Twitter data, we further removed tweets with fewer than three words and all users with fewer than eight tweets.
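A minimal sketch of this filtering pipeline, assuming whitespace tokenisation and an externally supplied stop-word list (both illustrative simplifications):

```python
from collections import Counter

def preprocess(docs, stopwords, min_df=10, max_df_ratio=0.7, min_words=0):
    """Apply the stop-word, document-frequency and rarity filters described above.
    Setting min_words=3 reproduces the Twitter-only tweet-length filter."""
    tokenised = [[w for w in doc.lower().split() if w not in stopwords]
                 for doc in docs]
    df = Counter(w for doc in tokenised for w in set(doc))   # document frequency
    n_docs = len(tokenised)
    keep = {w for w, c in df.items()
            if c >= min_df and c <= max_df_ratio * n_docs}
    docs_out = [[w for w in doc if w in keep] for doc in tokenised]
    return [doc for doc in docs_out if len(doc) >= min_words]
```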

The first metric we used is perplexity, which measures how well a probability model can predict a newly arriving feed (Rosen-Zvi et al. 2004). Better generalisation performance is indicated by a lower value over a held-out feed collection. The basic equation is:
$$ perplexity = exp(-\frac{{\sum}_{d=1}^{D} {\sum}_{i=1}^{N_{d}} \log\: p(w_{d,i} | \mathcal{D}^{\text{train}}, \alpha^{(n)}, \alpha^{(k)}, \beta)} {{\sum}_{d=1}^{D} N_{d}}) $$
(20)
where \(w_{d,i}\) represents the i-th word in feed d. Note that the perplexity is defined by summing over the feeds.
However, the output of different topic models may be significantly different. For example, the outputs of Source-LDA are document-topic distribution and topic-term distribution, while HTM generates topic-term and author-topic distributions:
$$\begin{array}{@{}rcl@{}} P_{\text{LDA}}(w_{d,i} | \mathcal{D}^{\text{train}}, \alpha^{(n)}, \alpha^{(k)}, \beta) &= \sum\limits_{t=1}^{T} \theta_{dt} \eta_{tw} \end{array} $$
(21)
$$\begin{array}{@{}rcl@{}} P_{\text{ATM}}(w_{d,i} | \mathcal{D}^{\text{train}}, \alpha^{(n)}, \alpha^{(k)}, \beta) &= \sum\limits_{t=1}^{T} \phi_{ta} \eta_{tw} \end{array} $$
(22)
As a result, the outputs of different models are not directly comparable. Recalling our first research question at the beginning of Section 5, to make a fair comparison we customise the perplexity metric for the heterogeneous context:
$$ perplexity(T) = exp(-\frac{{\sum}_{t=1}^{T} \log\: p(w_{d,i} | \mathcal{D}^{\text{train}}, \alpha^{(n)}, \alpha^{(k)}, \beta)} {{\sum}_{d=1}^{D} N_{d}}) $$
(23)
where \(p(w_{d,i} | \mathcal {D}^{\text {train}}, \alpha ^{(n)}, \alpha ^{(k)}, \beta )\) equals \(\eta_{tw}\), and T is the total number of topics. The proposed topic perplexity captures how well each topic model can predict an unseen topic. A low perplexity value indicates good performance for the topic model being evaluated.
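A sketch of the topic perplexity in (23), assuming eta is the (T, W) topic-term matrix and each held-out token is represented by its assigned (topic, word) pair; both names are illustrative:

```python
import numpy as np

def topic_perplexity(eta, tokens):
    """tokens: iterable of (topic_id, word_id) pairs from the held-out feeds."""
    tokens = list(tokens)
    log_p = sum(np.log(eta[t, w]) for t, w in tokens)   # sum of log eta_tw
    return float(np.exp(-log_p / len(tokens)))
```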
Another common metric for evaluating the performance of a topic model is entropy (Li et al. 2004), which denotes the expected value of the information contained in the message. Analogously to the perplexity, we define the following equation to compute the topic entropy:
$$ entropy(T) = exp(- \frac{{\sum}_{d=1}^{D} \frac{1}{N_{d}} {\sum}_{i=1}^{N_{d}} \log\: p(z | \mathcal{D}^{\text{train}}, \alpha^{(n)}, \alpha^{(k)}, \beta)} {N}) $$
(24)

Again, the smaller the entropy, the better the topics, since a low value indicates better discriminative power.

This paper compares four topic models:
  1. Source-LDA, which identifies the latent topics by leveraging the distribution of each individual source. Note that we use Twitter as the main source, as Ghosh & Asur (2013) report that choosing Twitter as the main source achieves superior performance over the other options.

  2. Heterogeneous Topic Model (HTM), the model described in Section 3, which applies the Author Topic Model (ATM) to the Twitter dataset and LDA to the News dataset, so that the properties of each source are preserved.

  3. Deconvolution Model (DM), which simply integrates the deconvolution model described in Section 4.2 into the Author Topic Model (ATM) on the Twitter dataset.

  4. Heterogeneous Topic Model with Temporal Dynamics (HTMT) (cf. Section 4.2), which incorporates the deconvolution model into HTM.

5.2 Parameter setting

We randomly sample 80% of the data as the training data and use the remaining 20% as the test data. All models are trained on the same training set and evaluated on the same test set. In the training phase we obtain the topic-term distribution, the number of topics, and all other hyper-parameters. In the testing phase we fix them and run 200 Gibbs-sampling iterations for each feed in the test set, obtaining \(\theta_{dt}\) and \(\phi_{ta}\). It is well known that, in general, different numbers of topics are needed for different datasets to achieve the best topic modelling effect. Hence, we tried the topic models with different values of T. The source code is made open to the research community as online supplementary material.2

For DM, we only use Twitter messages for clustering, with no additional news information. For HTM, we use symmetric Dirichlet priors in the LDA estimation with α = 50/T and β = 0.01, which are common settings in the literature. For HTMT, both HTM and the Deconvolution Model need to be tuned. Usually, deconvolution with small decay parameters is the best setting. However, only a sufficiently high level of deconvolution helps to restore the original topic importance and to depict its temporal dynamics with the desired model. A higher level of deconvolution than necessary may lead to smaller topics (which are absorbed by their larger neighbours), so we apply the lowest level of deconvolution possible while maintaining an adequate level, using the parameter estimation method described above. The parameter settings of the Deconvolution Model were set to be identical to those in Tsytsarau et al. (2014).

Another important parameter for topic modelling is the number of topics T. To find the optimal value of T, we experimented with the different topic models on the training dataset. In Hong et al. (2011), the best performance is achieved at K = 50. As shown in Figs. 2, 3 and 4, however, the optimal performance here is achieved when the number of topics equals 300 for both metrics. One possible explanation is that our dataset is substantially larger than theirs, and therefore requires a larger number of topics to model it.
Fig. 2

This figure shows the topic entropy and perplexity of topic models with varying number of topics T with a linear decay function

Fig. 3

This figure shows the topic entropy and perplexity of topic models with varying number of topics T with a hyperbolic decay function

Fig. 4

This figure shows the topic entropy and perplexity of topic models with varying number of topics T with an exponential decay function

From the training stage, one can also see that the Heterogeneous Topic Model (HTM) brings a substantial performance gain compared to the state-of-the-art approach, Source-LDA. However, the proposed Deconvolution Model (DM) only achieves performance comparable to Source-LDA when applied alone. We conjecture that this is because DM is largely dominated by the local topics of Twitter (due to its high volume). More importantly, when combining DM with HTM, we see that our proposed approach HTMT, which considers both the heterogeneous properties of each source and the intertwined temporal dynamics, outperforms all the other approaches, irrespective of the topic number and the evaluation metric.

5.3 Topic model analysis and case study

A comparison using the paired t-test is conducted for HTM, DM, and Source-LDA against HTMT, as shown in Table 2. It is clear that HTMT significantly outperforms all baseline methods on both the Twitter and News datasets. This indicates that combining heterogeneous sources with temporal dynamics significantly improves the performance of topic modelling.
Table 2

The performance of different methods on the (a) Twitter and (b) News datasets (** and * indicate a statistically significant performance decrease from that of HTMT with p-value < 0.01 and p-value < 0.05, respectively)

              Source-LDA    HTM           DM           HTMT
(a) Twitter
entropy       11.375**      10.596*       9.642*       8.798
perplexity    7269.529**    5924.371**    4943.582*    4136.335
(b) News
entropy       9.374**       8.647**       8.551*       8.295
perplexity    6654.247**    5753.293**    4439.986*    3782.961

In order to visualise the hidden topics and compare the different approaches, we extract topics from the training datasets using Source-LDA, HTM, DM, and HTMT. Since both sources consist of a mixture of Scottish subjects, it is interesting to see whether the extracted topics reflect this mixture. We select a set of topics with the highest value of p(z|d), which represents the average expected value of topic k appearing in a feed over both collections. For each topic in the set, we then select the terms with the highest p(w|z) value. Note that all these models are trained in an unsupervised fashion with T = 300; all other settings are the same as in the experiments above, and the decay function is set to hyperbolic, since it exhibits the best performance on the test dataset (cf. Figs. 2, 3 and 4). The text has been preprocessed by case-folding and stop-word removal (using a standard list of 418 common words). Shown in Table 3 are the most representative words of topics generated by Source-LDA, HTM, DM, and HTMT, respectively. For topic 1, although different models select slightly different terms, all these terms describe the corresponding topic to some extent. For topic 2 (Glasgow), however, the words “commercial” and “jobs” of HTMT are more telling than “food” derived by Source-LDA, and “beautiful” and “smart” derived by HTM. Similar subtle differences can be found for topics 3 and 4 as well. Intuitively, HTMT selects more related terms for each topic than the other methods, which shows the better performance obtained by considering both the heterogeneous structure and the temporal dynamics information. This observation answers the last research question: our HTMT framework supersedes HTM by providing more valuable and reinforced information.
Table 3

The representative terms generated by the Source-LDA, HTM, DM, and HTMT models

Model       Topic 1 (VoteSNP)     Topic 2 (Glasgow)       Topic 3 (Scotrail)        Topic 4 (Loch Ness)
Source-LDA  VoteSNP, wish         Glasgow, Sccotish       Scotrail, transport       loch, catfish
            Sccotish, support     city, food              Sccotish, franchise       nessie, marathon
            voting, UK            UK, Scotland            route, company            monster, deep
            party, first          Scotland, people        train, disruption         inverness, food
            Scotland, member      council, project        rail, lines               visit, life
HTM         VoteSNP, leaflet      Glasgow, merchant       Scotrail, transport       loch, water
            Scotland, support     Edinburgh, investment   Sccotish, rail            lochness, beautiful
            Scottish, UK          Scotland, Scotland      station, fares            monster, tour
            party, delivering     Scotish, beautiful      route, journey            inverness, picture
            voting, campaign      council, smart          train, Edinburgh          celebrity, lodge
DM          VoteSNP, wish         Glasgow, Sccotish       Scotrail, transport       loch, lodge
            Scotland, love        city, London            Sccotish, franchise       lochness, tour
            Scottish, support     UK, council             route, company            nessie, highland
            voting, everyone      Scotland, project       train, Edinburgh          monster, best
            UK, party             Edinburgh, merchant     ticket, fare              inverness, life
HTMT        VoteSNP, labour       Glasgow, Sccotish       Scotrail, rail            loch, hideaway
            Scotland, support     Scotland, commercial    Sccotish, delay           nessie, highland
            Scottish, everyone    Edinburgh, Scotland     transportation, signal    monster, deep
            Glasgow, Edinburgh    city, job               train, disruption         inverness, uncover
            party, action         council, investment     smoker, franchise         visit, home

The terms are ranked vertically according to the probability P(w|z)

Bold and underlined data in the original indicates the unique words that are captured by our model

5.4 Analysis on temporal dynamics

Hashtags, a community convention starting with a “#” sign, have been used extensively as annotations to represent events and topics on Twitter. We select several hashtags that act as indicators for certain events, where each hashtag is clearly associated with some event in April 2015. More specifically, we choose #Scotrail for “Scottish railway”, and wish to see whether the events can be discovered by the different models and how well these models present them. We believe these hashtags represent a broad range of social events and are therefore representative. A natural question is whether a model can identify topics that reflect the events behind the hashtags. We map hashtags onto the topics obtained by the models, and the top-ranked terms in these topics are examined to see whether the corresponding terms have any relationship with the underlying events.

To map the hashtags, we calculate the probability \(p(z|w) = \frac {p(w|z)p(z)}{{\sum }_{z^{\prime }}p(w|z^{\prime })p(z^{\prime })}\) (Hong et al. 2011), where p(w|z) is provided by the trained models and p(z) can easily be estimated from the counts. Intuitively, this probability tells us how likely a topic is to be selected given the term. We can then compare the time series of topics and hashtags to determine whether they are similar. Our hypothesis is that if they look similar as time series, the topic may be a good choice for explaining the events behind the hashtag. Note that we are not seeking an exact match here, since a topic has many more terms than a single hashtag and may explain multiple events. Moreover, we transform the volumes into probabilities. We plot the time series of the hashtags and the time series of the selected topics in Figs. 5, 6 and 7.
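A sketch of this mapping, assuming eta is the trained (T, W) topic-term matrix and p_z holds the empirical topic frequencies (illustrative names):

```python
import numpy as np

def hashtag_topic(eta, p_z, w):
    """Return the most likely topic for term w via Bayes' rule, plus p(z|w)."""
    joint = eta[:, w] * p_z            # p(w|z) p(z) for every topic z
    p_z_given_w = joint / joint.sum()  # normalise over topics
    return int(np.argmax(p_z_given_w)), p_z_given_w
```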
Fig. 5

Comparison of the probability p(t|z) between the HTMT (top) and the HTM (bottom) models against the hashtag Scotrail. The X-axis is the hour number, the Y-axis is the probability

Fig. 6

Comparison of the probability p(t|z) between the HTMT (top) and the HTM (bottom) models against the hashtag Glasgow. The X-axis is the hour number, the Y-axis is the probability

Fig. 7

Comparison of the probability p(t|z) between the HTMT (top) and the HTM (bottom) models against the hashtag VoteSNP. The X-axis is the hour number, the Y-axis is the probability

From the results, HTMT and HTM (red and blue curves) both smooth out the local fluctuations of the topics (grey curves) for the hashtags shown, while preserving the sharp peaks that may indicate a significant change of content on Twitter. Moreover, HTM tends to “overfit” the multiple spikes in the occurrence of “Scotrail” between hours 300 and 400, while HTMT better matches the peaks of the hashtags, indicating that the method better reflects real events. This may be due to the fact that the hyper-parameters β in HTMT are governed by the time-dependent functions of both sources, where the rise and fall of these values give good hints for the model when assigning topics to words, leading to improved performance on temporal dynamics. This also provides an additional answer to the last research question posed at the beginning of this section.

6 Conclusion

Mining topics from heterogeneous sources is still a challenge, especially when the temporal dynamics of the sources are intertwined. In this paper, we aimed to automatically analyse multiple correlated sources together with their corresponding temporal behaviour. The new model goes beyond the existing Source-LDA models because (i) it blends several topic models while preserving the characteristics of each source; and (ii) it associates each topic with a deconvolution function that characterises its topic importance and social response over time, which enables our topic model to capture some of the hidden contextual information of feeds.

There are several interesting directions for future work. First, it will be interesting to investigate more complex aspects of the temporal dynamics of data streams, e.g., sentiment shift. Additionally, in order to model and gain insight from real events, topics can be linked with a group of named entities, such that each topic can be largely explained by these entities and their relations in a knowledge graph, e.g., Freebase.3


Acknowledgements

We thank the anonymous reviewers for their helpful comments. We acknowledge support from the EPSRC-funded project A Situation Aware Information Infrastructure Project (EP/L026015) and from the Economic and Social Research Council [grant number ES/L011921/1]. This work was also partly supported by NSF grant #61572223. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the sponsor.

References

  1. Alsumait, L., Barbará, D., & Domeniconi, C. (2008). On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In ICDM '08.
  2. Asur, S., Huberman, B.A., Szabo, G., & Wang, C. (2011). Trends in social media: Persistence and decay. Available at SSRN 1755748.
  3. Blei, D.M., & Lafferty, J.D. (2006). Dynamic topic models. In ICML '06 (pp. 113–120).
  4. Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  5. Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P.K. (2010). Measuring user influence in Twitter: The million follower fallacy. ICWSM '10, 10(10–17), 30.
  6. Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105(41), 15649–15653.
  7. Doyle, G., & Elkan, C. (2009). Accounting for burstiness in topic models. In ICML '09.
  8. Dubey, A., Hefny, A., Williamson, S., & Xing, E.P. (2013). A nonparametric mixture model for topic modeling over time.
  9. Fiscus, J.G., & Doddington, G.R. (2002). Topic detection and tracking evaluation overview. In Topic detection and tracking (pp. 17–31).
  10. Gaikovich, K.P. (2004). Inverse problems in physical diagnostics. Nova Publishers.
  11. Ghosh, R., & Asur, S. (2013). Mining information from heterogeneous sources: A topic modeling approach. In Proceedings of the MDS Workshop at the 19th ACM SIGKDD (MDS-SIGKDD '13).
  12. Heinrich, G. (2005). Parameter estimation for text analysis. Technical report.
  13. Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 45, 256–269.
  14. Hong, L., Dom, B., Gurumurthy, S., & Tsioutsiouliklis, K. (2011). A time-dependent topic model for multiple text streams.
  15. Hong, L., Yin, D., Guo, J., & Davison, B.D. (2011). Tracking trends: Incorporating term volume into temporal topic models.
  16. Kirkeby, O., Nelson, P., Hamada, H., & Orduna-Bustamante, F. (1998). Fast deconvolution of multichannel systems using regularization. IEEE Transactions on Speech and Audio Processing, 6(2), 189–194.
  17. Lau, J.H., Collier, N., & Baldwin, T. (2012). On-line trend analysis with topic models: #twitter trends detection topic model online. In COLING (pp. 1519–1534).
  18. Leskovec, J., Backstrom, L., & Kleinberg, J. (2009). Meme-tracking and the dynamics of the news cycle.
  19. Li, T., Ma, S., & Ogihara, M. (2004). Entropy-based criterion in categorical clustering.
  20. Mallat, S. (1999). A wavelet tour of signal processing. Academic Press.
  21. Masada, T., Fukagawa, D., Takasu, A., Hamada, T., Shibata, Y., & Oguri, K. (2009). Dynamic hyperparameter optimization for Bayesian topical trend analysis. In CIKM '09.
  22. Miller, J.W., & Alleva, F. (1996). Evaluation of a language model using a clustered model backoff. In ICSLP '96 (Vol. 1, pp. 390–393).
  23. Minka, T. (2000). Estimating a Dirichlet distribution.
  24. Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents.
  25. Tsytsarau, M., Palpanas, T., & Castellanos, M. (2014). Dynamics of news events and social media reaction.
  26. Wang, C., Blei, D., & Heckerman, D. (2012). Continuous time dynamic topic models. arXiv:1206.3298.
  27. Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends.
  28. Zhai, C.X., Velivelli, A., & Yu, B. (2004). A cross-collection mixture model for comparative text mining. In KDD '04 (pp. 743–748).
  29. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., & Li, X. (2011). Comparing Twitter and traditional media using topic models. In ECIR '11 (pp. 338–349).

Copyright information

© The Author(s) 2017

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. School of Computing Science, University of Glasgow, Glasgow, UK
  2. School of Computing Science, University of Strathclyde, Glasgow, UK
