Abstract
With the rapid proliferation of Web 2.0, user-generated content (UGC), which is formed by the public to reflect their views and voice, presents rich and timely feedback on news events. Existing research either studies the common and private features between news and UGC, or describes the ability of news media to influence public opinion. However, in the current highly interactive media-user environment, investigating the public influence on news is of great significance to risk and credibility management for governments and enterprises. In this paper, we propose a novel topic-aware dynamic Granger test framework to quantify and characterize the public influence on news. In particular, we represent words and documents as distributed low-dimensional vectors, which facilitates the subsequent topic extraction. Then, a topic-aware dynamic strategy is proposed to transform news and UGC streams into topic series, and finally we apply the Granger causality test to investigate the public influence on news. Extensive experiments on 45 diverse real-world events demonstrate the effectiveness of the proposed method, and the results show promising prospects for predicting whether an event will be properly handled at its early stage.
1 Introduction
Social media presents rich and timely feedback on news events that take place around the world. According to a report from the Pew Research Center, 63 % of social users from Twitter and Facebook accessed news online, and roughly a quarter of them actively expressed their opinions on daily news through these social applications [2]. User-generated content not only fuels the news with different perspectives on different events, but also spurs additional news coverage of those events. On the other hand, reading social media and responding to the public voice promptly and objectively can help news media promote their influence on the public.
Example. During the event Asia-Pacific Economic Cooperation (APEC) 2014 in China, regional cooperation and the global economy were the topics expected to be reported by news media. In fact, however, social media users posted a significant amount of comments on APEC blue, the rare blue sky in Beijing during the summit due to the emission reduction campaign directed by the Chinese government. News media quickly followed and paid great attention to this topic, which was beyond the original news agenda: we found that 38 of the 176 news articles on Sina were related to APEC blue. Furthermore, how news responds to the public voice has a significant impact on government credibility. For example, two severe earthquakes struck China in 2014, in Yunnan and Sichuan. Reports on both covered the major topics, but in the Yunnan earthquake, news media responded to the public promptly, pictured a comprehensive image of event progress from the perspective of the public, and thus harvested better support from the public. Therefore, investigating the public influence on news is of great benefit to public opinion management and government credibility improvement.
Related research on joint analysis of news and UGC streams mainly follows three lines. The first studies event evolution within an individual news stream [1, 8]; e.g., Mei and Zhai adapt PLSA to extract topics in a news stream, then identify coherent topics over time, and finally analyze their life cycles [17]. The second focuses on simultaneously modeling multiple news streams, e.g., identifying the characteristics of social media and news media [26] or their common and private features [10, 23]. Both lines, however, pay little attention to the interactions between the two streams, which drive their co-evolution. The last comes from journalism and communication studies. It applies agenda setting theory [16] to analyze the interactions between different news agendas, and is often carried out via questionnaire surveys or manual work on a limited number of events. However, in the era of social media, agenda setting is not a one-way pattern from news to the public, but rather a complex and dynamic interaction.
Detecting the public influence on news poses unique technical challenges: (i) most research uses latent topics to model news and UGC, but the traditional word distribution representation [5, 17] suffers from the sparsity problem due to UGC's short and fragmented nature, making it difficult to track topic changes; (ii) how to detect cross-media influence links remains another problem, since the commonly-used measures (e.g., KL-divergence [12, 17]) often lead to heuristic results without statistical justification.
In this paper, we propose a novel topic-aware dynamic Granger test framework to automatically study the public influence on news media. To address the sparsity problem, we first represent words as low-dimensional word vectors through the skip-gram model [19], and further reform the word representation via sparse coding to capture the latent semantics of each dimension. Then we employ the Granger causality test [9] to detect the public influence on news in a theoretically grounded way. In particular, for a pair of topics extracted from UGC and news respectively, we propose a topic-aware dynamic strategy that chronologically splits the topic-related documents into disjoint bins with dynamic time intervals, calculates the topic representations based on the documents falling into each bin, and applies the multivariate Granger test to judge whether a UGC-to-news influence exists. Finally, we quantify the influence [12] based on the discovered influence links, and validate the influence measures by calculating their correlations with the professional, manual results provided by China Youth Online.
The main contributions can be summarized as follows:

- We address the problem of analyzing public influence on news through a unified Granger-based framework. Extensive experiments are conducted on 45 real-world events to demonstrate its effectiveness, and the results provide useful guidance on handling public hot topics in event reporting.

- We propose a novel textual feature extraction method. Instead of directly using the popular word2vec, it further maps words and documents into a low-dimensional space with each dimension denoting a more compact semantic, which facilitates topic extraction and representation.

- To track the temporal changes of topic pairs from news and UGC, we propose a novel topic-aware dynamic binning strategy, splitting both streams into chronological bins to achieve smooth topic representations for each bin.
The rest of the paper is organized as follows. In Sect. 2, we define the related concepts and the problem of influence analysis from UGC to news. Section 3 presents our proposed textual feature extraction method and Granger-based influence analysis. Our experimental results are reported in Sect. 4. Section 5 reviews the related literature, and finally Sect. 6 concludes this paper with future research directions.
2 Problem Definition
A particular event often brings forth two correlated streams: news articles from newsrooms form a news stream, while the public voice from different social applications converges into a UGC stream. Both the news stream \( NS \) and the UGC stream \( US \) are text streams, defined as follows:
Definition 1
Text Stream. A text stream \(TS=\langle s_{1},s_{2},\ldots ,s_{n} \rangle \) is a sequence of documents, where each \(s_{i}\ (i=1, 2, \ldots , n)\) is associated with a pair (\(d_{i}\), \(t_{i}\)): \(d_{i}\) is a document comprising a list of words, and \(t_{i}\) is the publishing time, in non-descending order, i.e., \(t_{i} \le t_{i+1}\).
It has been shown that news and UGC streams are mutually dependent [24]. The topic, which bridges these two different streams, plays an important role. In order to study the cross-stream interactions, we first define a topic as follows:
Definition 2
Topic. Conceptually, a topic z expresses an event-related subject/theme within a time period. Mathematically, a topic \(\mathbf {z}\) is characterized as a vector with each dimension denoting a word feature or a latent aspect. A topic z covers multiple documents (news articles or users' comments).
The interaction between media, public and government has been theoretically studied in journalism and communication research; e.g., agenda setting theory^{Footnote 1} evaluates the ability of mass media to influence the salience of topics for the public [15]. Nowadays, the proliferation of social media is changing the way news diffuses, i.e., the public may inversely affect or even drive the news media. It is useful to explore to what extent traditional news depends on social media and how long the public influence lasts; we therefore formulate the following research problem.
Definition 3
Analyzing Public Influence on News. Given a news stream \( NS \) and a UGC stream \( US \), influence analysis from UGC to news aims to discover a set of influence links \(\{(\mathbf {z}_{u},\mathbf {z}_{n},\zeta )\}\), where \(\mathbf {z}_{u}\in Z_{u}\) and \(\mathbf {z}_{n}\in Z_{n}\) are topics extracted from \( US \) and \( NS \) respectively, and \(\zeta \in \{0,1\}\) indicates whether \(\mathbf {z}_{u}\) influences (or contributes to) \(\mathbf {z}_{n}\).
From the definition above, topic representation/extraction and influence detection are the two major steps of this novel task. As mentioned in the introduction, existing methods suffer from various technical deficiencies, i.e., sparse representations and a lack of theoretical foundation. To tackle these issues, we focus on the following two problems: (i) given news and UGC streams, properly represent the documents and extract latent topics from both streams; (ii) given a topic pair \((\mathbf {z}_{u},\mathbf {z}_{n})\), determine whether a causality link exists and provide a statistical evaluation of how \(\mathbf {z}_{u}\) contributes to \(\mathbf {z}_{n}\).
3 Our Approach
In this paper, we propose a topic-aware dynamic Granger-based method to automatically detect the influence from UGC to news. Specifically, we develop a text representation method to better represent news and UGC in a low-dimensional space and extract their corresponding topics (Sect. 3.1). We then incorporate temporal information to transform news and UGC topics into serialized representations and apply the Granger causality test to detect the public influence on news (Sect. 3.2).
3.1 Text Representation and Topic Extraction
Text representation and topic extraction aim to properly represent the documents in \( NS \) and \( US \) and extract the topics \(Z_n\) and \(Z_u\). However, traditional TF-IDF representations suffer from the curse of dimensionality and the feature-independence assumption when dealing with short and fragmented UGC. These methods often ignore the semantic relationships among word features, which leads to sparse document representations with many zero feature values.
To alleviate the sparse representation, many methods have been proposed to unveil the hidden semantics of words, such as topic models (e.g., LDA [3]) and external knowledge enrichment (e.g., ESA [7]). However, topic models rely heavily on word co-occurrence, which cannot be accurately computed from short texts, while ESA requires plenty of high-quality knowledge, which is often not available in practice. In this paper, we propose a novel textual feature extraction pipeline, which gradually maps words and documents into a low-dimensional space where each dimension represents a unique semantic meaning. It consists of the following three steps:
Word Vectorization. Words are the basic elements of text, so we first transform words into continuous low-dimensional vectors. Let V denote the vocabulary of \( NS \) and \( US \); we employ the skip-gram model [19] to learn a mapping function \(V \rightarrow \mathbb {R}^M\), where \(\mathbb {R}^M\) is an M-dimensional vector space. Specifically, given a document \(s\in NS\cup US\) associated with a word sequence \(\langle w_{1},w_{2},\ldots , w_{W}\rangle \), the skip-gram model maximizes the co-occurrence probability among words that appear within a contextual window k:

\[ \max \frac{1}{W}\sum _{i=1}^{W}\;\sum _{-k\le j\le k,\, j\ne 0}\log p(w_{i+j}\mid w_{i}). \]
The probability \(p(w_{j}\mid w_{i})\) is formulated as a softmax over word vectors:

\[ p(w_{j}\mid w_{i})=\frac{\exp (\mathbf {w}_{j}^{\mathrm {T}}\mathbf {w}_{i})}{\sum _{w\in V}\exp (\mathbf {w}^{\mathrm {T}}\mathbf {w}_{i})}, \]
where \(\mathbf {w}_{i}\in \mathbb {R}^M\) is the M-dimensional representation of \(w_{i}\).
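As a toy numpy illustration of this softmax formulation (not the paper's training code; the function name and the use of separate input/output vector tables, common in skip-gram implementations, are our assumptions):

```python
import numpy as np

def skipgram_prob(center_idx, context_idx, W_in, W_out):
    """p(w_context | w_center): softmax of all output vectors scored
    against the center word's input vector."""
    scores = W_out @ W_in[center_idx]      # one score per vocabulary word
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_idx]
```

In practice one would train such vectors with an off-the-shelf implementation (e.g., gensim's Word2Vec with sg=1), which replaces the full softmax with negative sampling for efficiency.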
Mid-level Feature Learning. Intuitively, the document representation could be achieved via word vector composition. However, each dimension of a word vector represents a latent meaning, and a word's semantics scatter over almost all dimensions, so a simple composition of individual word vectors ignores the potential correlations between dimensions [20]. To prevent the information loss caused by simple composition, we reconstruct each word vector into a mid-level feature [4], where each dimension represents a unique dense semantic. In other words, we learn a \(\mathbb {R}^M \rightarrow \mathbb {R}^N\) mapping, which is typically a sparse coding problem with the objective:

\[ \min _{\mathbf {D},\, W^{*}}\; \sum _{i}\Vert \mathbf {w}_{i}-\mathbf {D}\mathbf {w}_{i}^{*}\Vert _{2}^{2}+\lambda \Vert \mathbf {w}_{i}^{*}\Vert _{1}, \]
where \(\mathbf {w}_{i}\in \mathbb {R}^M\) is the vector obtained in word vectorization; \(\mathbf {w}_{i}^{*}\in W^{*}\subseteq \mathbb {R}^N\) is the N-dimensional sparse representation (\(N > M\)); \(\mathbf {D}\) is an \(M\times N\) matrix with each column denoting a dense semantic; \({\Vert \cdot \Vert }_{1}\) denotes the \(\ell _{1}\)-norm of the input vector; and \(\lambda > 0\) is a hyperparameter controlling the sparsity of the resulting representation, i.e., a larger (or smaller) \(\lambda \) induces more (or less) sparseness in \(\mathbf {w}_{i}^{*}\). Because the vocabulary V usually contains tens of thousands of words, optimizing this non-convex problem would be very time consuming.
To solve the problem efficiently, we apply a two-step approximation method. First, we learn the matrix \(\mathbf {D}\) offline: we cluster the learned word vectors into N clusters through K-means, where each cluster denotes a compact semantic, and use the cluster centers as the columns of \(\mathbf {D}\). Second, based on the assumption that locality is more essential than sparsity [22], we select the K nearest neighbors in \(\mathbf {D}\) for each word vector \(\mathbf {w}_{i}\) based on Euclidean distance, and then adopt Locality-constrained Linear Coding (LLC) to learn its transformation \(\mathbf {w}_{i}^{*}\):

\[ \min _{\mathbf {c}_{i}}\Vert \mathbf {w}_{i}-\mathbf {B}_{i}\mathbf {c}_{i}\Vert ^{2} \quad \text {s.t.}\; \mathbf {1}^{\mathrm {T}}\mathbf {c}_{i}=1, \]
where \(\mathbf {B}_{i}\) consists of the K nearest neighbors of \(\mathbf {w}_{i}\) in \(\mathbf {D}\), and the entries of \(\mathbf {c}_{i}\) are placed at the corresponding positions of the N-dimensional \(\mathbf {w}_{i}^{*}\). The problem can be solved analytically by:

\[ \tilde{\mathbf {c}}_{i}=\mathbf {V}_{i}^{-1}\mathbf {1}, \qquad \mathbf {c}_{i}=\tilde{\mathbf {c}}_{i}/(\mathbf {1}^{\mathrm {T}}\tilde{\mathbf {c}}_{i}), \]
where \(\mathbf {V}_{i}={(\mathbf {B}_{i}-\mathbf {1}\mathbf {w}^{\mathrm {T}}_{i})}^\mathrm {T}(\mathbf {B}_{i}-\mathbf {1}\mathbf {w}^{\mathrm {T}}_{i})\) denotes the covariance matrix.
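A minimal numpy sketch of this two-step procedure, assuming the K-means centers have already been computed as the columns of D; the function name, the small ridge term `lam`, and the transpose conventions are our assumptions rather than the paper's exact implementation:

```python
import numpy as np

def llc_code(w, D, K=5, lam=1e-4):
    """Locality-constrained coding of word vector w (length M) over a
    dictionary D (M x N); returns a sparse length-N code w_star."""
    # step 1: pick the K nearest dictionary atoms by Euclidean distance
    d2 = ((D - w[:, None]) ** 2).sum(axis=0)
    nn = np.argsort(d2)[:K]
    B = D[:, nn].T                          # K x M local base B_i
    # step 2: analytic solution V = (B - 1 w^T)(B - 1 w^T)^T, c ~ V^{-1} 1
    Bc = B - w[None, :]
    V = Bc @ Bc.T
    c = np.linalg.solve(V + lam * np.eye(K), np.ones(K))
    c /= c.sum()                            # enforce the 1^T c = 1 constraint
    w_star = np.zeros(D.shape[1])
    w_star[nn] = c                          # scatter into the N-dim sparse code
    return w_star
```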
Document Representation and Topic Extraction. We employ spatial pooling to represent each document as an N-dimensional vector in \(\mathbb {R}^N\) based on the learned sparse word vectors. Given a document \(s_{i}\) consisting of W words with vector representations \(\mathbf {w}_{k}^{*}, k = 1,2,\ldots ,W\), we try two different pooling functions to obtain the document representation \(\mathbf {s}_{i}\):

\[ \text {average: } s_{ij}=\frac{1}{W}\sum _{k=1}^{W} w_{kj}^{*}, \qquad \text {root mean square: } s_{ij}=\sqrt{\frac{1}{W}\sum _{k=1}^{W} \left(w_{kj}^{*}\right)^{2}}, \]
where \(\mathbf {s}_{i}\) denotes the final representation of \(s_{i}\) and \(s_{ij}\ (j=1,\ldots ,N)\) is its jth entry. Note that different pooling functions make different assumptions about the underlying distribution. Once the document representation is complete, we feed the news and comment vectors into the K-means algorithm separately to obtain the topic sets \(Z_{n}\) and \(Z_{u}\). The resulting topics have more compact distributed representations than TF-IDF, which is convenient for further computation and analysis.
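The two pooling options can be sketched as follows (an illustration under our own naming; the paper does not spell out its implementation):

```python
import numpy as np

def pool_document(word_codes, mode="rms"):
    """Pool a document's W x N matrix of word codes into one N-dim vector."""
    X = np.asarray(word_codes, dtype=float)
    if mode == "avg":                        # average pooling
        return X.mean(axis=0)
    return np.sqrt((X ** 2).mean(axis=0))    # root-mean-square pooling
```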
3.2 Topic Influence Detection
Topic influence detection analyzes the relationship between news and UGC topics, which manifests as inter-stream influence. Normally, KL-divergence is employed to evaluate topic transitions within a news stream [13, 17] or topic interactions across streams [12], but the idea is heuristic, and the results are often restricted to time periods too short to track topic evolution.
Therefore, we perform influence detection in a more theoretical way through the Granger causality test^{Footnote 2}. Its basic idea is that a cause should be helpful in predicting the future values of a time series, beyond what can be predicted solely from that series' own historical values [9]. That is, a time series x is said to Granger-cause another time series y if and only if regressing y on past values of both y and x is statistically significantly more accurate than regressing y on past values of y alone.
Granger-Based Influence Detection. In this paper, Granger causality analysis is performed on two topics \(z_{u}\in Z_{u}\) and \(z_{n}\in Z_{n}\) to test whether \(z_{u}\) is a Granger cause of \(z_{n}\).
In the previous subsection, we obtained the news and UGC topic sets and their associated documents, but the Granger causality test requires two time series, so we need to turn the topics in \(Z_{n}\) and \(Z_{u}\) into time-varying topic series. For each \(\mathbf {z} \in Z_{n}\cup Z_{u}\), we represent it as \(\langle \mathbf {z}^{t}\rangle _{t=1}^{T}\), where \(\mathbf {z}^{t}\) is the status of topic z in the tth interval and T is the number of time intervals. A straightforward way is to partition both streams into disjoint slices with fixed time intervals (e.g., one day), i.e., equal-size binning. An alternative is equal-depth binning, i.e., evenly partitioning all documents into T bins. For an obtained partition \(\langle S^{t}\rangle _{t=1}^{T}\), the representation of topic z in the tth bin, \(\mathbf {z}^{t}\), can simply be computed by averaging the related document vectors within that bin:

\[ \mathbf {z}^{t}=\frac{1}{|S_{z}^{t}|}\sum _{\mathbf {s}\in S_{z}^{t}}\mathbf {s}, \]
where \(S_{z}^{t}\) denotes the documents within the tth bin that are related to topic z.
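Given a set of bin boundaries, the per-bin averaging can be sketched as below (names are our assumptions; the paper does not specify how empty bins are represented, so a zero vector is used here as a placeholder):

```python
import numpy as np

def topic_series(doc_vecs, times, edges):
    """Average the topic's document vectors inside each [edges[t], edges[t+1])
    interval, yielding a T x N topic series <z^t>."""
    series = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [k for k, t in enumerate(times) if lo <= t < hi]
        if idx:
            series.append(np.mean([doc_vecs[k] for k in idx], axis=0))
        else:                                # empty bin: zero-vector placeholder
            series.append(np.zeros(len(doc_vecs[0])))
    return np.vstack(series)
```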
Once we get the time-varying representations of the two target topics, \(\langle \mathbf {z}_{n}^{t}\rangle _{t=1}^{T}\) and \(\langle \mathbf {z}_{u}^{t}\rangle _{t=1}^{T}\), we first fit two vector autoregressive (VAR) models over these two series:

\[ \mathbf {z}_{n}^{t}=\sum _{i=1}^{q}\mathbf {A}_{i}\,\mathbf {z}_{n}^{t-i}+\mathbf {r}^{t}, \qquad (8) \]

\[ \mathbf {z}_{n}^{t}=\sum _{i=1}^{q}\mathbf {A}'_{i}\,\mathbf {z}_{n}^{t-i}+\sum _{i=1}^{q}\mathbf {B}_{i}\,\mathbf {z}_{u}^{t-i}+\mathbf {r}_{u}^{t}, \qquad (9) \]
where (8) predicts a news topic \(\mathbf {z}_{n}^{t}\) at time stamp t purely from its historical values, i.e., \(\mathbf {z}_{n}^{t-i}\), while (9) considers the historical values from both the news and UGC streams, i.e., \(\mathbf {z}_{n}^{t-i}\) and \(\mathbf {z}_{u}^{t-i}\), for prediction; q is a predefined maximum lag measuring how long the influence lasts; and \(\mathbf {r}_{u}\) and \(\mathbf {r}\) denote the residuals with and without considering the topic \(z_{u}\).
Then, to test whether (9) yields a statistically significantly better regression than (8), we apply an F-test (other similar tests could also be chosen). More specifically, we calculate the residual sums of squares RSS and \(RSS_{u}\), from which we obtain the F-statistic:

\[ F=\frac{(RSS-RSS_{u})/q}{RSS_{u}/(n-2q-1)}, \]

where \(n = T - q\) is the number of usable observations in the regressions.
Given a confidence coefficient \(\alpha \), we say \(\mathbf {z}_{u}\) Granger-causes \(\mathbf {z}_{n}\) if F is greater than the critical value \(F_{\alpha }\), i.e., \(\zeta = 1\) as defined in Sect. 2, and otherwise \(\zeta = 0\).
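A univariate sketch of this test (the paper applies the multivariate version to topic vectors; the ordinary-least-squares fitting, the function names, and the degrees-of-freedom bookkeeping are our assumptions):

```python
import numpy as np

def granger_F(y, x, q=2):
    """F-statistic testing whether q lags of series x improve the prediction
    of series y beyond y's own q lags."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    T = len(y)
    # lag matrix: column i holds the series shifted back by i steps
    lags = lambda s: np.column_stack([s[q - i:T - i] for i in range(1, q + 1)])
    Y = y[q:]
    Xr = np.column_stack([np.ones(T - q), lags(y)])   # restricted model, cf. (8)
    Xu = np.column_stack([Xr, lags(x)])               # unrestricted model, cf. (9)
    rss = lambda X: ((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2).sum()
    RSS, RSS_u = rss(Xr), rss(Xu)
    dof = (T - q) - (2 * q + 1)                       # observations minus parameters
    return ((RSS - RSS_u) / q) / (RSS_u / dof)
```

On a synthetic pair where y is driven by the previous value of x, the statistic for the x-to-y direction comes out far larger than for the reverse direction, which matches the intended reading of the test.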
However, both streams, especially the news articles, are often generated non-uniformly. Equal-size binning performs poorly on such streams, since it produces many empty intervals without any news or comments, while equal-depth binning often leads to extremely unbalanced time spans. Either empty intervals or unbalanced spans have side effects on the Granger test, making it fail or become meaningless.
Topic-aware Dynamic Granger Test. To address the uneven-distribution problem, we propose a topic-aware dynamic binning strategy to partition both streams into several disjoint intervals. The motivation for topic-awareness is that different topics follow their own unique patterns and show various distributions along the timeline, and the Granger causality test actually processes one topic pair, rather than the whole streams, at a time; thus a partition only needs to deal with the documents within the target topics from news and UGC respectively. The dynamic binning aims to alleviate the uneven-distribution problem. Let \(S_{z}\) denote the streaming documents associated with topic z and \(\langle S_{z}^{t}\rangle _{t=1}^{T}\) a partition result; we define the following two types of dispersion:

- \(dis_{amount}\): the difference between the largest and the smallest bin size, where bin size is defined as the number of contained documents;

- \(dis_{span}\): the difference between the largest and the smallest time span.
Our objective is to balance these two dispersions, namely,

\[ \min _{\langle S_{z}^{t}\rangle _{t=1}^{T}}\; dis_{amount}+dis_{span}. \qquad (11) \]

Due to the extremely unbalanced volumes of news and comments, we perform the optimization on the news stream and let the comments follow the resulting partition. The problem can be solved efficiently using dynamic programming (hence the name dynamic), and the optimal solution is always available [13].
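The paper's exact dynamic program is not reproduced here; as an illustrative stand-in under our own assumptions (a minimax objective over per-bin costs and a mixing weight `w`), contiguous bins balancing document count and time span can be found as follows:

```python
def minimax_bins(times, T, w=0.5):
    """Split sorted timestamps into T contiguous bins, minimizing the largest
    per-bin cost, where the cost mixes document count and time span.
    Illustrative stand-in for the paper's dispersion-balancing objective."""
    n = len(times)
    cost = lambda i, j: w * (j - i) + (1 - w) * (times[j - 1] - times[i])
    INF = float("inf")
    f = [[INF] * (T + 1) for _ in range(n + 1)]     # f[j][t]: best over first j docs
    cut = [[0] * (T + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for t in range(1, T + 1):
        for j in range(t, n + 1):
            for i in range(t - 1, j):               # last bin covers docs i..j-1
                c = max(f[i][t - 1], cost(i, j))
                if c < f[j][t]:
                    f[j][t], cut[j][t] = c, i
    edges, j = [n], n                               # recover the T+1 boundaries
    for t in range(T, 0, -1):
        j = cut[j][t]
        edges.append(j)
    return edges[::-1]
```

For example, six documents at times 0, 1, 2, 10, 11, 12 split into two bins naturally break at the large temporal gap, giving boundaries [0, 3, 6].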
4 Experiments
In this section, we first briefly introduce our datasets, and then present the detailed experimental results on topic extraction, topic influence detection and further analysis.
4.1 Dataset Description
To evaluate the effectiveness of the proposed methods, we prepare the following two kinds of datasets:
Datasets from Hou's paper [12]. These comprise five datasets covering four international events: the Federal Government Shutdown in both Chinese and English (cFGS/eFGS), Jang Sung-taek (Jang), the Boston Marathon Bombing (Boston) and the India Election (India). They were collected from influential news portals and social media platforms (i.e., Sina, New York Times, Twitter), and the detailed statistics are summarized in Table 1. These datasets are used to evaluate the effectiveness of our topic extraction and influence detection.
Datasets from China Youth Online (CYOL^{Footnote 3}). CYOL is one of the biggest and leading public-opinion analysis websites in China. It publishes a monthly opinion index based on questionnaire surveys of experts and scholars, civil servants, media people, opinion leaders and ordinary Internet users. The index includes five well-defined metrics: information coverage, activeness, response arrival rate, response recognition rate and satisfaction. For each event reported by CYOL in 2014, we crawled the news articles and comments from Sina^{Footnote 4} if a corresponding special issue existed. Finally, we collected 40 events, with 140 news articles and 12,849 comments per event on average. Due to space limits, the detailed statistics and data will be published later. We combined these datasets with the published opinion index to evaluate the influence measures automatically calculated by our approach.
4.2 Results for Topic Extraction
In this section, we report the evaluation on text representation and topic extraction, including the experiment setup (settings, baselines and metrics), comparison results and the parameter analysis.
Settings. We use the gensim^{Footnote 5} implementation of word2vec to learn word vectors with \(M=200\), and K-means to generate the transform matrix \(\mathbf {D}\) with \(N \in \{128, 256, 512, 1024\}\). For mid-level feature learning, we apply LLC with various numbers of nearest neighbors, \(K \in \{1,5,10,50\}\). The parameter \(\lambda \) is set to \(10^{-4}\) as the authors suggested.
Baselines. We use DeepDoc to denote our proposed text representation method, and compare it with a TF-IDF based method (TFIDF) and state-of-the-art topic models for news and UGC, i.e., the Document Comment Topic Model (DCT) [11] and the Cross Dependence Temporal Topic Model (CDTTM) [12].
Metrics. As evaluation metrics, we calculate the inner/inter-cluster distances for all topics. The inner-cluster distance (inner) is defined as the average distance between documents within a topic, where a smaller value indicates a more compact cluster. The inter-cluster distance (inter) is the average distance from one topic to all other topics, where a larger value indicates a better result. We also calculate their relative ratio (ratio), where a bigger value indicates better performance.
Comparison Results. Table 2 presents the comparison results, from which we can see: (i) macroscopically, our proposed DeepDoc outperforms the three baselines consistently; DCT is more stable than the other methods, while the TFIDF representation obtains the worst performance; (ii) under this measurement, CDTTM is not as sensitive to the stream distribution as described in [12], and DeepDoc does not have this problem since we do not include temporal information in clustering.
Parameter Analysis. We then choose the Boston dataset to investigate the effects of the number of neighbors, the pooling function and the number of matrix columns; the results are presented in Table 3. We draw the following conclusions:

- Number of neighbors (K). Regardless of the other settings, a small number of neighbors generally leads to better clustering results. This is a promising finding: the fewer neighbors used (i.e., the sparser the codes), the faster the computation runs and the less memory is consumed.

- Pooling function. Different choices of pooling function lead to very different clustering results. Root mean square pooling achieves better performance than average pooling under almost every setting, and the gap between the two pooling functions becomes more significant as the code sparseness decreases (larger K and smaller matrix).

- Number of matrix columns. This is the dimensionality of the transformed space. Intuitively, if the number of dimensions is too small, the mid-level representation loses discriminative power, but if it is too big, words from the same category of documents become less similar. Here we mainly focus on the trade-off between reasonably small and large sizes. As the results show, a larger size leads to better results when \(K>10\), while a smaller matrix is likely sufficient under higher levels of sparseness.
4.3 Results for Topic Influence Detection
To evaluate our proposed topic-aware dynamic Granger test method (TDG), we perform three series of experiments, namely, (1) an overall comparison with the KL-divergence based method in [12], (2) a comparison of different binning methods, and (3) an analysis of the effect of the maximum lag.
TDG vs. KL Divergence. Hou et al. evaluated their method on manually labeled data, and it achieved results comparable to the human annotation. To make the comparison fair, we compare the Granger results at \(\alpha =0.9/0.8\) with their top 10 %/20 % links (Hou et al. included links with distance less than the median value). Through manual evaluation, the Granger test achieves 94 % precision while KL gets only 82 %, indicating that our method significantly outperforms theirs. This comes as no surprise: their KL-divergence based method only finds similar patterns in the other stream (it assumes similar topics share similar patterns along the timeline, which may not hold), while our Granger-based method discovers the UGC topics that are most useful for predicting the target news topic and are thus more likely to influence the news.
Dynamic Binning vs. Equal-Size Binning. Table 4 shows the number of detected Granger causal links under different time split methods. We find that: (i) equal-size: equal-size binning gives the worst performance because the streams (especially the news stream) are distributed non-uniformly, which often leads to zero vectors for bins with no documents; though mean linear interpolation is employed to deal with the zeros, the results are still unsatisfactory; (ii) dynamic: dynamic binning optimizes (11) over the whole news stream without distinguishing topics; it can handle the unevenly distributed streams to some extent and thus finds more influence links; (iii) topic: since our proposed method tests one pair of topics at a time and different topics may follow different patterns, dynamic binning applied to the whole streams might not perform well on every topic pair; therefore, the topic-aware binning further improves the performance.
How Long the Influence Lasts. To choose a proper maximum lag q (i.e., how many historical values are included in the regression), we select five topic pairs and conduct the Granger causality test with the maximum lag ranging from 1 to 10, determining the value that achieves the best F-statistic (divided by \(F_{0.8}\) due to the different time spans). Table 5 shows the results for lags 3 to 7; we observe that \(F/F_{0.8}\) increases initially and reaches a stable level at \(q=5\). We therefore execute all Granger tests with q set to 5. Note that \(q=5\) does not mean 5 days, since topic-aware binning is used for the stream split; the average time difference is actually about 3.2 days, which tells us that news and UGC in the previous 3 days have the most influence on the current news report.
4.4 Influence Usage Analysis
This experiment examines whether our automatically obtained results are consistent with the objective CYOL public opinion index. Specifically, from our discovered influence links \(\{(z_u,z_n)\}\), we quantify the public influence on news through the news response rate (NRR), promptness (NRP) and effect (NRE) as defined in [12]. Their comparable measures in the CYOL Public Opinion Index are information coverage (IC), response activity (RA) and satisfaction (SA). We compute correlation coefficients for the three pairs of measures, NRR-IC, NRP-RA and NRE-SA, where higher correlations indicate better results. For comparison, we use LDA+KL-divergence and Hou's method (CDTTM+KL) as baselines. We also try using only the first half of the event data for analysis (Ours\(^\frac{1}{2}\)) to test whether the method helps predict future influence. Table 6 shows the comparison results.
As shown in Table 6, our method achieves higher correlations with the CYOL measures than the other two methods. Furthermore, we notice that with only the first half of the event data, our method achieves results comparable to those on all the data. This implies that it can be used to predict whether an event will be handled properly at an early stage.
Case Review. Now we review the events mentioned in the introduction. The APEC 2014 summit is a good example of social media influencing news media. Besides APEC blue, we identified another topic beyond the scheduled ones, i.e., tourist, which covered the leisure activities of the dignitaries' wives, especially their clothing. The news media initially reported these activities only casually, but the public was very enthusiastic about the first ladies' tours and discussed their clothing extensively; to satisfy people's curiosity, the news then presented systematic introductions of the first ladies' activities and dress. We also compare the news responses in the two earthquakes: reports on both covered the major topics, and both NRRs are quite high (Yunnan 84 % and Sichuan 82 %); but in the Yunnan earthquake, the news media responded to the public more promptly, with an NRP much smaller than that in Sichuan (roughly 0.8 days vs. 1.4 days). The final satisfaction shows that it is very important to properly handle heatedly-discussed topics. Our analysis can summarize which topics the news should respond to and when, thus benefiting public opinion management.
5 Related Work
Our work is related to three lines of research as follows:
5.1 Distributed Text Representation
Representing words in a continuous vector space has been an appealing pursuit since 1986 [25]. Recently, Mikolov et al. developed efficient methods to learn high-quality word vectors [19], and a host of follow-up achievements have been made on phrase or document representation, such as paragraph-to-vector [20]. Different from these attempts, we borrow the state-of-the-art feature extraction pipeline from computer vision [4] to represent words and documents in a new space where each dimension denotes a more compact semantic than directly using word2vec.
5.2 Social News Analysis and Topic Evolution
The proliferation of social media encourages researchers to study its relationship with traditional news media; e.g., Zhao et al. employed Twitter-LDA to analyze Twitter and the New York Times and found that Twitter actively helped spread news of important world events although it showed low interest in them [26]. Petrovic et al. examined the relation between Twitter and newswire and concluded that neither stream consistently leads the other on major events [21]. Beyond the common and specific characteristics of news and social media, we pay more attention to the cross-stream interaction.
As for topic evolution, Mei et al. addressed the problem of discovering evolutionary theme patterns from a single text stream [17], and Hu et al. modeled topic variations and identified topic breakpoints in news streams [13]. Wang et al. aimed at finding bursty topics from coordinated text streams based on their proposed mixture model [24]. Lin et al. formalized the evolution of a topic and its latent diffusion paths in a social community as a joint inference problem, and solved it through a mixture model (for text generation) and a Gaussian Markov Random Field (for user-level social influence) [14]. In this paper, we study the interplay of news and UGC within specific events, trying to analyze the cross-media influence and figure out how they co-evolve over time.
5.3 Agenda Setting and Granger Causality
Agenda-setting is the creation of public awareness and concern of salient issues by the news media; McCombs and Shaw discussed this function of mass media in 1972 [16]. Many researchers have studied the interactions between the public agenda and the news agenda, e.g., Meraz employed time series analysis to measure the influence between political blogs and news media [18]. Our work falls into second-level agenda-setting (also called attribute agenda-setting), and the major advantage of our framework is that the attributes need not be predefined: we extract the latent topics automatically.
The Granger causality test [9] is a statistical hypothesis test for determining whether one time series is useful in forecasting another. It has been utilized in many areas for causality analysis and prediction, e.g., Cheng et al. adapted it to model temporal dependence in large-scale time series data [6], and Chang et al. used it in Twitter user-influence modeling [5]. In this paper, we apply agenda-setting theory and the multivariate Granger test to automatically analyze how social media influences traditional news.
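To make the test concrete, the following is a minimal bivariate Granger F-test sketched with plain NumPy. The function name `granger_f`, the lag order, and the synthetic data are our own illustrative assumptions, not the paper's implementation; the paper's multivariate version conditions on additional series in the same spirit.

```python
import numpy as np

def granger_f(y, x, p=2):
    """F-statistic testing whether the past p values of x help predict y
    beyond y's own past (null hypothesis: x does not Granger-cause y)."""
    n = len(y)
    Y = y[p:]
    # Lagged regressors: column k holds the series shifted back by k steps.
    lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    ones = np.ones((n - p, 1))
    X_r = np.hstack([ones, lags_y])          # restricted model: y's past only
    X_f = np.hstack([ones, lags_y, lags_x])  # full model: plus x's past
    rss = lambda X: float(np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2))
    rss_r, rss_f = rss(X_r), rss(X_f)
    dof = (n - p) - X_f.shape[1]             # residual degrees of freedom
    return ((rss_r - rss_f) / p) / (rss_f / dof)

# Synthetic check: y is driven by the previous value of x, so the
# F-statistic for "x causes y" should dwarf the reverse direction.
rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = np.concatenate(([0.0], 0.8 * x[:-1])) + 0.1 * rng.standard_normal(300)
print(granger_f(y, x), granger_f(x, y))
```

A large F in one direction and a small F in the other is exactly the asymmetry the framework looks for between UGC and news topic series.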
6 Conclusion
In this paper, we analyze the public influence on news through a Granger-based framework: we first represent words and documents in a distributed low-dimensional space and extract topics from the news and UGC streams, then dynamically split the streams to obtain evolving topic representations, on which we employ the Granger causality test to detect influence links. Experiments on real-world events demonstrate the effectiveness of the proposed methods, and the results show good prospects for predicting whether an event can be properly handled.
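As a rough, stdlib-only sketch of the stream-to-series step in the pipeline above (the function name, the fixed windowing, and the proportion normalization are illustrative assumptions, not the authors' dynamic splitting strategy):

```python
from collections import Counter

def topic_series(docs, n_topics, window):
    """docs: list of (timestamp, topic_id) pairs for one stream.
    Returns one topic-proportion vector per fixed time window; the
    per-topic columns of these vectors are the series that a Granger
    test would compare across the news and UGC streams."""
    start = min(t for t, _ in docs)
    end = max(t for t, _ in docs)
    counts = [Counter() for _ in range((end - start) // window + 1)]
    for t, topic in docs:
        counts[(t - start) // window][topic] += 1
    series = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero in empty windows
        series.append([c[k] / total for k in range(n_topics)])
    return series

# Tiny example: across three windows, topic 1 gradually overtakes topic 0.
ugc = [(0, 0), (1, 0), (1, 1), (2, 1)]
print(topic_series(ugc, n_topics=2, window=1))
```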
It should be noted that the Granger test attempts to capture an interesting aspect of causality, but is certainly not meant to capture all of it; for example, it has little to say about situations in which there is a hidden common cause behind the two streams. In future work, we will try to address this important but challenging issue.
References
Ahmed, A., Xing, E.: Timeline: a dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 20–29 (2010)
Barthel, M., Shearer, E., Gottfried, J., Mitchell, A.: The evolving role of news on Twitter and Facebook. Technical report, Pew Research Center, July 2015
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2559–2566 (2010)
Chang, Y., Wang, X., Mei, Q.Z., Liu, Y.: Towards Twitter context summarization with user influence models. In: Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pp. 527–536 (2013)
Cheng, D., Bahadori, M.T., Liu, Y.: FBLG: a simple and effective approach for temporal dependence discovery from time series data. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 382–391 (2014)
Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: AAAI, vol. 6, pp. 1301–1306 (2006)
Gohr, A., Hinneburg, A., Schult, R., Spiliopoulou, M.: Topic evolution in a stream of documents. In: Proceedings of the 9th SIAM International Conference on Data Mining, pp. 859–872 (2009)
Granger, C.W.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica: J. Econometric Soc. 37, 424–438 (1969)
Hong, L., Dom, B., Gurumurthy, S., Tsioutsiouliklis, K.: A time-dependent topic model for multiple text streams. In: Proceedings of the 17th ACM International Conference on Knowledge Discovery in Data Mining, pp. 832–840 (2011)
Hou, L., Li, J., Li, X.L., Qu, J., Guo, X., Hui, O., Tang, J.: What users care about: a framework for social content alignment. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 1401–1407 (2013)
Hou, L., Li, J., Li, X.L., Su, Y.: Measuring the influence from user-generated content to news via cross-dependence topic modeling. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications, pp. 125–141 (2015)
Hu, P., Huang, M., Xu, P., Li, W., Usadi, A.K., Zhu, X.: Generating breakpoint-based timeline overview for news topic retrospection. In: Proceedings of the 11th IEEE International Conference on Data Mining, pp. 260–269 (2011)
Lin, C.X., Mei, Q., Han, J., Jiang, Y., Danilevsky, M.: The joint inference of topic diffusion and evolution in social communities. In: Proceedings of the 11th IEEE International Conference on Data Mining, pp. 378–387 (2011)
Lippmann, W.: Public Opinion. Transaction Publishers, New Jersey (1946)
McCombs, M., Shaw, D.: The agenda-setting function of mass media. Public Opinion Quart. 36, 176–187 (1972)
Mei, Q., Zhai, C.: Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining, pp. 198–207 (2005)
Meraz, S.: Is there an elite hold? Traditional media to social media agenda setting influence in blog networks. J. Comput. Mediated Commun. 14(3), 682–707 (2009)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the ICLR (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Petrovic, S., Osborne, M., McCreadie, R., Macdonald, C., Ounis, I., Shrimpton, L.: Can Twitter replace newswire for breaking news? In: Proceedings of the 7th International AAAI Conference on Weblogs and Social Media (2013)
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3360–3367 (2010)
Wang, X., Zhang, K., Jin, X., Shen, D.: Mining common topics from multiple asynchronous text streams. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pp. 192–201 (2009)
Wang, X., Zhai, C., Hu, X., Sproat, R.: Mining correlated bursty topic patterns from coordinated text streams. In: Proceedings of the 13th ACM International Conference on Knowledge Discovery in Data Mining, pp. 784–793 (2007)
Williams, D., Hinton, G.: Learning representations by back-propagating errors. Nature 323, 533 (1986)
Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing Twitter and traditional media using topic models. In: Proceedings of the 33rd European Conference on Information Retrieval, pp. 338–349 (2011)
Acknowledgement
This work is supported by the 973 Program (No. 2014CB340504), NSFC-ANR (No. 61261130588), the NSFC key project (No. 61533018), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), and the THU-NUS NExT Co-Lab. We thank Prof. Chua Tat-Seng, Dr. Hanwang Zhang, and Xindi Shang from the National University of Singapore for discussions on text representation.
Â© 2016 Springer International Publishing AG
Hou, L., Li, J., Li, X.L., Jin, J. (2016). Detecting Public Influence on News Using Topic-Aware Dynamic Granger Test. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9852. Springer, Cham. https://doi.org/10.1007/978-3-319-46227-1_21
Print ISBN: 978-3-319-46226-4
Online ISBN: 978-3-319-46227-1