Abstract
An overwhelming number of news articles are published every day around the globe. Following the evolution of a news story is difficult because no mechanism exists for tracing back in time to discover and study the hidden relationships between relevant events in digital news feeds. Existing techniques for extracting meaningful information from a massive corpus rely on similarity search, which loops back myopically to the same topic and does not provide the insight needed to hypothesize the origin of a story, an origin that may be completely different from today's news. In this paper, we present an algorithm that mines historical news data to detect the origin of an event, segments the timeline into disjoint groups of coherent news articles, and outlines the most important documents in a timeline with a soft probability, providing a better understanding of how a story evolves. Qualitative and quantitative evaluations of our framework demonstrate that the algorithm discovers statistically significant and meaningful stories in reasonable time. Additionally, a case study on a set of news articles demonstrates that the algorithm's output holds promise for aiding the prediction of future entities (e.g., actors) in a story.
Notes
Evaluation questions available at: https://storyeval.herokuapp.com/.
Ethics declarations
Conflict of Interest Statement
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This material is based upon work supported by the U.S. Army Engineering Research and Development Center under Contract No. W9132V-15-C-0006.
About this article
Cite this article
Camacho Barranco, R., Boedihardjo, A.P. & Hossain, M.S. Analyzing evolving stories in news articles. Int J Data Sci Anal 8, 241–256 (2019). https://doi.org/10.1007/s41060-017-0091-9
DOI: https://doi.org/10.1007/s41060-017-0091-9