Analyzing evolving stories in news articles

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

An overwhelming number of news articles are published around the globe every day. Following the evolution of a news story is difficult because no mechanism exists for tracking back in time to discover and study the hidden relationships between relevant events in digital news feeds. Existing techniques for extracting meaningful information from a massive corpus rely on similarity search, which loops back myopically to the same topic and does not provide the insight needed to hypothesize the origin of a story, an origin that may be completely different from today's news. In this paper, we present an algorithm that mines historical news data to detect the origin of an event, segments the timeline into disjoint groups of coherent news articles, and outlines the most important documents in a timeline with a soft probability to provide a better understanding of the evolution of a story. Qualitative and quantitative evaluations of our framework demonstrate that the algorithm discovers statistically significant and meaningful stories in reasonable time. Additionally, a case study on a set of news articles demonstrates that the output of the algorithm holds promise for aiding the prediction of future entities (e.g., actors) in a story.
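To make the idea of segmenting a timeline into disjoint groups of coherent articles concrete, the following is a minimal sketch and not the algorithm presented in the paper, whose details are not given in this abstract. It assumes a simple stand-in criterion: articles are in chronological order, and a new group starts whenever the bag-of-words cosine similarity between consecutive articles drops below a cutoff. The function names `cosine` and `segment_timeline`, the `threshold` parameter, and the sample headlines are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment_timeline(articles: list[str], threshold: float = 0.2) -> list[list[int]]:
    """Split chronologically ordered articles into disjoint, contiguous groups.

    A new group starts whenever an article's similarity to its predecessor
    falls below `threshold` (a hypothetical tuning parameter, not from the paper).
    Returns lists of article indices, one list per group.
    """
    if not articles:
        return []
    bags = [Counter(text.lower().split()) for text in articles]
    segments = [[0]]
    for i in range(1, len(bags)):
        if cosine(bags[i - 1], bags[i]) < threshold:
            segments.append([i])      # coherence broke: open a new group
        else:
            segments[-1].append(i)    # still coherent: extend the current group
    return segments


# Invented headlines, used only to exercise the sketch.
timeline = [
    "protests erupt in the capital over fuel prices",
    "fuel prices protests spread across the capital",
    "government announces cabinet reshuffle",
    "cabinet reshuffle draws criticism from opposition",
]
segments = segment_timeline(timeline)
print(segments)
```

Each article index appears in exactly one group, which is what makes the groups disjoint. A purely similarity-based split like this one is the kind of baseline the abstract argues is insufficient for tracing a story back to its origin, so it illustrates only the segmentation output, not the origin detection or the soft document probabilities.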

Notes

  1. Evaluation questions available at: https://storyeval.herokuapp.com/.

Author information

Corresponding author

Correspondence to Roberto Camacho Barranco.

Ethics declarations

Conflict of Interest Statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

This material is based upon work supported by the U.S. Army Engineering Research and Development Center under Contract No. W9132V-15-C-0006.

About this article

Cite this article

Camacho Barranco, R., Boedihardjo, A.P. & Hossain, M.S. Analyzing evolving stories in news articles. Int J Data Sci Anal 8, 241–256 (2019). https://doi.org/10.1007/s41060-017-0091-9

  • DOI: https://doi.org/10.1007/s41060-017-0091-9
