Toward comprehensive event collections

  • Federico Nanni
  • Simone Paolo Ponzetto
  • Laura Dietz


Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives. These sub-collections include not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base and then detecting their mentions in documents; these mentions are interpreted as indicators of relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.
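
The core idea of the abstract (take entities related to an event from a knowledge base, detect their mentions in documents, and treat those mentions as relevance signals) can be illustrated with a minimal sketch. This is not the authors' system: the tiny hand-written entity table, the example event, the whole-word string matching, and the names `RELATED_ENTITIES`, `count_mentions`, `score_document`, `build_collection`, and `min_entities` are all assumptions made for illustration. A real pipeline would presumably draw entities from a full knowledge base and use a proper entity linker and a learned ranking model.

```python
import re

# Hypothetical mini knowledge base: event -> related entities -> surface forms.
# Hand-written for illustration only; not taken from the paper.
RELATED_ENTITIES = {
    "Arab Spring": {
        "Tunisia": ["tunisia", "tunisian"],
        "Hosni Mubarak": ["mubarak"],
        "Tahrir Square": ["tahrir square"],
    }
}


def count_mentions(text, surface_forms):
    """Count whole-word, case-insensitive occurrences of any surface form."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(form) + r"\b", lowered))
        for form in surface_forms
    )


def score_document(text, event):
    """Return how many distinct related entities are mentioned in the text."""
    entities = RELATED_ENTITIES[event]
    return sum(1 for forms in entities.values() if count_mentions(text, forms) > 0)


def build_collection(documents, event, min_entities=2):
    """Keep documents mentioning at least `min_entities` related entities.

    Results are ordered by descending score, then by document id.
    """
    scored = [(score_document(text, event), doc_id) for doc_id, text in documents.items()]
    kept = [(score, doc_id) for score, doc_id in scored if score >= min_entities]
    return [doc_id for score, doc_id in sorted(kept, key=lambda p: (-p[0], p[1]))]
```

Note that a document can be selected here without ever naming the event itself, which is the point of using related entities: it lets the collection reach documents about premises and consequences, not only the core coverage.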


Event collections; Named events; Collection building; Entity query expansion; Web archives



This work was funded by a scholarship of the Eliteprogramm for Postdocs of the Baden-Württemberg Stiftung (project “Knowledge Consolidation and Organization for Query-specific Wikipedia Construction”) and by an AWS Research Award (Promotional credits name: EDU_R_FY2015_Q3_MannheimUniversity_Dietz). Furthermore, this work was also supported by the Junior-professor funding program of the Ministry of Science, Research and the Arts of the state of Baden-Württemberg (project “Deep semantic models for high-end NLP application”).



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Federico Nanni (1)
  • Simone Paolo Ponzetto (1)
  • Laura Dietz (2)

  1. Data and Web Science Group, University of Mannheim, Mannheim, Germany
  2. Department of Computer Science, University of New Hampshire, Durham, USA
