Abstract

Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives that include not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.

Notes

  1. This article builds upon and expands our previous work [41].

  2. All gold standards are available at: http://federiconanni.com/event-collections/.

  3. http://trec-kba.org/kba-stream-corpus-2014.shtml.

  4. http://trec-car.cs.unh.edu/.

  5. See, for example, Nick Ruest's collection of the Bataclan Attack: http://ruebot.net/post/look-14939154-paris-bataclan-parisattacks-porteouverte-tweets.

  6. More info here: https://archive-it.org/organizations/89.

  7. For example, the UK Web Archive: https://www.webarchive.org.uk/ukwa/collection.

  8. The size of the collections varies with the topic, from 18 documents to more than 6000.

  9. https://www.ldc.upenn.edu/collaborations/past-projects/ace; http://www.timeml.org/tempeval/.

  10. For example: https://sites.google.com/site/cfpwsevents/.

  11. https://archive-it.org/collections/5541.

  12. https://sourceforge.net/p/lemur/wiki/RankLib/.

  13. https://catalog.ldc.upenn.edu/ldc2008t19.

  14. THOMAS was a digital collection maintained by the Library of Congress. It offered, among other materials, the official record of proceedings and debate since the 101st Congress (1989–1990). In 2016, THOMAS was fully replaced by Congress.gov, which provides full-text access to daily Congressional Record issues dating from 1995 (beginning with the 104th Congress).

  15. http://trec-car.cs.unh.edu/.

  16. As also remarked in [19].

  17. Cf. e.g., the first multi-party election in Algeria, 1991.

  18. See, for example, the Italian general election in 1996.

  19. http://www.nelda.co/.

  20. A list of all events examined in our work is available here: https://federiconanni.com/event-collections/.

  21. https://en.wikipedia.org/wiki/Category:Protests; https://en.wikipedia.org/wiki/Category:Economic_crises; https://en.wikipedia.org/wiki/Category:Government_crises.

  22. https://en.wikipedia.org/wiki/Category:20th-century_conflicts_by_year; https://en.wikipedia.org/wiki/Category:Civil_wars.

  23. We also tested TF-IDF weighted frequency, but we did not obtain any significant improvement over raw frequency.

  24. For example, the youth organization PORA is related to the aspects Protests and Internet usage of the event Orange Revolution and less to its Causes.

  25. Which corresponds to EvAsp-GloVe.

  26. http://trec.nist.gov/trec_eval/.

  27. A method marked with * is significantly better than all methods to its left.

  28. We detected and removed news duplicates from the initial pool of potentially relevant documents, before conducting the final evaluation.

References

  1. Abujabal, A., Berberich, K.: Important events in the past, present, and future. In: WWW (2015)

  2. Allan, J.: Introduction to topic detection and tracking. In: Allan, J. (ed.) Topic Detection and Tracking. The Information Retrieval Series, vol. 12. Springer, Boston, MA (2002)

  3. Allan, J., Lavrenko, V., Jin, H.: First story detection in TDT is hard. In: CIKM (2000)

  4. Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: SIGIR (1998)

  5. Aslam, J.A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., Sakai, T.: TREC 2013 temporal summarization. In: TREC (2013)

  6. Au Yeung, C.-M., Jatowt, A.: Studying how the past is remembered: towards computational history through large scale text mining. In: CIKM (2011)

  7. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: A Nucleus for a Web of Open Data. Springer, Berlin (2007)

  8. Bailey, S., Thompson, D.: Building the uk’s first public web archive. D-Lib 12, 1 (2006)

  9. Bethard, S.: Cleartk-timeml: a minimalist approach to tempeval 2013. In: SEM (2013)

  10. Blanco, R., Ottaviano, G., Meij, E.: Fast and space-efficient entity linking for queries. In: WSDM (2015)

  11. Cano, I., Singh, S., Guestrin, C.: Distributed non-parametric representations for vital filtering: UW at TREC KBA. In: TREC (2014)

  12. Ceroni, A., Gadiraju, U., Matschke, J., Wingert, S., Fisichella, M.: Where the event lies: predicting event occurrence in textual documents. In: SIGIR (2016)

  13. Dalton, J., Dietz, L., Allan, J.: Entity query feature expansion using knowledge base links. In: SIGIR (2014)

  14. Dietz, L., Gamari, B.: TREC CAR: A Data Set for Complex Answer Retrieval. Version 1.5 (2017)

  15. Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: EMNLP (2010)

  16. Ferragina, P., Scaiella, U.: Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In: CIKM (2010)

  17. Glavaš, G., Šnajder, J.: Construction and evaluation of event graphs. Nat. Lang. Eng. 21, 04 (2015)

  18. Gomes, D., Miranda, J., Costa, M.: A survey on web archiving initiatives. In: TPDL (2011)

  19. Gossen, G., Demidova, E., Risse, T.: Extracting event-centric document collections from large-scale web archives. In: TPDL (2017)

  20. Graus, D., Peetz, M.-H., Odijk, D., de Rooij, O., de Rijke, M.: yourhistory-semantic linking for a personalized timeline of historic events. In: Workshop: LinkedUp Challenge at OKCon (2013)

  21. Gupta, D.: Event search and analytics: detecting events in semantically annotated corpora for search and analytics. In: WSDM (2016)

  22. Hasibi, F., Balog, K., Bratsberg, S.E.: Exploiting entity linking in queries for entity retrieval. In: ICTIR (2016)

  23. Hockx-Yu, H.: Access and scholarly use of web archives. Alex. J. Nat. Int. Libr. Inf. 25, 1–2 (2014)

  24. Hyde, S.D., Marinov, N.: Which elections can be lost? Polit. Anal. 20, 191–210 (2012)

  25. Jatowt, A., Au Yeung, C.-M.: Extracting collective expectations about the future from large text collections. In: CIKM (2011)

  26. Kedzie, C., McKeown, K., Diaz, F.: Summarizing disasters over time. In: Workshop on Social Good at SIGKDD (2014)

  27. Kotov, A., Zhai, C.: Tapping into knowledge base for concept feedback: leveraging conceptnet to improve search results for difficult queries. In: WSDM (2012)

  28. Kuzey, E., Vreeken, J., Weikum, G.: A fresh look on knowledge bases: distilling named events from news. In: CIKM (2014)

  29. Lepore, J.: The cobweb: can the internet be archived? The New Yorker (2015)

  30. Lewis, D.: The trec-4 filtering track. In: TREC (1995)

  31. Li, H.: Learning to rank for information retrieval and natural language processing. Synth. Lect. Hum. Lang. Technol. 7, 3 (2014)

  32. Liu, X., Fang, H.: Latent entity space: a novel retrieval approach for entity-bearing queries. Inf. Retr. J. 18, 6 (2015)

  33. Lyman, P., Kahle, B.: Archiving digital cultural artifacts. D-Lib 4, 7 (1998)

  34. Menini, S., Sprugnoli, R., Moretti, G., Bignotti, E., Tonelli, S., Lepri, B.: Ramble on: tracing movements of popular historical figures. In: EACL (2017)

  35. Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38, 11 (1995)

  36. Milligan, I., Ruest, N., Lin, J.: Content selection and curation for web archiving: the gatekeepers vs. the masses. In: JCDL (2016)

  37. Mishra, A., Berberich, K.: Expose: exploring past news for seminal events. In: WWW (2015)

  38. Mishra, A., Berberich, K.: Event digest: a holistic view on past events. In: SIGIR (2016)

  39. Nanni, F., Mitra, B., Magnusson, M., Dietz, L.: Benchmark for complex answer retrieval. In: ICTIR (2017a)

  40. Nanni, F., Ponzetto, S.P., Dietz, L.: Entity relatedness for retrospective analyses of global events. In: NLP+CSS at WebSci (2016)

  41. Nanni, F., Ponzetto, S.P., Dietz, L.: Building entity-centric event collections. In: JCDL (2017b)

  42. Nanni, F., Ponzetto, S.P., Dietz, L.: Entity-aspect linking: providing fine-grained semantics of entities in context. In: JCDL (2018)

  43. Nanni, F., Zhao, Y., Ponzetto, S.P., Dietz, L.: Enhancing domain-specific entity linking in dh. In: DH (2017c)

  44. Ntoulas, A., Cho, J., Olston, C.: What’s new on the web? In: WWW (2004)

  45. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP (2014)

  46. Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW (2010)

  47. Raviv, H., Kurland, O., Carmel, D.: Document retrieval using entity-based language models. In: SIGIR (2016)

  48. Ristoski, P., Paulheim, H.: Rdf2vec: Rdf graph embeddings for data mining. In: ISWC (2016)

  49. Rollason-Cass, S., Reed, S.: Living movements, living archives: selecting and archiving web content during times of social unrest. N. Rev. Inf. Netw. 20, 1–2 (2015)

  50. Rovera, M., Nanni, F., Ponzetto, S.P., Goy, A.: Domain-specific named entity disambiguation in historical memoirs. In: CLiC-it (2017)

  51. Schich, M., Song, C., Ahn, Y.-Y., Mirsky, A., Martino, M., Barabási, A.-L., Helbing, D.: A network framework of cultural history. Science 345, 6196 (2014)

  52. Schuhmacher, M., Dietz, L., Ponzetto, S.P.: Ranking entities for web queries through text and knowledge. In: CIKM (2015)

  53. Singh, J., Nejdl, W., Anand, A.: Expedition: a time-aware exploratory search system designed for scholars. In: SIGIR (2016)

  54. Sprugnoli, R., Tonelli, S.: One, no one and one hundred thousand events: defining and processing events in an inter-disciplinary perspective. Nat. Lang. Eng. 23, 485 (2016)

  55. Tuck, J.: Web archiving in the UK: cooperation, legislation and regulation. Liber Q. 18, 3–4 (2008)

  56. Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In: Workshop on Wikipedia and Artificial Intelligence at AAAI (2008)

  57. Wolfreys, J.: Readings: Acts of Close Reading in Literary Theory. Edinburgh University Press, Edinburgh (2000)

  58. Xiong, C., Callan, J.: Esdrank: connecting query and documents through external semi-structured data. In: CIKM (2015)

Acknowledgements

This work was funded by a scholarship of the Eliteprogramm for Postdocs of the Baden-Württemberg Stiftung (project “Knowledge Consolidation and Organization for Query-specific Wikipedia Construction”) and by an AWS Research Award (Promotional credits name: EDU_R_FY2015_Q3_MannheimUniversity_Dietz). Furthermore, this work was also supported by the Junior-professor funding program of the Ministry of Science, Research and the Arts of the state of Baden-Württemberg (project “Deep semantic models for high-end NLP application”).

Author information

Corresponding author

Correspondence to Federico Nanni.

About this article

Cite this article

Nanni, F., Ponzetto, S.P. & Dietz, L. Toward comprehensive event collections. Int J Digit Libr 21, 215–229 (2020). https://doi.org/10.1007/s00799-018-0246-x
