Creating Event-Centric Collections from Web Archives

  • Chapter
The Past Web

Abstract

Web archives are an essential information source for research on historical events. However, their large scale and heterogeneity make it difficult for researchers to access relevant event-specific materials. In this chapter, we discuss methods for creating event-centric collections from large-scale web archives. These methods are manifold: they may rely on manual curation, on search, or on focused crawling. We focus on crawl-based methods that identify relevant documents in and across web archives and include link networks as context in the resulting collections.

Author information

Corresponding author

Correspondence to Elena Demidova.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Demidova, E., Risse, T. (2021). Creating Event-Centric Collections from Web Archives. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_6

  • DOI: https://doi.org/10.1007/978-3-030-63291-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63290-8

  • Online ISBN: 978-3-030-63291-5

  • eBook Packages: Computer Science, Computer Science (R0)
