Abstract
Web archives are an essential information source for research on historical events. However, their large scale and heterogeneity make it difficult for researchers to access relevant event-specific materials. In this chapter, we discuss methods for creating event-centric collections from large-scale web archives. These methods are manifold: they may require manual curation, adopt search, or deploy focused crawling. We focus on crawl-based methods that identify relevant documents in and across web archives and include link networks as context in the resulting collections.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Demidova, E., Risse, T. (2021). Creating Event-Centric Collections from Web Archives. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_6
DOI: https://doi.org/10.1007/978-3-030-63291-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63290-8
Online ISBN: 978-3-030-63291-5
eBook Packages: Computer Science, Computer Science (R0)