Abstract
Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. We address this problem by: (1) facilitating users to define event-centric document collections in an intuitive way through a Collection Specification; (2) development of a specialised extraction method that adapts focused crawling techniques to the Web archive settings; and (3) definition of a function to judge the relevance of the archived documents with respect to the Collection Specification taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Code available at: https://github.com/gerhardgossen/dictionary-creator/.
References
Aggarwal, C., Al-Garawi, F., Yu, P.S.: Intelligent crawling on the World Wide Web with arbitrary predicates. In: Proceedings of the 10th International World Wide Web Conference, WWW’01. pp. 96–105 (2001)
AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Detecting off-topic pages within timemaps in web archives. Int. J. Digit. Libr. 17(3), 203–221 (2016)
AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Generating stories from archived collections. In: Proceedings of the 2017 ACM Web Science Conference, WebSci’17, ACM, New York, NY, USA, pp. 309–318 (2017)
Berberich, K., Bedathur, S.: Temporal Diversification of Search Results. In: Proceedings of the Workshop on Time-Aware Information Access (TAIA 2013) (2013)
Bergmark, D., Lagoze, C., Sbityakov, A.: Focused crawls, tunneling, and digital libraries. In: Proceedings of the European Conference on Digital Libraries (ECDL’02) (2002)
Bouzeghoub, M.: A framework for analysis of data freshness. In: Proceedings of the Workshop on Information Quality in Information Systems (2004)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pp. 107–117 (1998)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific Web resource discovery. Comput. Netw. 31(11–16), 1623–1640 (1999)
Costa, M., Couto, F., Silva, M.: Learning temporal-dependent ranking models. In: Proceedings of the SIGIR’14 (2014)
Costa, M., Gomes, D., Silva, M.J.: The evolution of web archiving. Int. J. Digit. Libr. 18(3), 191–205 (2017)
Demidova, E., Barbieri, N., Dietze, S., Funk, A., Holzmann, H., Maynard, D., Papailiou, N., Peters, W., Risse, T., Spiliotopoulos, D.: Analysing and enriching focused semantic web archives for parliament applications. Fut. Intern. 6(3), 433–456 (2014)
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: Proceedings of the VLDB’00 (2000)
Dong, A., Chang, Y., Zheng, Z., Mishne, G., Bai, J., Zhang, R., Buchner, K., Liao, C., Diaz, F.: Towards recency ranking in web search. In: Proceedings of the WSDM’10 (2010)
Dong, H., Hussain, F.K.: SOF: a semi-supervised ontology-learning-based focused crawler. Concurr. Comput. Pract. Exp. 25(12), 1755–1770 (2013)
Ehrig, M., Maedche, A.: Ontology-focused crawling of web documents. In: Proceedings of the ACM SAC (2003)
Farag, M.M.G., Lee, S., Fox, E.A.: Focused crawler for events. Int. J. Digit. Libr. 19(1), 3–19 (2018)
Gossen, G., Demidova, E., Risse, T.: iCrawl: Improving the freshness of web collections by integrating social web and focused web crawling. In: Proceedings of the JCDL’15 (2015)
Gossen, G., Demidova, E., Risse, T.: The iCrawl Wizard—supporting interactive focused crawl specification. In: Proceedings of the ECIR’15 (2015)
Gossen, G., Demidova, E., Risse, T.: Analyzing Web archives through topic and event focused sub-collections. In: Proceedings of the WebSci’16, pp. 291–295 (May 2016)
Gossen, G., Demidova, E., Risse, T.: Extracting event-centric document collections from large-scale web archives. In: Proceedings of the 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, pp. 116–127 (2017)
Gottschalk, S., Demidova, E.: EventKG: A multilingual event-centric temporal knowledge graph. In: Proceedings of the ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, pp. 272–287 (2018)
Gottschalk, S., Demidova, E., Bernacchi, V., Rogers, R., Demidova, E.: Towards better understanding researcher strategies in cross-lingual event analytics. In: Proceedings of the 22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018 (2018)
Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2(4), 219–229 (1999)
Holzmann, H., Risse, T.: Accessing web archives from different perspectives with potential synergies. In: Researchers, Practitioners and Their Use of the Archived Web, London (2017). http://archivedweb.blogs.sas.ac.uk/files/2017/06/RESAW2017-HolzmannRisse-Accessing_web_archives_from_different_perspectives_with_potential_synergies.pdf
International Internet Presevation Consortium (IIPC): OpenWayback (2017). http://netpreserve.org/openwayback
Jiang, J., Song, X., Yu, N., Lin, C.Y.: FoCUS: learning to crawl web forums. IEEE Trans. Knowl. Data Eng. 25(6), 1293–1306 (2013)
Kanhabua, N., Nørvåg, K.: A comparison of time-aware ranking methods. In: Proceedings of the SIGIR’11 (2011)
Laranjeira, B., Moreira, V., Villavicencio, A., Ramisch, C., Finatto, M.J.: Comparing the quality of focused crawlers and of the translation resources obtained from them. In: Proceedings of the LREC’14 (2014)
Lehmann, J., Isele, R., Jakob, M., et al.: DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015)
Mohr, G., Kimpton, M., Stack, M., Ranitovic, I.: Introduction to Heritrix, an archival quality web crawler. In: Proceedings of the 4th International Web Archiving Workshop (2004)
Nguyen, T.N., Kanhabua, N., Niederée, C., Zhu, X.: A time-aware random walk model for finding important documents in web archives. In: Proceedings of the SIGIR’15 (2015)
Pant, G., Srinivasan, P.: Learning to crawl: comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005)
Pant, G., Srinivasan, P., Menczer, F.: Crawling the web. In: Web Dynamics. Springer, New York (2004)
Pereira, P., Macedo, J., Craveiro, O., Madeira, H.: Time-aware focused web crawling. In: Proceedings of the ECIR’14 (2014)
Qin, J., Zhou, Y., Chau, M.: Building domain-specific Web collections for scientific digital libraries. In: Proceedings of the JCDL’04 (2004)
Risse, T., Demidova, E., Gossen, G.: What do you want to collect from the web? In: Proceedings of the Building Web Observatories Workshop (BWOW) 2014 (2014)
Rospocher, M., et al.: Building event-centric knowledge graphs from news. Web Semant. 37, 132–151 (2016)
Souza, T., Demidova, E., Risse, T., Holzmann, H., Gossen, G., Szymanski, J.: Semantic URL analytics to support efficient annotation of large scale web archives. In: Proceedings of the First International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8–9, 2015. pp. 153–166 (2015)
Vrandečić, D.: Wikidata: A new platform for collaborative data collection. In: Proceedings of the 21st International Conference on World Wide Web. WWW’12 Companion, ACM, pp. 1063–1064 (2012)
Acknowledgements
This work was partially funded by the ERC under ALEXANDRIA (ERC 339233), H2020 under SoBigData (RIA 654024) and Cleopatra (H2020-MSCA-ITN-2018-812997), and BMBF under Data4UrbanMobility (02K15A040).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Gossen, G., Risse, T. & Demidova, E. Towards extracting event-centric collections from Web archives. Int J Digit Libr 21, 31–45 (2020). https://doi.org/10.1007/s00799-018-0258-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00799-018-0258-6