Towards extracting event-centric collections from Web archives

Gossen, Gerhard; Risse, Thomas; Demidova, Elena

doi:10.1007/s00799-018-0258-6

Published: 27 October 2018

Volume 21, pages 31–45 (2020)

International Journal on Digital Libraries


Abstract

Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, there are currently no access methods that go beyond the retrieval of individual, disconnected documents and help Web archive users efficiently access event-centric information in large-scale archives. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate efficient and intuitive access to information regarding past events. We address this problem by: (1) enabling users to define event-centric document collections in an intuitive way through a Collection Specification; (2) developing a specialised extraction method that adapts focused crawling techniques to the Web archive setting; and (3) defining a function that judges the relevance of archived documents with respect to the Collection Specification, taking into account both the topical and the temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
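
The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `CollectionSpec` fields, the cosine-based topical score, the exponential temporal decay, the product used to combine both scores, and the in-memory `archive` dictionary are all assumptions chosen only to make the idea concrete.

```python
import heapq
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class CollectionSpec:
    """Hypothetical stand-in for a Collection Specification."""
    keywords: list[str]
    start: date                # assumed start of the event time span
    end: date                  # assumed end of the event time span
    seed_urls: list[str]
    decay_days: float = 30.0   # assumed tolerance outside the time span


def topical_relevance(text: str, spec: CollectionSpec) -> float:
    """Cosine similarity between document term counts and the keywords."""
    doc = Counter(text.lower().split())
    query = Counter(k.lower() for k in spec.keywords)
    dot = sum(doc[term] * weight for term, weight in query.items())
    norm = math.sqrt(sum(v * v for v in doc.values())) * math.sqrt(
        sum(v * v for v in query.values())
    )
    return dot / norm if norm else 0.0


def temporal_relevance(captured: date, spec: CollectionSpec) -> float:
    """1.0 inside the time span, exponential decay outside (assumption)."""
    if spec.start <= captured <= spec.end:
        return 1.0
    if captured < spec.start:
        gap = (spec.start - captured).days
    else:
        gap = (captured - spec.end).days
    return math.exp(-gap / spec.decay_days)


def relevance(text: str, captured: date, spec: CollectionSpec) -> float:
    """Combine both signals; the product is an illustrative choice."""
    return topical_relevance(text, spec) * temporal_relevance(captured, spec)


def extract_collection(archive, spec, threshold=0.1, max_docs=100):
    """Focused crawl over an archive given as url -> (text, date, outlinks)."""
    frontier = [(-1.0, url) for url in spec.seed_urls]  # max-heap via negation
    heapq.heapify(frontier)
    seen = set(spec.seed_urls)
    collection = []
    while frontier and len(collection) < max_docs:
        _, url = heapq.heappop(frontier)
        if url not in archive:
            continue
        text, captured, outlinks = archive[url]
        score = relevance(text, captured, spec)
        if score < threshold:
            continue  # irrelevant page: keep it out and do not follow its links
        collection.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Outlinks inherit the score of the page that links to them.
                heapq.heappush(frontier, (-score, link))
    return collection
```

Called with an `archive` dictionary and a `CollectionSpec`, `extract_collection` returns `(url, score)` pairs in the order the crawl visited them. Pages below the threshold are skipped together with their outlinks, which is the simplest possible focusing policy; a real system would likely be more tolerant.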



Acknowledgements

This work was partially funded by the ERC under ALEXANDRIA (ERC 339233), H2020 under SoBigData (RIA 654024) and Cleopatra (H2020-MSCA-ITN-2018-812997), and BMBF under Data4UrbanMobility (02K15A040).

Author information

Authors and Affiliations

L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
Gerhard Gossen & Elena Demidova
University Library J. C. Senckenberg, Goethe-University Frankfurt am Main, Frankfurt, Germany
Thomas Risse

Corresponding author

Correspondence to Elena Demidova.

About this article

Cite this article

Gossen, G., Risse, T. & Demidova, E. Towards extracting event-centric collections from Web archives. Int J Digit Libr 21, 31–45 (2020). https://doi.org/10.1007/s00799-018-0258-6

Received: 18 January 2018
Revised: 16 October 2018
Accepted: 17 October 2018
Published: 27 October 2018
Issue Date: March 2020
DOI: https://doi.org/10.1007/s00799-018-0258-6
