Towards extracting event-centric collections from Web archives

Abstract

Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, there are currently no access methods that go beyond the retrieval of individual, disconnected documents and help Web archive users efficiently access event-centric information in large-scale archives. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate efficient and intuitive access to information regarding past events. We address this problem by: (1) enabling users to define event-centric document collections intuitively through a Collection Specification; (2) developing a specialised extraction method that adapts focused crawling techniques to the Web archive setting; and (3) defining a function that judges the relevance of archived documents with respect to the Collection Specification, taking into account both their topical and temporal relevance. Our extended experiments on the German Web archive (covering a period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
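The three components named in the abstract (a Collection Specification, a focused-crawling-style extraction over archived pages, and a relevance function combining topical and temporal relevance) can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the keyword-overlap topical score, the exponential temporal decay, the product used to combine the two scores, the threshold, and the in-memory dictionary standing in for the archive are all assumptions made for illustration. A real archive also holds many captures per URL, which this sketch ignores.

```python
import heapq
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CollectionSpec:
    """Hypothetical Collection Specification: keywords, event time span, seed URLs."""
    keywords: frozenset  # lowercase terms; must be non-empty
    start: date
    end: date
    seeds: tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical archived capture of one page (one capture per URL here)."""
    url: str
    captured: date
    text: str
    links: tuple = ()


def topical_relevance(doc, spec):
    """Fraction of the specification keywords that occur in the document text."""
    words = set(doc.text.lower().split())
    return len(words & spec.keywords) / len(spec.keywords)


def temporal_relevance(doc, spec, decay_days=30.0):
    """1.0 inside the event span; exponential decay with distance outside it."""
    if spec.start <= doc.captured <= spec.end:
        return 1.0
    gap = min(abs((doc.captured - spec.start).days),
              abs((doc.captured - spec.end).days))
    return math.exp(-gap / decay_days)


def relevance(doc, spec):
    """Assumed combination: product of topical and temporal relevance."""
    return topical_relevance(doc, spec) * temporal_relevance(doc, spec)


def extract_collection(archive, spec, threshold=0.3, budget=100):
    """Best-first traversal of the archived link graph, starting from the seeds."""
    frontier = [(-1.0, url) for url in spec.seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(spec.seeds)
    collection = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        doc = archive.get(url)
        if doc is None:
            continue  # page was never archived
        budget -= 1
        score = relevance(doc, spec)
        if score >= threshold:
            collection.append((url, score))
        # Outgoing links inherit the score of the page that links to them.
        for link in doc.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collection


# Toy data, invented for this example only.
spec = CollectionSpec(frozenset({"election", "vote"}),
                      date(2013, 9, 1), date(2013, 9, 30), ("a",))
archive = {
    "a": Snapshot("a", date(2013, 9, 22), "Federal election results and vote counts", ("b", "c")),
    "b": Snapshot("b", date(2013, 9, 23), "Sports news"),
    "c": Snapshot("c", date(2013, 10, 30), "election aftermath vote analysis"),
}
result = extract_collection(archive, spec)
print(result)
```

In this toy run, page "b" is visited but left out because it matches no keyword, while page "c" is kept with a reduced score because it was captured after the event span ended. Prioritising links by the score of the linking page is one common focused-crawling heuristic; other prioritisation schemes would fit the same loop.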

Notes

  1. https://archive.org
  2. https://archive-it.org/
  3. https://blog.archive.org/2016/10/24/beta-wayback-machine-now-with-site-search/
  4. https://github.com/gerhardgossen/archive-recrawling
  5. http://lucene.apache.org/core/
  6. Code available at: https://github.com/gerhardgossen/dictionary-creator/
  7. https://github.com/gerhardgossen/archive-recrawling

Acknowledgements

This work was partially funded by the ERC under ALEXANDRIA (ERC 339233), H2020 under SoBigData (RIA 654024) and Cleopatra (H2020-MSCA-ITN-2018-812997), and BMBF under Data4UrbanMobility (02K15A040).

Author information

Corresponding author

Correspondence to Elena Demidova.

Cite this article

Gossen, G., Risse, T. & Demidova, E. Towards extracting event-centric collections from Web archives. Int J Digit Libr 21, 31–45 (2020). https://doi.org/10.1007/s00799-018-0258-6

Keywords

  • Web archives
  • Event-centric document collections
  • Focused crawling