Revealing Historical Events Out of Web Archives

  • Conference paper
Digital Libraries for Open Knowledge (TPDL 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11057)

Abstract

As the living Web expands, the worldwide volume of Web archives keeps increasing, making it difficult to identify relevant archived content. Here we propose an application for detecting historical events in a corpus of Web archives. It is based on an entity called the Web fragment: a semantic and syntactic subset of a given Web page. A distinctive feature of the Web fragment is that it is indexed by its edition date instead of its archiving date. We apply our framework to an archived Moroccan forum and observe how it reacted to the Arab Spring at the end of 2010.
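To make the difference between the two dates concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `WebFragment` class, its field names, the `monthly_counts` helper, and the toy data are all hypothetical, chosen only to illustrate why indexing by edition date can expose a burst of activity that the archiving date hides.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WebFragment:
    """Hypothetical record: a piece of a page plus two distinct dates."""
    text: str
    edition_date: date    # when the content was written or last edited
    archiving_date: date  # when the crawler captured the page


def monthly_counts(fragments, key):
    """Count fragments per (year, month) using the date selected by `key`."""
    return Counter((key(f).year, key(f).month) for f in fragments)


# Toy data, invented for illustration: posts written over several months
# but all captured by a single crawl.
crawl_day = date(2011, 3, 1)
fragments = [
    WebFragment("post a", date(2010, 11, 3), crawl_day),
    WebFragment("post b", date(2010, 12, 18), crawl_day),
    WebFragment("post c", date(2010, 12, 20), crawl_day),
    WebFragment("post d", date(2010, 12, 29), crawl_day),
    WebFragment("post e", date(2011, 1, 15), crawl_day),
]

by_edition = monthly_counts(fragments, lambda f: f.edition_date)
by_archiving = monthly_counts(fragments, lambda f: f.archiving_date)

# The month with the most fragments, according to edition dates.
peak_month = max(by_edition, key=by_edition.get)
```

In this toy example every fragment shares one archiving date, so a timeline built on archiving dates collapses into a single month, whereas the edition dates spread the same fragments over three months with a visible peak. A real event detector would need a more robust criterion than a simple maximum.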

Notes

  1. https://archive.org/Web/.

  2. Publicly available at http://maps.e-diasporas.fr/index.php?focus=map&map=5&section=5.

  3. Open source and available at https://github.com/lobbeque/archive-miner and https://github.com/lobbeque/peastee.

  4. See http://hadoop.apache.org/, http://spark.apache.org/ and http://lucene.apache.org/solr/.

  5. See the accompanying video https://youtu.be/snW4O-usyTM for a peek at the GUI.

Author information

Corresponding author

Correspondence to Quentin Lobbé.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Lobbé, Q. (2018). Revealing Historical Events Out of Web Archives. In: Méndez, E., Crestani, F., Ribeiro, C., David, G., Lopes, J. (eds) Digital Libraries for Open Knowledge. TPDL 2018. Lecture Notes in Computer Science, vol 11057. Springer, Cham. https://doi.org/10.1007/978-3-030-00066-0_30

  • DOI: https://doi.org/10.1007/978-3-030-00066-0_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00065-3

  • Online ISBN: 978-3-030-00066-0

  • eBook Packages: Computer Science, Computer Science (R0)
