How to Search the Internet Archive Without Indexing It

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

Significant parts of cultural heritage have been produced on the web over the last decades. While easy access to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs, which contain the implicit human feedback most relevant to today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. We then link the retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools with a user-friendly interface that comes close to the functionality of modern web search engines (e.g., keyword search, query auto-completion, and related query suggestion), and it has the added benefit of taking user feedback on the current web into account for web archive search as well. Through extensive experiments, we conduct quantitative and qualitative analyses to provide insights that enable further research on, and practical applications of, web archives.
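
The linking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the ranked URLs have already been obtained from a web search engine (the search call itself is omitted), and the helper names `wayback_url` and `link_results` are hypothetical. It relies only on the public Wayback Machine URL pattern `https://web.archive.org/web/<timestamp>/<url>`, where a partial timestamp resolves to the closest capture and `*` lists all captures.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple

# Public URL prefix of the Wayback Machine replay service.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, when: Optional[date] = None) -> str:
    """Build a Wayback Machine link for a live-web URL (hypothetical helper).

    With a date, the link asks for the capture closest to that day;
    without one, '*' requests the overview of all captures.
    """
    stamp = when.strftime("%Y%m%d") if when else "*"
    return f"{WAYBACK_PREFIX}/{stamp}/{url}"


def link_results(
    ranked_urls: Iterable[str], when: Optional[date] = None
) -> List[Tuple[int, str, str]]:
    """Attach an archive link to each result, preserving the search ranking.

    `ranked_urls` is assumed to come from a current-web search engine;
    no archived content is fetched or indexed here.
    """
    return [
        (rank, url, wayback_url(url, when))
        for rank, url in enumerate(ranked_urls, start=1)
    ]


if __name__ == "__main__":
    # Placeholder result list; a real system would take this from the search API.
    results = ["http://www.thefullwiki.org/Battle_of_Rathmines"]
    for rank, live, archived in link_results(results, date(2014, 8, 11)):
        print(rank, live, archived)
```

Because the ranking is taken from the current web and only the links are rewritten, this approach needs no index over the archive itself; whether a capture actually exists for a given URL and date would still have to be checked separately.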

Notes

  1. https://archive.org/web/

  2. http://timetravel.mementoweb.org

  3. https://archive-it.org

  4. http://alexandria-project.eu/archivesearch/

  5. https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics

  6. http://www.nydailynews.com/entertainment/gossip/linda-perry-slams-lady-gaga-article-1.2500319

  7. http://www.thefullwiki.org/Battle_of_Rathmines

  8. http://irelandinhistory.blogspot.de/2014/08/blog-post_11.html

Acknowledgments

This work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under the grant number 339233.

Author information

Corresponding author

Correspondence to Nattiya Kanhabua.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kanhabua, N., Kemkes, P., Nejdl, W., Nguyen, T.N., Reis, F., Tran, N.K. (2016). How to Search the Internet Archive Without Indexing It. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_12

  • DOI: https://doi.org/10.1007/978-3-319-43997-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43996-9

  • Online ISBN: 978-3-319-43997-6

  • eBook Packages: Computer Science, Computer Science (R0)
