How to Search the Internet Archive Without Indexing It

  • Nattiya Kanhabua
  • Philipp Kemkes
  • Wolfgang Nejdl
  • Tu Ngoc Nguyen
  • Felipe Reis
  • Nam Khanh Tran
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9819)

Abstract

Significant parts of our cultural heritage have been produced on the web over the last decades. While easy access to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs, which contain the implicit human feedback most relevant for today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link the retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface that comes close to the functionality of modern web search engines (e.g., keyword search, query auto-completion, and related query suggestion). It also has the benefit of taking user feedback on the current web into account for web archive search. Through extensive experiments, we conduct quantitative and qualitative analyses to provide insights that enable further research on, and practical applications of, web archives.
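The linking step described above can be illustrated with a minimal sketch. It assumes only the public Wayback Machine URL pattern (`https://web.archive.org/web/<timestamp>/<url>`); it is not the system's actual implementation, which this abstract does not detail. The function names `wayback_url` and `link_results` are hypothetical, and the ranked results are placeholder data standing in for a live-web search response.

```python
from typing import Optional

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, timestamp: Optional[str] = None) -> str:
    """Return a Wayback Machine URL for `url`.

    `timestamp` is a prefix of YYYYMMDDhhmmss (for example "2010" or
    "20100315"); the Wayback Machine redirects to the capture closest to it.
    Without a timestamp, "*" requests the overview of all captures.
    """
    if timestamp is None:
        return f"{WAYBACK_PREFIX}/*/{url}"
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits (YYYYMMDDhhmmss prefix)")
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


def link_results(results, timestamp=None):
    """Attach an `archive_url` to each ranked result, preserving rank order."""
    return [{**r, "archive_url": wayback_url(r["url"], timestamp)} for r in results]


# Placeholder ranked results (assumption: not real search engine output).
results = [
    {"rank": 1, "url": "http://example.com/"},
    {"rank": 2, "url": "http://example.org/news"},
]
linked = link_results(results, "2010")
for r in linked:
    print(r["rank"], r["archive_url"])
```

Because the ranking comes from the live-web search engine and the archive is reached only by rewriting URLs, no archived content has to be crawled or indexed in this sketch.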


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Nattiya Kanhabua (1)
  • Philipp Kemkes (2)
  • Wolfgang Nejdl (2)
  • Tu Ngoc Nguyen (2)
  • Felipe Reis (2)
  • Nam Khanh Tran (2)

  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
  2. L3S Research Center / Leibniz Universität Hannover, Hannover, Germany
