Abstract
Significant parts of our cultural heritage have been produced on the web over the last decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs, which contain the implicit human feedback most relevant for today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. We then link the retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface that comes close to the functionality of modern web search engines (e.g., keyword search, query auto-completion and related query suggestion), and it offers the benefit of taking user feedback on the current web into account for web archive search as well. Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on and practical applications of web archives.
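The linking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `wayback_url`, `link_results` and `ArchivedResult` are hypothetical, the ranked list is a hard-coded placeholder standing in for a search engine response (no Bing call is made), and the only external fact relied on is the public Wayback Machine URL pattern `https://web.archive.org/web/<YYYYMMDDhhmmss>/<url>`, which resolves to the capture closest to the given timestamp.

```python
from datetime import datetime
from typing import List, NamedTuple

# Public URL prefix of the Wayback Machine replay service.
WAYBACK_PREFIX = "https://web.archive.org/web"


class ArchivedResult(NamedTuple):
    """One search result paired with its archive link (hypothetical type)."""
    rank: int
    live_url: str
    wayback_url: str


def wayback_url(live_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL asking for the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit Wayback timestamp
    return f"{WAYBACK_PREFIX}/{timestamp}/{live_url}"


def link_results(ranked_urls: List[str], when: datetime) -> List[ArchivedResult]:
    """Keep the search engine's ranking and attach an archive link to each URL."""
    return [
        ArchivedResult(rank, url, wayback_url(url, when))
        for rank, url in enumerate(ranked_urls, start=1)
    ]


if __name__ == "__main__":
    # Placeholder ranked list; in the described system it would come from Bing.
    ranked = ["http://example.com/", "http://example.org/news"]
    for result in link_results(ranked, datetime(2010, 1, 2, 3, 4, 5)):
        print(result.rank, result.wayback_url)
```

The sketch preserves the ranking produced on the current web and only rewrites each URL, which is why no archived content has to be processed or indexed. Whether a capture actually exists for a given URL and date is not checked here.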
Acknowledgments
This work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under the grant number 339233.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kanhabua, N., Kemkes, P., Nejdl, W., Nguyen, T.N., Reis, F., Tran, N.K. (2016). How to Search the Internet Archive Without Indexing It. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_12
DOI: https://doi.org/10.1007/978-3-319-43997-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)