How to Search the Internet Archive Without Indexing It

Kanhabua, Nattiya; Kemkes, Philipp; Nejdl, Wolfgang; Nguyen, Tu Ngoc; Reis, Felipe; Tran, Nam Khanh

doi:10.1007/978-3-319-43997-6_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Significant parts of our cultural heritage have been produced on the web over the last decades. While easy access to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs, which contain the implicit human feedback most relevant for today's web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link the retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface, which comes close to the functionality of modern web search engines (e.g., keyword search, query auto-completion and related query suggestion), and offers the benefit of also taking user feedback on the current web into account for web archive search. Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on and practical applications of web archives.
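
The linking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ranked URL list is a hypothetical stand-in for whatever a live-web search engine returns, and the function names `wayback_url` and `link_results` are assumptions made for illustration. The only external convention relied on is the public Wayback Machine replay URL pattern, `https://web.archive.org/web/<timestamp>/<url>`, which resolves to the capture closest to the given timestamp.

```python
from datetime import datetime

# Public replay prefix of the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(live_url: str, when: datetime) -> str:
    """Build a replay URL asking for the capture closest to `when`."""
    # The Wayback Machine uses 14-digit timestamps: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{live_url}"


def link_results(ranked_urls, when):
    """Attach an archive link to each live-web result, keeping the ranking.

    `ranked_urls` is a hypothetical, already-ranked list of URLs (for
    example, taken from a search engine response); no archived content
    is fetched or indexed here.
    """
    return [
        (rank, url, wayback_url(url, when))
        for rank, url in enumerate(ranked_urls, start=1)
    ]


if __name__ == "__main__":
    # Illustrative input only; these are placeholder URLs.
    results = ["http://a.example/", "http://b.example/"]
    for rank, live, archived in link_results(results, datetime(2012, 5, 1, 12)):
        print(rank, live, archived)
```

The sketch only constructs links, so the ranking comes entirely from the live-web search engine. Whether a capture actually exists near the requested date would have to be checked separately.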

Acknowledgments

This work was partially funded by the European Commission through the ERC Advanced Grant ALEXANDRIA under grant number 339233.

Author information

Authors and Affiliations

Department of Computer Science, Aalborg University, Aalborg, Denmark
Nattiya Kanhabua
L3S Research Center/Leibniz Universität Hannover, Hannover, Germany
Philipp Kemkes, Wolfgang Nejdl, Tu Ngoc Nguyen, Felipe Reis & Nam Khanh Tran

Corresponding author

Correspondence to Nattiya Kanhabua.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Hungarian Academy of Science, Budapest, Hungary
László Kovács
Leibniz Universität Hannover, Hannover, Germany
Thomas Risse
Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

About this paper

Cite this paper

Kanhabua, N., Kemkes, P., Nejdl, W., Nguyen, T.N., Reis, F., Tran, N.K. (2016). How to Search the Internet Archive Without Indexing It. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_12

DOI: https://doi.org/10.1007/978-3-319-43997-6_12
Published: 10 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)
