Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; de Vries, Arjen P.

doi:10.1007/978-3-319-43997-6_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy influences which data is archived. We use the link anchor text of two Web crawls created with different crawling strategies to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls also differ in size and in the number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.
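The matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, the lowercasing and whitespace normalization, and the choice to filter on the link's target URL are all assumptions made for illustration. The abstract only states that simple exact string matching was used and that a .nl sub-collection was created.

```python
from urllib.parse import urlparse


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def in_nl_domain(url):
    """True if the URL's host is under .nl (assumed sub-collection criterion)."""
    host = urlparse(url).hostname or ""
    return host.endswith(".nl")


def topic_coverage(anchors, topics):
    """Return the topics that exactly equal at least one normalized anchor text."""
    anchor_set = {normalize(text) for _, text in anchors}
    return {t for t in topics if normalize(t) in anchor_set}


# Hypothetical toy data: (target URL, anchor text) pairs from a crawl.
crawl = [
    ("http://www.example.nl/nieuws", "World Cup"),
    ("http://www.example.com/a", "ice bucket challenge"),
    ("http://www.example.nl/b", "Click here"),
]
# Hypothetical list of popular topics.
topics = ["world cup", "Ice Bucket Challenge", "ebola"]

# Coverage over the whole crawl versus the .nl sub-collection only.
covered_all = topic_coverage(crawl, topics)
nl_only = [(url, text) for url, text in crawl if in_nl_domain(url)]
covered_nl = topic_coverage(nl_only, topics)
```

Storing the normalized anchor texts in a set keeps each topic lookup constant-time, which matters when a crawl yields a large number of anchors. Comparing the size of `covered_all` with that of `covered_nl` mirrors the kind of whole-crawl versus sub-collection comparison the abstract describes.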

Notes

1. www.kb.nl.
2. www.delpher.nl.
3. http://archive.org/web/researcher/ArcFileFormat.php.
4. http://commoncrawl.org/.
5. http://jsoup.org/.
6. http://www.google.com/trends/topcharts?hl=en#date=2014&geo=.
7. http://dumps.wikimedia.org/other/pagecounts-raw/.
8. These projects are: wikibooks, wiktionary, wikinews, wikivoyage, wikiquote, wikisource, wikiversity, and wikipedia.
9. http://www.xml.com/pub/a/2001/05/30/didl.html.

Acknowledgments

We would like to thank the National Library of the Netherlands for their support. This research was funded by the Netherlands Organization for Scientific Research (NWO CATCH program, WebART project) and the Dutch COMMIT/ program (SEALINCMedia project). Part of the analysis work was carried out on the Dutch e-infrastructure with the support of the SURF Foundation.

Author information

Authors and Affiliations

Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Thaer Samar, Myriam C. Traub & Jacco van Ossenbruggen
Radboud University, Nijmegen, The Netherlands
Arjen P. de Vries

Corresponding author

Correspondence to Thaer Samar.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Hungarian Academy of Science, Budapest, Hungary
László Kovács
Leibniz Universität Hannover, Hannover, Germany
Thomas Risse
Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

About this paper

Cite this paper

Samar, T., Traub, M.C., van Ossenbruggen, J., de Vries, A.P. (2016). Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_11

DOI: https://doi.org/10.1007/978-3-319-43997-6_11
Published: 10 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)
