Current Challenges in Web Crawling

Shestakov, Denis

doi:10.1007/978-3-642-39200-9_49

Denis Shestakov

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7977)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.

Author information

Authors and Affiliations

Department of Media Technology, Aalto University, P.O. Box 15500, FI-00076, Aalto, Finland
Denis Shestakov

Editor information

Editors and Affiliations

University of Trento, Via Sommarive 5, 38123, Povo, TN, Italy
Florian Daniel
Department of Computer Science, Aalborg University, Selma Lagerloefs Vej 300, 9220, Aalborg, Denmark
Peter Dolog
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
Qing Li

Cite this paper

Shestakov, D. (2013). Current Challenges in Web Crawling. In: Daniel, F., Dolog, P., Li, Q. (eds) Web Engineering. ICWE 2013. Lecture Notes in Computer Science, vol 7977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39200-9_49

DOI: https://doi.org/10.1007/978-3-642-39200-9_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39199-6
Online ISBN: 978-3-642-39200-9
eBook Packages: Computer Science, Computer Science (R0)
