Web Crawling

  • Bing Liu
  • Filippo Menczer
Chapter
Part of the Data-Centric Systems and Applications book series (DCSA)

Abstract

Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored).
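The hyperlink-following behavior described above can be illustrated with a minimal sketch. This is not the chapter's algorithm: it assumes a breadth-first visiting order, and the names `TOY_WEB`, `toy_fetch_links`, and `crawl` are hypothetical. The toy in-memory link graph stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
# A real crawler would download each page and extract its links instead.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_WEB.get(url, [])


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from `seed`; return URLs in visit order."""
    frontier = deque([seed])  # URLs waiting to be visited
    seen = {seed}             # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)   # a real crawler would store or analyze the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl("a.example/", toy_fetch_links)
```

Swapping the `deque` for a priority queue ordered by an estimated page score would turn this sketch into a best-first crawler; that variation is likewise an assumption, not something stated in the abstract.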

Keywords

Search Engine; Priority Queue; Cosine Similarity; Domain Name System; Anchor Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. Department of Computer Science, University of Illinois, Chicago, Chicago, USA
