
Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling

  • Ata Turk
  • B. Barla Cambazoglu
  • Cevdet Aykanat
Conference paper

Abstract

Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, where discovered links are communicated between processors. This scheme can attain the coverage and quality level of a serial crawler while avoiding redundant crawling of pages by different processors. The main problem in the exchange scheme is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems by intelligent assignment of sites to processors. Our hypergraph model can correctly capture and minimize the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, significant performance improvements are observed in reducing the inter-processor communication overhead.
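To make the quantity being minimized concrete, the following is a minimal sketch, not the authors' hypergraph model. It assumes, purely for illustration, that links from one page to sites owned by the same remote processor are batched into a single message, so the cost of a page is the number of distinct remote processors its links reach. The function names `hash_assign` and `count_messages`, the CRC32 hash, and the toy site names are all hypothetical.

```python
import zlib


def hash_assign(site: str, num_procs: int) -> int:
    """Hash-based baseline: map a site name to a processor id.

    CRC32 is an arbitrary choice made here for determinism; the
    abstract does not say which hash function the baseline uses.
    """
    return zlib.crc32(site.encode("utf-8")) % num_procs


def count_messages(pages, assignment):
    """Count link-exchange messages under a site-to-processor assignment.

    pages: iterable of (source_site, [target_sites]) pairs, one per page.
    assignment: dict mapping site name -> processor id.

    Assumption (illustrative only): a page sends one message to each
    distinct remote processor that owns at least one of its link targets.
    """
    total = 0
    for source_site, target_sites in pages:
        home = assignment[source_site]
        remote_procs = {assignment[t] for t in target_sites} - {home}
        total += len(remote_procs)
    return total


# Toy data: four single-page sites and their outgoing links.
pages = [
    ("a.example", ["b.example", "c.example"]),
    ("b.example", ["a.example"]),
    ("c.example", ["d.example"]),
    ("d.example", ["c.example"]),
]

# An assignment that keeps tightly linked sites together.
grouped = {"a.example": 0, "b.example": 0, "c.example": 1, "d.example": 1}
# An assignment that separates them.
scattered = {"a.example": 0, "b.example": 1, "c.example": 0, "d.example": 1}
# The hash-based baseline on two processors.
hashed = {site: hash_assign(site, 2) for site, _ in pages}

print("grouped:", count_messages(pages, grouped))
print("scattered:", count_messages(pages, scattered))
print("hashed:", count_messages(pages, hashed))
```

Under this assumed cost, the assignment that keeps linked sites on the same processor sends fewer messages than the one that splits them. A hash-based assignment ignores the link structure entirely, which is why a structure-aware partitioning can reduce the overhead.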


Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Ata Turk (1)
  • B. Barla Cambazoglu (2)
  • Cevdet Aykanat (1)
  1. Computer Engineering Department, Bilkent University, Ankara, Turkey
  2. Yahoo! Research, Barcelona, Spain
