
Analysing the Effectiveness of Crawlers on the Client-Side Hidden Web

  • Víctor M. Prieto
  • Manuel Álvarez
  • Rafael López-García
  • Fidel Cacheda
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 157)

Abstract

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we accomplish several tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the scale. Second, we present a Web site whose purpose is to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of the crawlers on the Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to several open-source and commercial crawlers, as well as to the robots of the main Web search engines.
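As a rough sketch of the kind of check the third task implies (not the authors' actual implementation): the test Web site exposes pages that are only reachable by executing client-side code, so the server's access log reveals which crawlers managed to fetch them. The Python snippet below scans an Apache-style “combined” access log for successful requests to one such JavaScript-only page and reports the user agents that retrieved it; the log path, target URL and log format are assumptions for illustration.

    import re
    from collections import defaultdict

    # Hypothetical target: a page linked only through JavaScript, so only
    # crawlers that execute client-side code should ever request it.
    JS_ONLY_TARGET = "/tests/javascript/dynamic_link_target.html"

    # Apache "combined" format: host ident user [time] "request" status size "referer" "agent"
    LOG_LINE = re.compile(
        r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
    )

    def crawlers_that_passed(log_path: str) -> dict:
        """Count successful fetches of the JS-only page, grouped by user agent."""
        hits = defaultdict(int)
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                m = LOG_LINE.match(line.strip())
                if m and m.group("url") == JS_ONLY_TARGET and m.group("status") == "200":
                    hits[m.group("agent")] += 1
        return dict(hits)

    if __name__ == "__main__":
        for agent, count in sorted(crawlers_that_passed("access.log").items()):
            print("%4d  %s" % (count, agent))

Run against the site's log after a crawl, any crawler whose user agent appears in the output can be placed at least at the level of the scale that the target page represents.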



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Víctor M. Prieto (1)
  • Manuel Álvarez (1)
  • Rafael López-García (1)
  • Fidel Cacheda (1)

  1. Department of Information and Communication Technologies, University of A Coruña, A Coruña, Spain
