Crawling Web Pages with Support for Client-Side Dynamism

  • Manuel Álvarez
  • Alberto Pan
  • Juan Raposo
  • Justo Hidalgo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4016)

Abstract

A large amount of information on the web cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. Dealing with this problem requires solving two tasks: crawling the client-side hidden web and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

Keywords

Master List, Session Object, Script Code, Corporate Search, Navigation Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Manuel Álvarez (1)
  • Alberto Pan (1)
  • Juan Raposo (1)
  • Justo Hidalgo (2)
  1. Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain
  2. Denodo Technologies Inc, Madrid, Spain
