Crawling the Content Hidden Behind Web Forms

  • Manuel Álvarez
  • Juan Raposo
  • Alberto Pan
  • Fidel Cacheda
  • Fernando Bellas
  • Víctor Carneiro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4706)

Abstract

Today's crawler engines cannot reach most of the information contained in the Web. A large amount of valuable information is “hidden” behind the query forms of online databases, and/or is dynamically generated by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data collection tasks.
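
To make the idea of a domain definition more concrete, the following is a minimal sketch of how form-field labels might be matched against the attributes of a domain to decide whether a form is relevant. It is not DeepBot's actual algorithm, which the abstract does not detail. The class names, the alias lists, the string-similarity measure (Python's standard `difflib`), and both thresholds are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Attribute:
    """One attribute of a domain, e.g. 'title' (hypothetical structure)."""
    name: str
    aliases: list[str] = field(default_factory=list)


@dataclass
class DomainDefinition:
    """A data-collecting task described by its attributes (hypothetical)."""
    name: str
    attributes: list[Attribute]
    min_matched: int = 1  # assumed relevance threshold, not from the paper


def similarity(a: str, b: str) -> float:
    """Case-insensitive text similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_fields(domain: DomainDefinition, field_labels: list[str],
                 threshold: float = 0.8) -> dict[str, str]:
    """Map each form-field label to its best-matching attribute name."""
    matches = {}
    for label in field_labels:
        best_name, best_score = None, 0.0
        for attr in domain.attributes:
            for text in [attr.name, *attr.aliases]:
                score = similarity(label, text)
                if score > best_score:
                    best_name, best_score = attr.name, score
        if best_score >= threshold:
            matches[label] = best_name
    return matches


def is_relevant(domain: DomainDefinition, field_labels: list[str],
                threshold: float = 0.8) -> bool:
    """Treat a form as relevant if enough distinct attributes are matched."""
    matched = set(match_fields(domain, field_labels, threshold).values())
    return len(matched) >= domain.min_matched


# Illustrative domain and form labels (invented for this example).
books = DomainDefinition(
    name="books",
    attributes=[
        Attribute("title", aliases=["book title"]),
        Attribute("author", aliases=["author name", "written by"]),
    ],
    min_matched=2,
)
form_labels = ["Title", "Author", "Zip code"]
print(match_fields(books, form_labels))
print(is_relevant(books, form_labels))
```

A real crawler would also need to associate each label with its form field in the rendered page and to learn how to fill in and submit the form; this sketch covers only the label-to-attribute matching step.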

Keywords

Keyword Query; Form Field; Visual Distance; Query Form; Domain Attribute
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Manuel Álvarez (1)
  • Juan Raposo (1)
  • Alberto Pan (1)
  • Fidel Cacheda (1)
  • Fernando Bellas (1)
  • Víctor Carneiro (1)
  1. Department of Information and Communications Technologies, University of A Coruña, 15071 A Coruña, Spain
