Integrating Deep-Web Information Sources

  • Iñaki Fernández de Viana
  • Inma Hernandez
  • Patricia Jiménez
  • Carlos R. Rivero
  • Hassan A. Sleiman
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 71)


Deep-web information sources are difficult to integrate into automated business processes if they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with such forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We think that this will help reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. This is intended to help guide future research efforts in this area.
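To make the idea of a wrapping agent concrete, the following is a minimal sketch and not the framework reported in the paper. It assumes a query is a list of equality conditions and that the search form is submitted via HTTP GET. The class names `Condition` and `WrappingAgent`, the URL `https://example.org/search`, and the form field names `au` and `yr` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Condition:
    """One equality condition of a structured query, e.g. author = 'Kushmerick'."""
    attribute: str
    value: str


class WrappingAgent:
    """Hypothetical wrapper that maps query attributes to search-form fields."""

    def __init__(self, form_url: str, field_map: dict[str, str]):
        self.form_url = form_url      # address the form submits to (assumed GET)
        self.field_map = field_map    # query attribute -> form field name

    def to_request_url(self, conditions: list[Condition]) -> str:
        """Translate high-level conditions into the low-level form request."""
        params = {}
        for cond in conditions:
            if cond.attribute not in self.field_map:
                # The form offers no field for this attribute.
                raise ValueError(f"form cannot filter on {cond.attribute!r}")
            params[self.field_map[cond.attribute]] = cond.value
        return f"{self.form_url}?{urlencode(params)}"


# Illustrative source description; every value here is made up.
agent = WrappingAgent(
    form_url="https://example.org/search",
    field_map={"author": "au", "year": "yr"},
)
url = agent.to_request_url(
    [Condition("author", "Kushmerick"), Condition("year", "2000")]
)
print(url)
```

The developer writes conditions over attributes such as `author` and `year`, and the mapping to form fields lives in one place instead of being repeated ad hoc in every integration. A real agent would also have to navigate to the form, submit it, and extract data from the result pages, which this sketch leaves out.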


Information web integration 





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Iñaki Fernández de Viana (1)
  • Inma Hernandez (2)
  • Patricia Jiménez (1)
  • Carlos R. Rivero (2)
  • Hassan A. Sleiman (2)
  1. University of Huelva
  2. University of Sevilla
