Discovering Interesting Relationships among Deep Web Databases: A Source-Biased Approach

World Wide Web

Abstract

The number of deep web databases has grown rapidly over the last decade, spurring interest in the automated discovery of interesting relationships among them. Unlike the “surface” web of static pages, deep web databases provide data through a web-based query interface and account for a large portion of all web content. This paper presents a novel source-biased approach for efficiently discovering interesting relationships among web-enabled databases on the deep web. The approach supports a relationship-centric view over a collection of deep web databases through source-biased database analysis and exploration, and it has three distinguishing features. First, we develop source-biased probing techniques, which determine in very few interactions whether a target database is relevant to a source database by probing the target with very precise probes. Second, we introduce source-biased relevance metrics to evaluate the relevance of the deep web databases discovered, to identify interesting types of source-biased relationships within a collection of deep web databases, and to rank them accordingly. The discovered relationships not only provide value-added metadata for each deep web database but can also directly support personalized relationship-centric queries. Third, we develop a performance optimization, source-biased probing with focal terms, that further improves the effectiveness of the basic source-biased model. We have designed a prototype system for crawling, probing, and supporting relationship-centric queries over deep web databases using the source-biased approach. Our experiments evaluate the effectiveness of the proposed source-biased analysis and discovery model and show that the source-biased approach outperforms query-biased probing and unbiased probing.
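
The abstract does not say how probes are selected or how relevance is scored, so the following Python sketch only illustrates the general idea of source-biased probing under stated assumptions. It draws probe terms from the source's own content, sends them to the target's query interface, and compares the sampled target documents with the source. The function names, the `search_target` callback, the `num_probes` parameter, and the use of cosine similarity as a stand-in relevance score are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable, List


def term_counts(docs: Iterable[str]) -> Counter:
    """Build a simple term-frequency summary from a set of documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency summaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def source_biased_relevance(
    source_docs: List[str],
    search_target: Callable[[str], List[str]],
    num_probes: int = 2,
) -> float:
    """Probe a target with terms drawn from the source, then score the target.

    Assumption: the most frequent source terms serve as probes, and cosine
    similarity stands in for a source-biased relevance metric.
    """
    source_summary = term_counts(source_docs)
    probes = [term for term, _ in source_summary.most_common(num_probes)]

    # Collect the distinct documents returned by the target for each probe.
    sampled = {}
    for probe in probes:
        for doc in search_target(probe):
            sampled[doc] = None

    target_summary = term_counts(sampled)
    return cosine(source_summary, target_summary)


def make_toy_search(docs: List[str]) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a deep web query interface (exact term match)."""
    def search(term: str) -> List[str]:
        return [d for d in docs if term in d.lower().split()]
    return search


source = ["gene protein sequence", "protein gene expression", "gene mutation"]
bio_target = make_toy_search(["gene protein folding", "protein structure", "stock market"])
cars_target = make_toy_search(["engine oil change", "used car prices"])

bio_score = source_biased_relevance(source, bio_target)
cars_score = source_biased_relevance(source, cars_target)
```

In this toy setting, a target that returns nothing for the source's probe terms receives a score of zero after only a handful of queries, which mirrors the claim that relevance can be judged in very few interactions.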

Author information

Corresponding author

Correspondence to James Caverlee.

About this article

Cite this article

Caverlee, J., Liu, L. & Rocco, D. Discovering Interesting Relationships among Deep Web Databases: A Source-Biased Approach. World Wide Web 9, 585–622 (2006). https://doi.org/10.1007/s11280-006-0227-7

  • DOI: https://doi.org/10.1007/s11280-006-0227-7
