WrapIt: Automated Integration of Web Databases with Extensional Overlaps
The World Wide Web no longer consists only of static web pages. Instead, more and more pages are generated dynamically from user requests and database content. Conventional search engines do not index these dynamic pages because they cannot simulate the required user input, and therefore often return insufficient results.
This paper presents a new approach to the online integration of web databases. Given only one sample HTML result page for a source, result pages for new requests are found by structural recognition. Once structural recognition is established for one source, other web databases of the same universe (e.g. movie databases) can be integrated on the fly by content-based recognition. The user thus receives results from several sources.
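To make the idea of structural recognition from a single sample page concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes that "structure" can be approximated by the path of enclosing HTML tags, and that a few values known to appear on the sample page are given. The names `PathCollector`, `collect`, `learn_paths`, and `extract` are hypothetical and introduced here only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def collect(html):
    """Return all (tag_path, text) pairs of an HTML document."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_paths(sample_html, known_values):
    """Tag paths at which the known values occur on the sample page."""
    return {path for path, text in collect(sample_html) if text in known_values}


def extract(new_html, paths):
    """Texts on a new result page that sit at one of the learned paths."""
    return [text for path, text in collect(new_html) if path in paths]
```

With a sample page containing `<td>Alien</td>` inside a result table, `learn_paths` yields the path `html/body/table/tr/td`, and `extract` then returns every cell at that path on a page for a new request. A plain tag path does not distinguish sibling columns; a real system would need a finer structural signature.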
No global schema is produced. Instead, the heterogeneity of the individual sources is preserved. The only requirement is that the databases overlap extensionally, i.e. that they describe some of the same real-world entities.
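The role of the extensional overlap in content-based recognition can be illustrated with a small sketch, again under stated assumptions rather than as the paper's method. It assumes that values already extracted from a first source (e.g. movie titles) are available, that a page of a second source has been reduced to (position, text) pairs, and that exact matching after normalization stands in for whatever similarity measure is actually used. The function names and the `min_hits` threshold are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def locate_attribute(known_values, page_items, min_hits=2):
    """Find the page position where values known from another source reappear.

    known_values: values already extracted from an integrated source.
    page_items:   (position, text) pairs from a page of the new source.
    Returns the position with the most overlapping values, or None if the
    overlap is smaller than min_hits.
    """
    known = {normalize(v) for v in known_values}
    hits = Counter(pos for pos, text in page_items if normalize(text) in known)
    if not hits:
        return None
    position, count = hits.most_common(1)[0]
    return position if count >= min_hits else None


def new_values(known_values, page_items, position):
    """Values at the recognized position that the first source did not have."""
    known = {normalize(v) for v in known_values}
    return [text for pos, text in page_items
            if pos == position and normalize(text) not in known]
```

The shared entities act as anchors: once the position of a known attribute is recognized in the new source, values at that position that were not previously known can be reported as additional results, without mapping either source to a global schema.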