Building HyperView Wrappers for Publisher Web Sites
Electronic journals are becoming a major source of scientific information. Researchers interested only in certain topics do not have time to scan all possibly relevant journals on a regular basis. A digital library can assist them by providing a uniform, searchable interface for electronic journals. For this purpose, the digital library must establish a catalogue of metadata on the available journals, such as the authors and titles of articles. If there is no cooperation with journal publishers, this metadata must be extracted from the publishers' Web sites, overcoming the intrinsic heterogeneity problems.
Within the framework of the ongoing Natural Sciences Digital Library project at the Free University of Berlin, we have designed a wrapper-mediator mechanism that copes with the heterogeneity problems of automatic metadata acquisition. It is based on our generic HyperView methodology for the integration of Web sites, from which it inherits two elegant and effective features. First, the structure of the publisher site is specified with abstract graph schemata instead of being hard-coded in data acquisition scripts. Second, a powerful view concept based on declarative graph-transformation rules is used for information extraction.
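The two features can be illustrated with a minimal Python sketch. It is not HyperView's actual notation or implementation: the `Node` class, the `SCHEMA` and `RULE` formats, the function names, and the sample journal data are all assumptions made up for illustration. The sketch only shows the general idea of describing a site's structure as data (a schema over typed nodes and labelled edges) and extracting metadata with a declarative path rule rather than a site-specific script.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A page or page fragment in a (hypothetical) site graph."""
    id: str
    type: str
    attrs: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # list of (label, target_id)


# Hypothetical schema: node type -> {edge label: required target type}.
SCHEMA = {
    "Journal": {"issue": "Issue"},
    "Issue": {"article": "Article"},
    "Article": {},
}

# Hypothetical rule: an alternating type/label path plus an output template.
RULE = {
    "path": ["Journal", "issue", "Issue", "article", "Article"],
    "emit": lambda j, i, a: {
        "journal": j.attrs["name"],
        "volume": i.attrs["volume"],
        "title": a.attrs["title"],
        "authors": a.attrs["authors"],
    },
}


def conforms(graph, schema):
    """Check that every node type and edge in the graph is allowed by the schema."""
    for node in graph.values():
        allowed = schema.get(node.type)
        if allowed is None:
            return False
        for label, target_id in node.edges:
            if allowed.get(label) != graph[target_id].type:
                return False
    return True


def match(graph, path):
    """Yield tuples of nodes that follow the alternating type/label path."""
    types, labels = path[0::2], path[1::2]

    def extend(prefix, depth):
        if depth == len(labels):
            yield tuple(prefix)
            return
        for label, target_id in prefix[-1].edges:
            target = graph[target_id]
            if label == labels[depth] and target.type == types[depth + 1]:
                yield from extend(prefix + [target], depth + 1)

    for node in graph.values():
        if node.type == types[0]:
            yield from extend([node], 0)


def apply_rule(graph, rule):
    """Turn every match of the rule's path into one metadata record."""
    return [rule["emit"](*nodes) for nodes in match(graph, rule["path"])]


# Made-up sample data standing in for a parsed publisher site.
graph = {
    "j1": Node("j1", "Journal", {"name": "Example Journal"}, [("issue", "i1")]),
    "i1": Node("i1", "Issue", {"volume": 7},
               [("article", "a1"), ("article", "a2")]),
    "a1": Node("a1", "Article", {"title": "Paper A", "authors": ["X. One"]}),
    "a2": Node("a2", "Article", {"title": "Paper B", "authors": ["Y. Two"]}),
}

records = apply_rule(graph, RULE) if conforms(graph, SCHEMA) else []
```

In this sketch, supporting a differently structured publisher site would mean supplying a different `SCHEMA` and `RULE` while `conforms`, `match`, and `apply_rule` stay unchanged, which is the benefit the abstract attributes to schema-based specification over hard-coded scripts.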