Building HyperView Wrappers for Publisher Web Sites

  • Lukas C. Faulstich
  • Myra Spiliopoulou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1513)

Abstract

Electronic journals are becoming a major source of scientific information. Researchers interested only in certain topics do not have time to scan all possibly relevant journals on a regular basis. A digital library can assist them by providing a uniform, searchable interface for electronic journals. For this purpose, the digital library must establish a catalogue of metadata on the available journals, such as the authors and titles of articles. If there is no cooperation with journal publishers, this metadata must be extracted from the publishers’ Web Sites, overcoming the intrinsic heterogeneity problems.

Within the framework of the ongoing Natural Sciences Digital Library project at the Free University of Berlin, we have designed a wrapper-mediator mechanism that copes with the heterogeneity problems of automatic metadata acquisition. It is based on our generic HyperView methodology for integration of Web Sites. From this methodology it inherits two elegant and effective features. First, the structure of the publisher site is specified with abstract graph-schemata, instead of being hard-coded in scripts for data acquisition. Second, a powerful view concept based on declarative graph-transformation rules is used for information extraction.
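
The two features named above can be illustrated with a minimal sketch. It is not HyperView's actual schema or rule language, which the abstract does not show: the names `SITE_SCHEMA`, `Node`, `conforms`, `ARTICLE_RULE`, and `apply_rule`, as well as the sample journal data, are hypothetical. The sketch assumes a site is modelled as a typed graph, that the schema lists the edge labels each node type may carry, and that an extraction rule is plain data (a path of edge labels plus an attribute mapping) rather than hand-written scraping code.

```python
from dataclasses import dataclass

# Hypothetical graph schema: for each node type, the allowed edge labels
# and the node type each edge must lead to.
SITE_SCHEMA = {
    "Journal": {"issue": "Issue"},
    "Issue": {"article": "Article"},
    "Article": {},
}


@dataclass
class Node:
    type: str    # node type, expected to be a key of the schema
    attrs: dict  # attribute name -> value found on the page
    edges: dict  # edge label -> list of child Nodes


def conforms(node, schema=SITE_SCHEMA):
    """Return True if the node and all its descendants follow the schema."""
    allowed = schema.get(node.type)
    if allowed is None:
        return False
    for label, children in node.edges.items():
        if label not in allowed:
            return False
        for child in children:
            if child.type != allowed[label] or not conforms(child, schema):
                return False
    return True


# Hypothetical declarative rule: follow the edge path from the root and
# map source attributes to target metadata fields.
ARTICLE_RULE = {
    "path": ["issue", "article"],
    "emit": {"title": "title", "authors": "authors"},
}


def apply_rule(root, rule):
    """Collect one metadata record per node reached via the rule's path."""
    nodes = [root]
    for label in rule["path"]:
        nodes = [child for n in nodes for child in n.edges.get(label, [])]
    return [
        {target: n.attrs.get(source) for target, source in rule["emit"].items()}
        for n in nodes
    ]


# Invented sample data standing in for a parsed publisher site.
journal = Node(
    "Journal",
    {"name": "Example Journal"},
    {"issue": [Node(
        "Issue",
        {"number": 1},
        {"article": [Node("Article", {"title": "A", "authors": ["X"]}, {})]},
    )]},
)

records = apply_rule(journal, ARTICLE_RULE) if conforms(journal) else []
```

Under these assumptions, supporting another publisher would mean supplying a different schema and rule set while `conforms` and `apply_rule` stay unchanged, which is the benefit the abstract attributes to specifying site structure separately from acquisition scripts.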

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Lukas C. Faulstich (1)
  • Myra Spiliopoulou (2)
  1. Institut für Informatik, Freie Universität Berlin, Germany
  2. Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin, Germany
