Information extraction and database techniques: A user-oriented approach to querying the web

  • Zoé Lacroix
  • Arnaud Sahuguet
  • Raman Chandrasekar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1413)


We propose a novel approach to querying the Web with a system named AKIRA (Agentive Knowledge-based Information Retrieval Architecture), which combines advanced technologies from Information Retrieval and Extraction with Database techniques. The former enable the system to access the explicit as well as the implicit structure of Web documents and to organize them into a hierarchy of concepts and metaconcepts; the latter provide tools for data manipulation. We propose a user-oriented approach: given the user's query, AKIRA extracts a target structure (the structure expressed in the query) and uses standard retrieval techniques to access potentially relevant documents. The content of these documents is processed using extraction techniques (along with a flexible agentive structure) to filter for relevance and to extract from them implicit or explicit structure matching the target structure. The information gathered is used to populate a smart-cache (an object-oriented database) whose schema is inferred from the target structure. This smart-cache, whose schema is thus defined a posteriori, is populated and queried with an expression of PIQL, our query language. AKIRA integrates these complementary techniques to provide maximum flexibility to the user and offer transparent access to the content of Web documents.
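The flow described in the abstract (query, target structure, retrieval, extraction and filtering, smart-cache) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not AKIRA's implementation: the names `PAGES`, `EXTRACTORS`, `extract_target_structure`, `retrieve`, `extract_record`, `SmartCache`, and `answer` are all hypothetical, the "extraction" is a pair of regular expressions rather than the linguistic processing the abstract refers to, the cache is an in-memory list rather than an object-oriented database, and the final lookup is plain Python rather than PIQL, whose syntax is not given here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical in-memory "Web": URL -> page text (stand-in for real retrieval).
PAGES = {
    "http://example.org/a": "Alice Martin is a professor. Email: alice@example.org",
    "http://example.org/b": "Weather report: sunny with light wind.",
    "http://example.org/c": "Bob Stone, professor of logic. Email: bob@example.org",
    "http://example.org/d": "The professor lounge is closed today.",
}

# Hypothetical concept extractors: concept name -> pattern with one capture group.
EXTRACTORS = {
    "email": re.compile(r"Email:\s*(\S+@\S+)"),
    "name": re.compile(r"^([A-Z][a-z]+ [A-Z][a-z]+)"),
}


def extract_target_structure(query):
    """Return the concepts named in the query that we know how to extract."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return sorted(c for c in EXTRACTORS if c in words)


def retrieve(keyword):
    """Plain keyword retrieval: pages whose text mentions the keyword."""
    return {url: text for url, text in PAGES.items() if keyword in text.lower()}


def extract_record(text, target):
    """Fill every target concept from the page, or return None if one is missing."""
    record = {}
    for concept in target:
        match = EXTRACTORS[concept].search(text)
        if match is None:
            return None  # page does not match the target structure
        record[concept] = match.group(1)
    return record


@dataclass
class SmartCache:
    """Toy cache whose schema is fixed only once the target structure is known."""
    schema: tuple
    rows: list = field(default_factory=list)

    def add(self, url, record):
        self.rows.append({"url": url, **record})

    def select(self, concept):
        return [row[concept] for row in self.rows]


def answer(query, keyword):
    """Run the whole toy pipeline and return the populated cache."""
    target = extract_target_structure(query)
    cache = SmartCache(schema=tuple(target))  # schema inferred from the query
    for url, text in retrieve(keyword).items():
        record = extract_record(text, target)
        if record is not None:  # keep only pages matching the target structure
            cache.add(url, record)
    return cache


cache = answer("name and email of each professor", keyword="professor")
print(cache.schema)
print(cache.select("email"))
```

In this sketch, page `d` is retrieved by the keyword search but dropped at the extraction step because it contains neither a name nor an email address, which mirrors the distinction the abstract draws between retrieving potentially relevant documents and filtering them against the target structure.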


Web, data model, query language, information retrieval & extraction, agents, cache, view





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Zoé Lacroix (1)
  • Arnaud Sahuguet (2)
  • Raman Chandrasekar (3)

  1. Institute for Research in Cognitive Science, University of Pennsylvania, Philadelphia, USA
  2. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
  3. Institute for Research in Cognitive Science & Center for the Advanced Study of India, Philadelphia, USA
