From Searching Text to Querying XML Streams

  • Dan Suciu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2476)

Abstract

XML data is queried with XPath expressions, which are a limited form of regular expressions. New XML stream processing applications, such as content-based routing or selective dissemination of information, require thousands or millions of XPath expressions to be evaluated simultaneously on the incoming XML stream at a high, sustained rate. Conceptually, the XPath evaluation problem is analogous to the text search problem, in which one or several regular expressions need to be matched against a given text, but here the number of regular expressions is much larger, while the “text” is much shorter, since it corresponds to the depth of the XML stream. In this paper we examine techniques that have been proposed for XML stream processing, which are variations of either a non-deterministic or a deterministic finite automaton (NFA or DFA). For the latter, we describe a series of theoretical results establishing lower and upper bounds on the number of DFA states for sets of XPath expressions.
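As a rough illustration of the automaton view described in the abstract, the following minimal sketch evaluates a set of linear XPath expressions over a stream of start and end element events. It is not the algorithm of the paper or of any cited system. It assumes expressions built only from `/`, `//`, element names, and `*`, with no predicates, and the names `parse_xpath` and `match_stream` are hypothetical. Each set of NFA states is treated as a single state whose transitions are memoized, which mimics a lazily built DFA, and the stack depth follows the depth of the XML stream.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_xpath(xpath):
    """Split a linear XPath such as '/a//b/*' into (axis, name) steps."""
    return re.findall(r"(//|/)([^/]+)", xpath)


def match_stream(xml_bytes, xpaths):
    """Return the subset of xpaths that select at least one element."""
    queries = [parse_xpath(x) for x in xpaths]
    # An NFA state (q, i) means: the first i steps of query q are matched.
    start = frozenset((q, 0) for q in range(len(queries)))
    cache = {}  # (state set, label) -> state set; a lazily built DFA table

    def step(states, label):
        key = (states, label)
        if key not in cache:
            nxt = set()
            for q, i in states:
                if i == len(queries[q]):
                    continue  # accepting states have no outgoing moves
                axis, name = queries[q][i]
                if axis == "//":
                    nxt.add((q, i))  # descendant axis: stay active deeper
                if name in ("*", label):
                    nxt.add((q, i + 1))  # this element satisfies step i
            cache[key] = frozenset(nxt)
        return cache[key]

    stack = [start]  # one state set per open element
    matched = set()
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            states = step(stack[-1], elem.tag)
            stack.append(states)
            matched.update(
                xpaths[q] for q, i in states if i == len(queries[q])
            )
        else:
            stack.pop()  # leaving the element restores the parent's states
    return matched
```

For example, `match_stream(b"<a><c><b/></c><d/></a>", ["/a//b", "/a/b", "//d"])` processes each event with one cached lookup once a transition has been seen. The size of `cache` corresponds to the number of DFA states and transitions actually materialized for the given stream.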

Keywords

Query Processing, Regular Expression, Wild Card, XPath Expression, Query Tree
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Dan Suciu, University of Washington, USA
