From Searching Text to Querying XML Streams
XML data is queried with XPath expressions, which are a limited form of regular expressions. New XML stream processing applications, such as content-based routing or selective dissemination of information, require thousands or millions of XPath expressions to be evaluated simultaneously on the incoming XML stream at a high, sustained rate. Conceptually, the XPath evaluation problem is analogous to the text search problem, in which one or several regular expressions need to be matched against a given text. Here, however, the number of regular expressions is much larger, while the “text” is much shorter, since its length corresponds to the depth of the XML stream. In this paper we examine techniques that have been proposed for XML stream processing, which are variations of either a non-deterministic or a deterministic finite automaton (NFA and DFA). For the latter, we describe a series of theoretical results establishing lower and upper bounds on the number of DFA states for sets of XPath expressions.
Keywords: Query Processing, Regular Expression, Wild Card, XPath Expression, Query Tree
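To make the analogy with text search concrete, the following is a minimal sketch of the NFA-style approach mentioned in the abstract, not the implementation of any of the systems surveyed. It assumes a restricted XPath fragment (child `/`, descendant `//`, and the wildcard `*`, with no predicates) and a stream given as `("start", label)` / `("end", label)` events. All function names (`parse_xpath`, `step`, `filter_stream`) are hypothetical and chosen only for illustration.

```python
import re


def parse_xpath(expr):
    """Split a linear XPath such as '//a/*/b' into (axis, label) steps."""
    return re.findall(r"(//?)([^/]+)", expr)


def step(states, label, queries):
    """Advance a set of NFA states on one element label.

    A state (q, i) means: the first i steps of query q have been matched.
    """
    nxt = set()
    for q, i in states:
        steps = queries[q]
        if i == len(steps):
            continue  # accepting state: no outgoing transitions
        axis, name = steps[i]
        if name == "*" or name == label:
            nxt.add((q, i + 1))  # this step matches the current element
        if axis == "//":
            nxt.add((q, i))  # descendant axis: self-loop on any label
    return nxt


def filter_stream(events, exprs):
    """Return the subset of exprs matched by at least one element of the stream."""
    queries = [parse_xpath(e) for e in exprs]
    # The stack holds one state set per open element, so its height is
    # bounded by the depth of the XML document.
    stack = [{(q, 0) for q in range(len(queries))}]
    matched = set()
    for kind, label in events:
        if kind == "start":
            current = step(stack[-1], label, queries)
            stack.append(current)
            for q, i in current:
                if i == len(queries[q]):
                    matched.add(exprs[q])
        else:  # "end": return to the state set of the parent element
            stack.pop()
    return matched


# Events for the document <a><b><c/></b><d/></a>
events = [
    ("start", "a"), ("start", "b"), ("start", "c"), ("end", "c"),
    ("end", "b"), ("start", "d"), ("end", "d"), ("end", "a"),
]
exprs = ["/a/b", "//c", "/a/*/c", "/a/c", "//b/d", "/b"]
result = filter_stream(events, exprs)
```

In this sketch the input consumed by the automaton at any moment is the path from the root to the current element, which is why the "text" is only as long as the document is deep. A DFA-based variant would instead precompute, or lazily build, a single state for each reachable set of `(q, i)` pairs, which is where the question of how many DFA states are needed arises.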