Processing XML Streams with Deterministic Automata
We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution is to show that Deterministic Finite Automata (DFAs) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We give a theoretical analysis of the number of states in the DFA resulting from XPath expressions, considering both the case when it is constructed eagerly and the case when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally on both synthetic and real XML data sets.
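The lazy construction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small XPath subset (child steps `/`, descendant steps `//`, and the wildcard `*`, with no predicates), and the names `parse`, `LazyDFA`, and `run` are hypothetical. Each DFA state is a set of NFA positions, and a transition is computed only when the stream first requires it, then memoized.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# An NFA position: (query index, number of steps already matched).
Pos = Tuple[int, int]
State = FrozenSet[Pos]


def parse(xpath: str) -> List[Tuple[str, str]]:
    """Split e.g. '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    steps, i = [], 0
    while i < len(xpath):
        axis = '//' if xpath.startswith('//', i) else '/'
        i += len(axis)
        j = xpath.find('/', i)
        j = len(xpath) if j == -1 else j
        steps.append((axis, xpath[i:j]))
        i = j
    return steps


class LazyDFA:
    """DFA whose states (sets of NFA positions) are built on demand."""

    def __init__(self, xpaths: List[str]) -> None:
        self.queries = [parse(x) for x in xpaths]
        self.start: State = frozenset((q, 0) for q in range(len(self.queries)))
        # Memoized transitions; only entries reached by the data are created.
        self.table: Dict[Tuple[State, str], State] = {}

    def step(self, state: State, tag: str) -> State:
        key = (state, tag)
        if key not in self.table:
            nxt: Set[Pos] = set()
            for q, k in state:
                steps = self.queries[q]
                if k == len(steps):
                    continue  # a completed match has no outgoing moves
                axis, name = steps[k]
                if axis == '//':
                    nxt.add((q, k))  # descendant axis may skip this element
                if name in ('*', tag):
                    nxt.add((q, k + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state: State) -> Set[int]:
        """Indices of the queries fully matched in this state."""
        return {q for q, k in state if k == len(self.queries[q])}


def run(dfa: LazyDFA, events: List[Tuple[str, str]]) -> List[Tuple[int, str]]:
    """Feed SAX-like ('start' | 'end', tag) events; return (query, tag) hits."""
    stack = [dfa.start]
    hits = []
    for kind, tag in events:
        if kind == 'start':
            state = dfa.step(stack[-1], tag)
            stack.append(state)
            hits.extend((q, tag) for q in sorted(dfa.matches(state)))
        else:
            stack.pop()  # leaving an element restores the parent's state
    return hits


# Hypothetical stream for <a><b/><c><b/></c></a>.
events = [('start', 'a'), ('start', 'b'), ('end', 'b'), ('start', 'c'),
          ('start', 'b'), ('end', 'b'), ('end', 'c'), ('end', 'a')]
dfa = LazyDFA(['/a/b', '//b', '/a/*'])
hits = run(dfa, events)
```

In this sketch the per-event cost is one dictionary lookup once a transition has been materialized, regardless of how many expressions are loaded, and the table only grows with the distinct (state, tag) combinations that actually occur in the data.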