Processing XML Streams with Deterministic Automata

  • Todd J. Green
  • Gerome Miklau
  • Makoto Onizuka
  • Dan Suciu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2572)

Abstract

We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution is to show that Deterministic Finite Automata (DFAs) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We give a theoretical analysis of the number of states in the DFA resulting from XPath expressions, and consider both the case in which it is constructed eagerly and the case in which it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily, and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally, on both synthetic and real XML data sets.
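
The idea of lazy construction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes linear path expressions using only the `/` and `//` axes and the `*` wildcard, with no predicates, and the names `parse`, `LazyDFA`, and `run` are hypothetical. Each DFA state is a set of NFA states, and a transition is computed by the subset construction only the first time the stream actually requires it, so the table holds only states reachable on the data seen so far.

```python
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[str, str]           # (axis, label): axis is "/" or "//"
NfaState = Tuple[int, int]       # (query index, number of steps matched)
DfaState = FrozenSet[NfaState]


def parse(path: str) -> List[Step]:
    """Split a well-formed linear path such as '/a//b/*' into steps."""
    steps, i = [], 0
    while i < len(path):
        axis = "//" if path.startswith("//", i) else "/"
        i += len(axis)
        j = path.find("/", i)
        j = len(path) if j == -1 else j
        steps.append((axis, path[i:j]))
        i = j
    return steps


class LazyDFA:
    def __init__(self, paths: List[str]) -> None:
        self.queries = [parse(p) for p in paths]
        self.start: DfaState = frozenset((q, 0) for q in range(len(paths)))
        # Transition table, filled in on demand.
        self.table: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, tag: str) -> DfaState:
        key = (state, tag)
        if key not in self.table:          # first use: subset construction
            nxt = set()
            for q, k in state:
                steps = self.queries[q]
                if k == len(steps):        # already accepted, no way forward
                    continue
                axis, label = steps[k]
                if axis == "//":           # descendant axis may skip this tag
                    nxt.add((q, k))
                if label in ("*", tag):    # this step matches the tag
                    nxt.add((q, k + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state: DfaState) -> List[int]:
        """Indices of the queries that accept in this DFA state."""
        return sorted(q for q, k in state if k == len(self.queries[q]))


def run(dfa: LazyDFA, events: List[Tuple[str, str]]):
    """Feed ('start', tag) / ('end', tag) events; report matches per element."""
    stack, hits = [dfa.start], []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            hits.append((tag, dfa.matches(stack[-1])))
        else:
            stack.pop()                    # return to the parent's state
    return hits


# Events for the document <a><b/><x><c/><b/></x></a>.
events = [("start", "a"), ("start", "b"), ("end", "b"), ("start", "x"),
          ("start", "c"), ("end", "c"), ("start", "b"), ("end", "b"),
          ("end", "x"), ("end", "a")]
dfa = LazyDFA(["/a/b", "//b", "/a/*/c"])
hits = run(dfa, events)
```

In this sketch the per-element cost is a single dictionary lookup once a transition has been materialized, whatever the number of expressions; the work of the subset construction is paid only for (state, tag) pairs that occur in the stream.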

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Todd J. Green (1)
  • Gerome Miklau (2)
  • Makoto Onizuka (3)
  • Dan Suciu (2)

  1. Xyleme SA, Saint-Cloud, France
  2. Department of Computer Science, University of Washington, Washington
  3. NTT Cyber Space Laboratories, NTT Corporation, USA
