Online Dictionary Matching for Streams of XML Documents

  • Panu Silvasti
  • Seppo Sippu
  • Eljas Soisalon-Soininen
Conference paper
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 323)

Abstract

We consider the online multiple-pattern matching problem for streams of XML documents, when the patterns are expressed as linear XPath expressions containing child operators (/), descendant operators (//) and wildcards (*) but no predicates. For each document in the stream, the task is to determine all occurrences in the document of all the patterns. We present a general multiple-pattern-matching algorithm that is based on a backtracking deterministic finite automaton derived from the classic Aho–Corasick pattern-matching automaton. This automaton is of size linear in the sum of the sizes of the XPath patterns, and the worst-case time bound of the algorithm is better than the time bound of the simulation of linear-size nondeterministic automata. In addition to the worst-case-efficient general solution, we present an algorithm with a simple backtracking mechanism that works extremely well for cases in which the backtracking stack remains shallow. Our experiments show that, when applied to filtering, this simple algorithm scales well with the number of patterns (or filters) and is competitive with YFilter, a widely accepted software system for XML filtering.
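To make the problem statement concrete, the following is a minimal Python sketch of matching several linear XPath patterns against one document delivered as a stream of start and end events. It is not the Aho–Corasick-derived backtracking automaton described in the abstract. It is the plain stack-based simulation of a linear-size nondeterministic automaton, the kind of baseline the abstract compares against. The names `parse_xpath` and `match_stream` and the `(kind, name)` event tuples are assumptions made for illustration; the tuples stand in for the callbacks of a SAX-style parser.

```python
import re


def parse_xpath(pattern):
    """Split a linear XPath such as '/a//b/*' into (axis, label) steps."""
    steps = re.findall(r"(//|/)([^/]+)", pattern)
    # Reject anything that is not a pure sequence of /label or //label steps.
    if not steps or "".join(axis + label for axis, label in steps) != pattern:
        raise ValueError(f"not a linear XPath pattern: {pattern!r}")
    return steps


def match_stream(patterns, events):
    """Return (pattern, element path) for every occurrence of every pattern.

    `events` is an iterable of ('start', name) and ('end', name) tuples,
    a hypothetical stand-in for SAX startElement/endElement callbacks.
    """
    parsed = [parse_xpath(p) for p in patterns]
    # One state set per open element. A state (i, k) means: the first k
    # steps of pattern i are matched and step k may match a child here.
    stack = [{(i, 0) for i in range(len(parsed))}]
    path, matches = [], []
    for kind, name in events:
        if kind == "start":
            path.append(name)
            current = set()
            for i, k in stack[-1]:
                steps = parsed[i]
                if k == len(steps):
                    continue  # pattern already complete at an ancestor
                axis, label = steps[k]
                if label == "*" or label == name:
                    current.add((i, k + 1))
                    if k + 1 == len(steps):
                        matches.append((patterns[i], "/" + "/".join(path)))
                if axis == "//":
                    # A descendant step may still match deeper in the tree.
                    current.add((i, k))
            stack.append(current)
        else:
            stack.pop()
            path.pop()
    return matches


# Events for the document <a><b><c/></b><c><b/></c></a>.
events = [
    ("start", "a"), ("start", "b"), ("start", "c"), ("end", "c"),
    ("end", "b"), ("start", "c"), ("start", "b"), ("end", "b"),
    ("end", "c"), ("end", "a"),
]
for pattern, where in match_stream(["/a//b", "//b/*", "/a/c"], events):
    print(pattern, "matches at", where)
```

In this sketch, each start event does work proportional to the number of active states, which can grow with the number of patterns. That per-event cost is what a deterministic automaton with backtracking, as proposed in the paper, aims to reduce.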

References

  1. Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. of the 36th Annual ACM Symposium on Theory of Computing, pp. 90–100 (2004)
  2. Fischer, M., Paterson, M.: String matching and other products. In: Proc. of the 7th SIAM-AMS Complexity of Computation, pp. 113–125 (1974)
  3. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (1997)
  4. Kucherov, G., Rusinowitch, M.: Matching a set of strings with variable length don’t cares. Theor. Comput. Sci. 178, 129–154 (1997)
  5. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings. Cambridge University Press, Cambridge (2002)
  6. Pinter, R.Y.: Efficient string matching. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO Advanced Science Institute Series F: Computer and System Sciences, vol. 12, pp. 11–29 (1985)
  7. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Communications of the ACM 18, 333–340 (1975)
  8. Altinel, M., Franklin, M.J.: Efficient filtering of XML documents for selective dissemination of information. In: VLDB 2000, Proc. of 26th Internat. Conf. on Very Large Data Bases, pp. 53–64 (2000)
  9. YFilter: Filtering and transformation for high-volume XML message brokering, Department of Computer Science, University of Massachusetts, Amherst, http://yfilter.cs.umass.edu
  10. Diao, Y., Altinel, M., Franklin, M.J., Zhang, H., Fischer, P.M.: Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. 28, 467–516 (2003)
  11. Green, T.J., Gupta, A., Miklau, G., Onizuka, M., Suciu, D.: Processing XML streams with deterministic automata and stream indexes. ACM Trans. Database Syst. 29, 752–788 (2004)
  12. Silvasti, P., Sippu, S., Soisalon-Soininen, E.: Schema-conscious filtering of XML documents. In: EDBT 2009, Proc. of the 12th Internat. Conf. on Extending Database Technology, pp. 970–981 (2009)
  13. Sax Project Organization: Simple API for XML (2001), http://www.saxproject.org
  14. Suciu, D.: XML data repository. The Database Research Group, University of Washington (2006), http://www.cs.washington.edu/research/xmldatasets/
  15. NewsML: News exchange format (International Press Telecommunications Council), http://www.newsml.org

Copyright information

© IFIP 2010

Authors and Affiliations

  • Panu Silvasti (1)
  • Seppo Sippu (2)
  • Eljas Soisalon-Soininen (1)

  1. School of Science and Technology, Aalto University, Finland
  2. University of Helsinki, Finland
