Experimental Analysis of an Online Dictionary Matching Algorithm for Regular Expressions with Gaps

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9125)

Abstract

Dictionary matching for regular expressions has recently attracted interest because of its many applications, including DNA sequence analysis, XML filtering, and network traffic analysis. In some applications it is enough to allow wildcard and character-class gaps in strings, but usually the full expressive power of regular expressions is needed. In this paper we present and analyze a new algorithm for online dictionary matching for regular expressions. The unique feature of our algorithm is that it builds upon an algorithm for dictionary matching of string patterns with wildcard gaps but can also handle more complex regular expressions. Our experiments used real data: regular expressions used for filtering spam e-mail. The size of the dictionary, that is, the number of different regular expressions to be matched, varied from one to 3080. To find out how our algorithm scales to much larger numbers of patterns, we made small random changes to these patterns to produce up to 100,000 patterns that are similar in style. We found that our algorithm scales very well and is at its best for 10,000–20,000 patterns. For large dictionaries our algorithm outperforms the tested competitors: GNU grep already at tens of patterns, and Google’s RE2 at hundreds of patterns.
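To make the notion of online dictionary matching with wildcard gaps concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes a simplified pattern form: each pattern is a list of nonempty keywords separated by unbounded wildcard gaps, so `["ab", "cd"]` stands for the regular expression `ab.*cd`. The names `GapMatcher`, `feed`, and `find_all` are hypothetical. Keywords are found by a naive suffix check on a short buffer; an efficient implementation would typically use a multi-keyword automaton such as Aho-Corasick instead.

```python
class GapMatcher:
    """Online matcher for patterns k1 .* k2 .* ... .* km (illustrative only).

    Assumes a nonempty dictionary and nonempty keywords.
    """

    def __init__(self, patterns):
        # Each pattern is a list of keyword strings (assumed input format).
        self.patterns = [list(p) for p in patterns]
        # matched[i]: number of leading keywords of pattern i matched so far,
        # capped so that it always indexes the keyword to look for next.
        self.matched = [0] * len(self.patterns)
        # prev_end[i]: text position just after the earliest match of the
        # previously matched keyword of pattern i.
        self.prev_end = [0] * len(self.patterns)
        # Only the last maxlen characters are needed to test keyword suffixes.
        self.maxlen = max(len(k) for p in self.patterns for k in p)
        self.tail = ""
        self.pos = 0  # number of characters consumed so far

    def feed(self, ch):
        """Consume one character; return indices of patterns matched here."""
        self.pos += 1
        self.tail = (self.tail + ch)[-self.maxlen:]
        hits = []
        for i, keywords in enumerate(self.patterns):
            m = self.matched[i]
            kw = keywords[m]
            if not self.tail.endswith(kw):
                continue
            # The keyword must start at or after the end of the previous one.
            if self.pos - len(kw) < self.prev_end[i]:
                continue
            if m == len(keywords) - 1:
                hits.append(i)  # last keyword found: an occurrence ends here
            else:
                self.matched[i] = m + 1  # advance to the next keyword
                self.prev_end[i] = self.pos
        return hits


def find_all(patterns, text):
    """Return (end_position, pattern_index) for every reported occurrence."""
    matcher = GapMatcher(patterns)
    result = []
    for ch in text:
        for i in matcher.feed(ch):
            result.append((matcher.pos, i))
    return result
```

The matcher reads the text once, one character at a time, and keeps only a per-pattern counter and a short buffer, which is what makes it online. Matching each leading keyword at its earliest possible position is sufficient here because the gaps are unbounded; bounded gaps, character classes, and full regular expressions need more state than this sketch keeps.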

This research was partially supported by the Academy of Finland.

Author information

Corresponding author

Correspondence to Eljas Soisalon-Soininen.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Saikkonen, R., Sippu, S., Soisalon-Soininen, E. (2015). Experimental Analysis of an Online Dictionary Matching Algorithm for Regular Expressions with Gaps. In: Bampis, E. (ed.) Experimental Algorithms. SEA 2015. Lecture Notes in Computer Science, vol 9125. Springer, Cham. https://doi.org/10.1007/978-3-319-20086-6_25

  • DOI: https://doi.org/10.1007/978-3-319-20086-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20085-9

  • Online ISBN: 978-3-319-20086-6

  • eBook Packages: Computer Science, Computer Science (R0)
