World Wide Web, Volume 22, Issue 6, pp 2519–2543

Efficient regular expression matching on LZ77 compressed strings using negative factors

  • Yutong Han
  • Bin Wang
  • Xiaochun Yang
  • Tao Qiu
  • Huaijie Zhu
Part of the following topical collections:
  1. Special Issue on Web and Big Data

Abstract

State-of-the-art approaches for regular expression matching on LZ78 compressed strings do not perform efficiently. Moreover, LZ78 compression has some shortcomings, such as a higher compression ratio and slower decompression speed than LZ77 (a related Lempel-Ziv scheme). In this paper, we study regular expression matching on LZ77 compressed strings. To address this problem, we propose an efficient algorithm, RELZ, which uses the positive factors (i.e., the prefix and the suffix) and the negative factors (substrings that cannot appear in an answer) of the regular expression to prune candidates. To locate these two kinds of factors quickly on the compressed string without decompression, we design a variant of the suffix trie index, called SSLZ, and construct bitmaps for the factors of the regular expression to detect candidates. Because SSLZ has a high space cost, we also propose a variant index that maintains suffixes only for phrases with high frequency, and we develop an efficient regular expression matching algorithm, RELZ+, based on this index. In addition, we propose two optimization strategies, block filtering and LZ filtering, to prune false positive candidates. Finally, we conduct a comprehensive performance evaluation on four real data sets to validate our ideas and the proposed algorithms. The experimental results show that RELZ and RELZ+ significantly outperform the existing algorithms.
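
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the RELZ algorithm: it works on a plain, uncompressed string, uses no SSLZ index or bitmaps, and takes the prefix, suffix, and negative factors as hand-supplied inputs rather than deriving them from the regular expression. The function names and the example pattern are assumptions made for illustration only. A candidate is a span that starts at a prefix occurrence and ends at a suffix occurrence; it is discarded without verification if a negative factor occurs inside it.

```python
import re


def find_all(text, factor):
    """Return every start offset at which factor occurs in text (overlaps allowed)."""
    hits, i = [], text.find(factor)
    while i != -1:
        hits.append(i)
        i = text.find(factor, i + 1)
    return hits


def match_with_factors(text, pattern, prefix, suffix, negatives):
    """Return (answers, verified): matching spans and the number of regex verifications run.

    prefix, suffix and negatives are assumed to be valid factors of pattern;
    this sketch does not compute or check them.
    """
    compiled = re.compile(pattern)
    # Half-open spans [start, end) of every negative-factor occurrence.
    neg_spans = [(p, p + len(n)) for n in negatives for p in find_all(text, n)]
    suffix_ends = [p + len(suffix) for p in find_all(text, suffix)]
    min_len = max(len(prefix), len(suffix))

    answers, verified = [], 0
    for start in find_all(text, prefix):
        for end in suffix_ends:
            if end - start < min_len:
                continue  # span too short to hold both the prefix and the suffix
            # Prune: an answer cannot contain a negative factor.
            if any(start <= ns and ne <= end for ns, ne in neg_spans):
                continue
            verified += 1
            if compiled.fullmatch(text, start, end):
                answers.append((start, end))
    return answers, verified


# Hypothetical example: "x" and "-" can never occur inside a match of ab[cd]*ef.
text = "abccef-abxef-abdef"
answers, verified = match_with_factors(text, r"ab[cd]*ef", "ab", "ef", ["x", "-"])
print(answers, verified)
```

In this example the negative factors cut the number of verified candidates from six to two, and the two surviving spans are the actual matches. The paper's contribution lies in locating these factors directly on the LZ77 compressed string, which this sketch does not attempt.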

Keywords

Regular expression, Compressed string, LZ77, Negative factor

Notes

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, U1736104, 61532021), the Liaoning BaiQianWan Talents Program, and the Fundamental Research Funds for the Central Universities (No. N171602003).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Yutong Han (1)
  • Bin Wang (1, corresponding author)
  • Xiaochun Yang (1)
  • Tao Qiu (1)
  • Huaijie Zhu (1)
  1. School of Computer Science and Engineering, Northeastern University, Shenyang, China
