Strict pattern matching under non-overlapping condition

无重叠条件的严格模式匹配

Abstract

Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and then to determine whether the pattern is frequent or not. A state-of-the-art sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We show that the problem is in P at first. Then we propose an algorithm, called NETLAP-Best, which uses Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iterates to find the rightmost root-leaf path, to prune the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.

摘要

创新点

模式匹配 (串匹配) 是计算机科学中至关重要的一个任务, 特别是在序列模式挖掘中, 因为模式匹配方法可以用来计算一个模式在序列中的支持度 (出现数), 进而判断这个模式是否频繁。 一种具有间隙约束 (可变长度通配符) 的序列模式挖掘算法采用模式的无重叠出现数目来表示这个模式的频度, 这里无重叠是指任何两个出现不能共用序列的相同位置的字符。 首先理论证明了无重叠条件的严格模式匹配的计算复杂度是 P, 然后提出了一个基于网树结构的 NETLAP-Best 算法, 该算法将模式匹配问题转换为一颗网树, 并在网树上迭代地寻找最右树根-叶子路径, 之后剪去这条路径和无用的网树结点。 之后理论证明了 NETLAP-Best 算法的完备性并分析了该算法的时间和空间复杂度。 大量实验结果验证了 NETLAP-Best 算法的正确性和有效性。

This is a preview of subscription content, access via your institution.

References

  1. 1

    Li C, Yang Q Y, Wang J Y, et al. Efficient mining of gap-constrained subsequences and its various applications. ACM Trans Knowl Discov Data, 2012, 6: 2

    MathSciNet  Article  Google Scholar 

  2. 2

    Wang P, Xu B W, Wu Y R, et al. Link prediction in social networks: the state-of-the-art. Sci China Inf Sci, 2015, 58: 011101

    Google Scholar 

  3. 3

    Liu J, Ma Z M, Feng X. Answering ordered tree pattern queries over fuzzy XML data. Knowl Inf Syst, 2015, 43: 473–495

    Article  Google Scholar 

  4. 4

    Xuan J F, Jiang H, Hu Y, et al. Towards effective bug triage with software data reduction techniques. IEEE Trans Knowl Data Eng, 2015, 27: 264–280

    Article  Google Scholar 

  5. 5

    Cook D, Krishnan N C, Rashidi P. Activity discovery and activity recognition: a new partnership. IEEE Trans Cybern, 2013, 43: 820–828

    Article  Google Scholar 

  6. 6

    Weng L N, Zhang P, Feng Z Y, et al. Short-term link quality prediction using nonparametric time series analysis. Sci China Inf Sci, 2015, 58: 082308

    Article  Google Scholar 

  7. 7

    Rajpathak D, De S. A data-and ontology-driven text mining-based construction of reliability model to analyze and predict component failures. Knowl Inf Syst, 2016, 46: 87–113

    Article  Google Scholar 

  8. 8

    Navarro G. Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput Surv, 2014, 46: 52

    Article  MATH  Google Scholar 

  9. 9

    Jiang H, Xuan J F, Ren Z L, et al. Misleading classification. Sci China Inf Sci, 2014, 57: 052106

    Google Scholar 

  10. 10

    Le H, Prasanna V K. A memory-efficient and modular approach for large-scale string pattern matching. IEEE Trans Comput, 2013, 62: 844–857

    MathSciNet  Article  Google Scholar 

  11. 11

    Claude F, Navarro G, Peltola H, et al. String matching with alphabet sampling. J Discrete Algorithms, 2012, 11: 37–50

    MathSciNet  Article  MATH  Google Scholar 

  12. 12

    Wandelt S, Deng D, Gerdjikov S, et al. State-of-the-art in string similarity search and join. ACM SIGMOD Rec, 2014, 43: 64–76

    Article  Google Scholar 

  13. 13

    Li Z, Ge T J. Online windowed subsequence matching over probabilistic sequences. In: Proceedings of ACM International Conference on Management of Data. New York: ACM, 2012. 277–288

    Google Scholar 

  14. 14

    Chen K-H, Huang G-S, Lee R C-T. Bit-parallel algorithms for exact circular string matching. Comput J, 2014, 57: 731–743

    Article  Google Scholar 

  15. 15

    Hu H, Wang H Z, Li J Z, et al. An efficient pruning strategy for approximate string matching over suffix tree. Knowl Inf Syst, 2016, 49: 121–141

    Article  Google Scholar 

  16. 16

    Li F F, Yao B, Tang M W, et al. Spatial approximate string search. IEEE Trans Knowl Data Eng, 2013, 25: 1394–1409

    Article  Google Scholar 

  17. 17

    Wu X D, Qiang J P, Xie F. Pattern matching with flexible wildcards. J Comput Sci Technol, 2014, 29: 740–750

    MathSciNet  Article  Google Scholar 

  18. 18

    Wu Y X, Wu X D, Min F, et al. A Nettree for pattern matching with flexible wildcard constraints. In: Proceeding of IEEE International Conference on Information Reuse and Integration, Las Vegas, 2010. 109–114

    Google Scholar 

  19. 19

    Retwitzer M D, Polishchuk M, Churkin E, et al. RNAPattMatch: a web server for RNA sequence/structure motif detection based on pattern matching with flexible gaps. Nucleic Acids Res, 2015, doi: 10.1093/nar/gkv435

    Google Scholar 

  20. 20

    Wang X M, Duan L, Dong G Z, et al. Efficient mining of density-aware distinguishing sequential patterns with gap constraints. In: Proceedings of International Conference Database Systems for Advanced Applications, Bali, 2014. 372–387

    Chapter  Google Scholar 

  21. 21

    Liao V C-C, Chen M-S. Efficient mining gapped sequential patterns for motifs in biological sequences. BMC Syst Biol, 2013, 7: S7

    Article  Google Scholar 

  22. 22

    Ding B L, Lo D, Han J W, et al. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of IEEE International Conference on Data Engineering, Shanghai, 2009. 1024–1035

    Google Scholar 

  23. 23

    Yang H, Duan L, Hu B, et al. Mining top-k distinguishing sequential patterns with gap constraint. J Softw, 2015, 26: 2994–3009

    MathSciNet  MATH  Google Scholar 

  24. 24

    Crochemore M, Iliopoulos C, Makris C, et al. Approximate string matching with gaps. Nordic J Comput, 2002, 9: 54–65

    MathSciNet  MATH  Google Scholar 

  25. 25

    Cantone D, Cristofaro S, Faro S. New efficient bit-parallel algorithms for the (δ, α)-matching problem with applications in music information retrieval. Int J Found Comput Sci, 2009, 20: 1087–1108

    MathSciNet  Article  MATH  Google Scholar 

  26. 26

    Cole J, Chai B, Farris R, et al. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res, 2005, 33: 294–296

    Article  Google Scholar 

  27. 27

    Cole R, Gottlieb L, Lewenstein M. Dictionary matching and indexing with errors and don’t care. In: Proceeding of Symposium on Theory of Computing, Chicago, 2004. 91–100

    Google Scholar 

  28. 28

    Zhang M H, Kao B, Cheung D W, et al. Mining periodic patterns with gap requirement from sequences. ACM Trans Knowl Discov Data, 2007, 1: 7

    Article  Google Scholar 

  29. 29

    Wu Y X, Wang L L, Ren J D, et al. Mining sequential patterns with periodic wildcard gaps. Appl Intell, 2014, 41: 99–116

    Article  Google Scholar 

  30. 30

    Wu X D, Zhu X Q, He Y, et al. PMBC: pattern mining from biological sequences with wildcard constraints. Comput Biol Med, 2013, 43: 481–492

    Article  Google Scholar 

  31. 31

    Ibrahim A, Sastry S, Sastry P S. Discovering compressing serial episodes from event sequences. Knowl Inf Syst, 2016, 47: 405–432

    Article  Google Scholar 

  32. 32

    Lam H, Mörchen F, Fradkin D, et al. Mining compressing sequential patterns. Stat Anal Data Min, 2013, 7: 34–52

    MathSciNet  Article  Google Scholar 

  33. 33

    El-Ramly M, Stroulia E, Sorenson P. From run-time behavior to usage scenarios: an interaction-pattern mining approach. In: Proceeding of ACM International Conference on Knowledge Discovery and Data Mining, Edmonton, 2002. 315–324

    Google Scholar 

  34. 34

    Bille P, Gørtz I, Vildhøj H W, et al. String matching with variable length gaps. Theor Comput Sci, 2012, 443: 25–34

    MathSciNet  Article  MATH  Google Scholar 

  35. 35

    Wu Y X, Fu S, Jiang H, et al. Strict approximate pattern matching with general gaps. Appl Intell, 2015, 42: 566–580

    Article  Google Scholar 

  36. 36

    Wu Y X, Tang Z Q, Jiang H, et al. Approximate pattern matching with gap constraints. J Inf Sci, 2016, 42: 639–658

    Article  Google Scholar 

  37. 37

    Chai X, Jia X F, Wu Y X, et al. Strict pattern matching with general gaps and one-off condition (in Chinese). J Softw, 2015, 26: 1096–1112

    MathSciNet  Google Scholar 

  38. 38

    Guo D, Hu X G, Xie F, et al. Pattern matching with wildcards and gap-Length constraints based on a centrality-degree graph. Appl Intell, 2013, 39: 57–74

    Article  Google Scholar 

  39. 39

    Wu Y X, Wu X D, Jiang H, et al. A heuristic algorithm for MPMGOOC. Chin J Comput, 2011, 34: 1452–1462

    MathSciNet  Article  Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Youxi Wu.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Wu, Y., Shen, C., Jiang, H. et al. Strict pattern matching under non-overlapping condition. Sci. China Inf. Sci. 60, 012101 (2017). https://doi.org/10.1007/s11432-015-0935-3

Download citation

Keywords

  • pattern matching
  • sequential pattern mining
  • gap constraint
  • flexible wildcard
  • non-overlapping
  • occurrence
  • Nettree

关键词

  • 模式匹配
  • 序列模式挖掘
  • 间隙约束
  • 通配符
  • 无重叠
  • 出现
  • 网树