Strict pattern matching under non-overlapping condition

Strict pattern matching under the non-overlapping condition (Chinese title, translated)

Abstract

Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (the number of occurrences) of a pattern and thereby determine whether the pattern is frequent. A state-of-the-art sequential pattern mining approach with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that no two occurrences may use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We first show that the problem is in P. We then propose an algorithm, called NETLAP-Best, which uses the Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree, iteratively finds the rightmost root-leaf path, and, after removing that path, prunes the useless nodes in the Nettree. We show that NETLAP-Best is a complete algorithm and analyse its time and space complexities. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.
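The non-overlapping condition and the "find the rightmost root-leaf path, then prune" loop can be illustrated with a small sketch. This is not the authors' NETLAP-Best implementation: the function names, the encoding of gap constraints as `(min_gap, max_gap)` pairs between consecutive pattern characters, and the choice to recompute the pruning from scratch after each path are all assumptions made for illustration, and the sketch makes no attempt to match the complexity analysed in the paper.

```python
from typing import List, Set, Tuple

Gap = Tuple[int, int]  # assumed encoding: (min_gap, max_gap) wildcards between two pattern characters


def non_overlapping(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """True if the two occurrences never use the same sequence position
    at the same pattern index (the condition stated in the abstract)."""
    return all(x != y for x, y in zip(a, b))


def find_non_overlapping(seq: str, chars: str, gaps: List[Gap]) -> List[Tuple[int, ...]]:
    """Greedy sketch: repeatedly take the rightmost root-leaf path, then prune."""
    m = len(chars)
    # Level j holds the sequence positions that match pattern character j.
    levels = [[i for i, c in enumerate(seq) if c == chars[j]] for j in range(m)]
    used: List[Set[int]] = [set() for _ in range(m)]

    def children(j: int, pos: int) -> List[int]:
        # Nodes on level j+1 that satisfy the gap constraint after position pos.
        lo, hi = gaps[j]
        return [q for q in levels[j + 1] if pos + lo + 1 <= q <= pos + hi + 1]

    def prune() -> List[Set[int]]:
        # Keep only unused nodes that can still reach an unused leaf.
        alive: List[Set[int]] = [set() for _ in range(m)]
        alive[m - 1] = {p for p in levels[m - 1] if p not in used[m - 1]}
        for j in range(m - 2, -1, -1):
            alive[j] = {
                p for p in levels[j]
                if p not in used[j] and any(q in alive[j + 1] for q in children(j, p))
            }
        return alive

    occurrences: List[Tuple[int, ...]] = []
    alive = prune()
    for root in sorted(levels[0], reverse=True):  # rightmost root first
        if root not in alive[0]:
            continue
        path = [root]
        for j in range(m - 1):
            # Rightmost child that is still alive; one exists because the parent is alive.
            path.append(max(q for q in children(j, path[-1]) if q in alive[j + 1]))
        for j, p in enumerate(path):
            used[j].add(p)
        occurrences.append(tuple(path))
        alive = prune()
    return occurrences


if __name__ == "__main__":
    print(find_non_overlapping("aabb", "ab", [(0, 1)]))
    print(find_non_overlapping("ababa", "aba", [(0, 1), (0, 1)]))
```

In the second call, sequence position 2 appears in both reported occurrences, once as the first pattern character and once as the last. That is permitted under the condition as stated, because the shared position is used at different positions of the pattern.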

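Abstract (translated from Chinese)

Highlights

Pattern matching (string matching) is a crucial task in computer science, especially in sequential pattern mining, because pattern matching methods can be used to compute the support (number of occurrences) of a pattern in a sequence and thereby determine whether the pattern is frequent. A sequential pattern mining algorithm with gap constraints (variable-length wildcards) uses the number of non-overlapping occurrences of a pattern to represent its frequency, where non-overlapping means that no two occurrences may share the character at the same position of the sequence. We first prove theoretically that strict pattern matching under the non-overlapping condition is in P, and then propose NETLAP-Best, an algorithm based on the Nettree structure. The algorithm transforms the pattern matching problem into a Nettree, iteratively finds the rightmost root-leaf path on the Nettree, and then prunes that path together with the useless Nettree nodes. We then prove the completeness of NETLAP-Best and analyse its time and space complexity. Extensive experimental results verify the correctness and effectiveness of NETLAP-Best.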

Author information

Correspondence to Youxi Wu.

About this article

Cite this article

Wu, Y., Shen, C., Jiang, H. et al. Strict pattern matching under non-overlapping condition. Sci. China Inf. Sci. 60, 012101 (2017). https://doi.org/10.1007/s11432-015-0935-3

Keywords

  • pattern matching
  • sequential pattern mining
  • gap constraint
  • flexible wildcard
  • non-overlapping
  • occurrence
  • Nettree

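Keywords (translated from Chinese)

  • pattern matching
  • sequential pattern mining
  • gap constraint
  • wildcard
  • non-overlapping
  • occurrence
  • Nettree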