Strict pattern matching under non-overlapping condition

  • Youxi Wu
  • Cong Shen
  • He Jiang
  • Xindong Wu
Research Paper

Abstract

Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (the number of occurrences) of a pattern and thus to determine whether the pattern is frequent. A state-of-the-art approach to sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that no two occurrences may use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We first show that the problem is in P. We then propose an algorithm, called NETLAP-Best, which uses the Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree, then iteratively finds the rightmost root-leaf path, removes it, and prunes the nodes in the Nettree that have become useless. We show that NETLAP-Best is a complete algorithm and analyse its time and space complexities. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.
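The non-overlapping condition described in the abstract can be illustrated with a small brute-force sketch. This is not NETLAP-Best and does not build a Nettree; it only enumerates occurrences and checks the condition. The gap format (a `(lo, hi)` pair bounding the number of characters skipped between consecutive pattern characters), the 0-based positions, and the function names `occurrences` and `is_non_overlapping` are assumptions made for illustration.

```python
def occurrences(seq, pattern, gaps):
    """Enumerate every occurrence of `pattern` in `seq` by brute force.

    Assumed convention: gaps[j] = (lo, hi) bounds how many characters
    may be skipped between pattern[j] and pattern[j + 1].
    Each occurrence is a tuple of 0-based sequence positions.
    """
    results = []

    def extend(prefix):
        j = len(prefix)
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(seq))
        else:
            lo, hi = gaps[j - 1]
            first = prefix[-1] + 1 + lo
            last = min(prefix[-1] + 1 + hi, len(seq) - 1)
            candidates = range(first, last + 1)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(prefix + [pos])

    extend([])
    return results


def is_non_overlapping(occs):
    """True if no two occurrences use the same sequence position
    at the same pattern index."""
    for column in zip(*occs):  # one column per pattern index
        if len(set(column)) != len(column):
            return False
    return True


if __name__ == "__main__":
    print(occurrences("abab", "ab", [(0, 2)]))   # [(0, 1), (0, 3), (2, 3)]
    print(is_non_overlapping([(0, 1), (2, 3)]))  # True
    print(is_non_overlapping([(0, 1), (0, 3)]))  # False: both use position 0 for 'a'
    print(is_non_overlapping([(0, 1), (1, 2)]))  # True: position 1 is reused, but at a different pattern index
```

The last call shows the point of the definition: a sequence character may appear in two occurrences as long as it matches a different pattern position in each.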

Keywords

pattern matching, sequential pattern mining, gap constraint, flexible wildcard, non-overlapping, occurrence, Nettree

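Strict pattern matching under the non-overlapping condition

Abstract

Highlights

Pattern matching (string matching) is a vital task in computer science, especially in sequential pattern mining, because pattern matching methods can compute the support (number of occurrences) of a pattern in a sequence and thereby determine whether the pattern is frequent. One sequential pattern mining algorithm with gap constraints (variable-length wildcards) uses the number of non-overlapping occurrences of a pattern to represent its frequency; here, non-overlapping means that no two occurrences may share the character at the same position of the sequence. We first prove theoretically that strict pattern matching under the non-overlapping condition is in P, and then propose NETLAP-Best, an algorithm based on the Nettree structure. The algorithm converts the pattern matching problem into a Nettree, iteratively finds the rightmost root-leaf path on the Nettree, and then prunes this path together with the useless Nettree nodes. We then prove the completeness of NETLAP-Best and analyse its time and space complexity. Extensive experimental results verify the correctness and effectiveness of NETLAP-Best.

Keywords

pattern matching, sequential pattern mining, gap constraint, wildcard, non-overlapping, occurrence, Nettree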

Copyright information

© Science China Press and Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. School of Computer Science and Engineering, Hebei University of Technology, Tianjin, China
  2. Hebei Province Key Laboratory of Big Data Calculation, Tianjin, China
  3. School of Computer Science and Technology, Tianjin University, Tianjin, China
  4. School of Software, Dalian University of Technology, Dalian, China
  5. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China
  6. School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, USA
