New Generation Computing, Volume 32, Issue 2, pp 95–122

A Matching Algorithm in PMWL based on CluTree


Abstract

Pattern matching with wildcards and length constraints (PMWL) is a complex problem with important applications in bioinformatics, network security and information retrieval. Existing algorithms use the traditional left-most strategy when selecting among multiple candidate matching positions, which leads to incomplete final matching results. This paper presents a new data structure, CluTree, and a new matching algorithm, RBCT, based on it. A CluTree is a cluster of trees with red and black nodes built from a pattern P and a text T. The RBCT algorithm uses the sharing degree, correlation degree and mixed information entropy of each node in the CluTree for path selection and dynamic pruning. It traverses the CluTree and, under the one-off condition, finds more occurrences than the existing algorithms in linear time. Theoretical analysis and experimental results show that the RBCT algorithm outperforms its peers in retrieval precision and matching efficiency.
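The incompleteness of a left-most strategy under the one-off condition can be seen on a toy instance. The sketch below is not the RBCT algorithm and does not build a CluTree; it is a minimal illustration under stated assumptions. The function names (`occurrences`, `leftmost_one_off`, `best_one_off`), the encoding of a pattern as a character string plus a list of `(min, max)` wildcard gaps, and the example text are all hypothetical. Global length constraints are omitted, and the one-off condition is taken to mean that each text position is used by at most one occurrence.

```python
def occurrences(text, chars, gaps):
    """List every position tuple matching the pattern, ignoring one-off.

    chars[j] is the j-th pattern character; gaps[j] = (lo, hi) bounds the
    number of wildcards between chars[j] and chars[j + 1].
    Tuples are produced in lexicographic (left-most first) order.
    """
    results = []

    def extend(prefix):
        j = len(prefix)
        if j == len(chars):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(text))
        else:
            lo, hi = gaps[j - 1]
            first = prefix[-1] + lo + 1
            last = min(prefix[-1] + hi + 1, len(text) - 1)
            candidates = range(first, last + 1)
        for pos in candidates:
            if text[pos] == chars[j]:
                extend(prefix + [pos])

    extend([])
    return results


def leftmost_one_off(text, chars, gaps):
    """Simplified left-most strategy: accept occurrences in left-most
    order whenever none of their positions has been used already."""
    used, chosen = set(), []
    for occ in occurrences(text, chars, gaps):
        if used.isdisjoint(occ):
            chosen.append(occ)
            used.update(occ)
    return chosen


def best_one_off(text, chars, gaps):
    """Exhaustive search for a largest set of position-disjoint
    occurrences. Exponential time; only meant for tiny examples."""
    occs = occurrences(text, chars, gaps)

    def search(i, used):
        if i == len(occs):
            return []
        best = search(i + 1, used)  # option 1: skip occurrence i
        if used.isdisjoint(occs[i]):  # option 2: take occurrence i
            taken = [occs[i]] + search(i + 1, used | set(occs[i]))
            if len(taken) > len(best):
                best = taken
        return best

    return search(0, frozenset())


if __name__ == "__main__":
    # Pattern: 'a', then 1 to 2 wildcards, then 'a'.
    text, chars, gaps = "abaaa", "aa", [(1, 2)]
    print(leftmost_one_off(text, chars, gaps))
    print(best_one_off(text, chars, gaps))
```

On this input the left-most strategy pairs position 0 with position 2, which leaves no partner for any later start, while the exhaustive search pairs 0 with 3 and 2 with 4. The exhaustive search is only a reference point for small inputs and says nothing about how RBCT selects paths.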

Keywords

Wildcard Matching One-Off CluTree RBCT 


Copyright information

© Ohmsha and Springer Japan 2014

Authors and Affiliations

  • Yingling Liu (1, 2)
  • Xindong Wu (1, 3)
  • Xue-gang Hu (1)
  • Jun Gao (1)

  1. School of Computer Science & Information Engineering, Hefei University of Technology, Hefei, China
  2. School of Physics, University of Science and Technology of China, Hefei, China
  3. Department of Computer Science, University of Vermont, Burlington, USA
