Abstract
Pattern matching with wildcards and length constraints (PMWL) is a complex problem with important applications in bioinformatics, network security, and information retrieval. Existing algorithms use the traditional left-most strategy when selecting among multiple candidate matching positions, which leads to incomplete matching results. This paper presents a new data structure, CluTree, and a new matching algorithm, RBCT*1, based on it. A CluTree is a cluster of trees with red and black nodes built from a pattern P and a text T. The RBCT algorithm uses the sharing degree, correlation degree, and mixed information entropy of each CluTree node for path selection and dynamic pruning. Traversing the CluTree, RBCT finds more occurrences than existing algorithms under the one-off condition, in linear time. Theoretical analysis and experimental results show that RBCT outperforms its peers in retrieval precision and matching efficiency.
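To make the left-most issue concrete, the following is a minimal Python sketch on a toy input. It is not the RBCT algorithm and does not build a CluTree, nor does it reimplement any published matcher. It assumes the common PMWL formulation, which the abstract does not spell out: a pattern is a sequence of characters with a (lo, hi) bound on the number of wildcards between consecutive characters, an occurrence is a tuple of text positions, and the one-off condition means each text position is used by at most one occurrence. Global length constraints are omitted, and the function names `occurrences`, `leftmost_one_off`, and `best_one_off` are hypothetical.

```python
from itertools import combinations


def occurrences(text, chars, gaps):
    """List every occurrence (tuple of positions) in lexicographic order.

    gaps[j] = (lo, hi) bounds the number of wildcards between
    chars[j] and chars[j + 1]. Assumed formulation, for illustration only.
    """
    results = []

    def extend(prefix):
        j = len(prefix)
        if j == len(chars):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(text))
        else:
            lo, hi = gaps[j - 1]
            first = prefix[-1] + 1 + lo
            last = min(prefix[-1] + 1 + hi, len(text) - 1)
            candidates = range(first, last + 1)
        for i in candidates:
            if text[i] == chars[j]:
                extend(prefix + [i])

    extend([])
    return results


def leftmost_one_off(occs):
    """Toy left-most heuristic: scan occurrences left to right and keep
    each one whose positions are all still unused."""
    used, chosen = set(), []
    for occ in occs:
        if used.isdisjoint(occ):
            chosen.append(occ)
            used.update(occ)
    return chosen


def best_one_off(occs):
    """Exhaustive search for a largest set of pairwise-disjoint occurrences.
    Exponential time; only meant for tiny inputs like the one below."""
    for k in range(len(occs), 0, -1):
        for subset in combinations(occs, k):
            positions = [i for occ in subset for i in occ]
            if len(positions) == len(set(positions)):
                return list(subset)
    return []


# Toy input: pattern "a", 1 to 2 wildcards, "a" against the text "axaaa".
occs = occurrences("axaaa", "aa", [(1, 2)])
print(occs)                    # [(0, 2), (0, 3), (2, 4)]
print(leftmost_one_off(occs))  # [(0, 2)]
print(best_one_off(occs))      # [(0, 3), (2, 4)]
```

On this input the left-most choice pairs position 0 with position 2, which blocks the only occurrence that could have used position 4, so one occurrence is reported where two disjoint ones exist. The exhaustive search is only a reference point for tiny inputs; the paper's contribution is to guide this selection using node-level measures in the CluTree rather than by enumeration.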
Additional information
*1 RBCT stands for pattern matching with wildcards in a cluster of trees with Red and Black nodes, called CluTree.
About this article
Cite this article
Liu, Y., Wu, X., Hu, Xg. et al. A Matching Algorithm in PMWL based on CluTree. New Gener. Comput. 32, 95–122 (2014). https://doi.org/10.1007/s00354-014-0201-3
DOI: https://doi.org/10.1007/s00354-014-0201-3