Advertisement

Knowledge and Information Systems

, Volume 49, Issue 1, pp 121–141 | Cite as

An efficient pruning strategy for approximate string matching over suffix tree

  • Huan Hu
  • Hongzhi WangEmail author
  • Jianzhong Li
  • Hong Gao
Regular Paper

Abstract

Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of the most excellent algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of pattern string because it searches \(m+k\) characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. Particularly, we proved that if the pruning strategy is adopted, it averagely searches O(k) characters on each path before backtracking instead of O(m). Considering each internal node of suffix tree has multiple branches, the pruning strategy should work very well. We also experimentally show that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.

Keywords

Approximate string matching Suffix tree Depth-first Backtracking Dynamic programming Bit-parallelism 

Notes

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099, 61133002 and National Sci-Tech Support Plan 2015BAH10F00. Partial Experiments are conducted on the Xing cloud of the DEKE Lab, Renmin University of China.

References

  1. 1.
    Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23Google Scholar
  2. 2.
    Baeza-Yates RA, Navarro G (1999) Faster approximate string matching. Algorithmica 23(2):127–158MathSciNetCrossRefzbMATHGoogle Scholar
  3. 3.
    Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. ACM J Exp Algorithmics 16:1.1:1.1–1.1:1.91. doi: 10.1145/1963190.1963191
  4. 4.
    Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812CrossRefGoogle Scholar
  5. 5.
    Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184Google Scholar
  6. 6.
    Danek A, Deorowicz S, Grabowski S (2014) Indexing large genome collections on a PC. In: CoRR. arXiv:1403.7481
  7. 7.
    French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15Google Scholar
  8. 8.
    Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160MathSciNetCrossRefGoogle Scholar
  9. 9.
    Hyyrö H, Fredriksson K, Navarro G (2005) Increased bit-parallelism for approximate and multiple string matching. J Exp Algorithmics 10:2.6. doi: 10.1145/1064546.1180617
  10. 10.
    Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710MathSciNetzbMATHGoogle Scholar
  11. 11.
    Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901CrossRefGoogle Scholar
  12. 12.
    Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88CrossRefGoogle Scholar
  13. 13.
    Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239MathSciNetGoogle Scholar
  14. 14.
    Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, CambridgeCrossRefzbMATHGoogle Scholar
  15. 15.
    Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithms 1(4):359–373MathSciNetCrossRefzbMATHGoogle Scholar
  16. 16.
    Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160Google Scholar
  17. 17.
    Ukkonen E (1985) Finding approximate patterns in strings. J Algorithms 6(1):132–137MathSciNetCrossRefzbMATHGoogle Scholar
  18. 18.
    Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242Google Scholar
  19. 19.
    Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260MathSciNetCrossRefzbMATHGoogle Scholar
  20. 20.
    Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545Google Scholar
  21. 21.
    Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91CrossRefGoogle Scholar
  22. 22.
    Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972Google Scholar
  23. 23.
    Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USAGoogle Scholar
  24. 24.
    Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172Google Scholar

Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  1. 1.School of Computer Science and TechnologyHarbin Institute of TechnologyHarbinChina

Personalised recommendations