Space-efficient computation of parallel approximate string matching

The Journal of Supercomputing

Abstract

Approximate string matching (ASM) has applications in many disciplines, ranging from information retrieval to gene matching. The conventional solution to this problem is based on dynamic programming and has quadratic space and time complexity, which makes it impractical for searching queries in huge sequences of billions of characters. Many approaches have therefore been proposed to reduce the space and time requirements of the basic solution, including heuristic, filtration, and index-based methods. These existing solutions obtain better performance by compromising on the completeness of the search. In this paper, we propose a linear-space algorithm for the approximate string matching problem that retains the time complexity of the conventional solution. The proposed method works in linear space without omitting any region of the given text; hence, it finds all possible matches. The conventional dynamic programming solution is modified so that storing the complete traceback table is avoided by keeping only a running count of each edit operation in memory. A variety of properties of the classical dynamic programming table are established for this purpose. We also present a parallel version of the proposed algorithm to improve its running time. The algorithm is evaluated on CUDA-enabled GPUs using DNA sequences of sizes between 250 and 970 MBP. Experiments on natural language text are also performed to highlight the broader applicability of the proposed algorithm. Results show the substantial superiority of the algorithm in terms of performance and scalability compared to state-of-the-art algorithms.
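The general idea of replacing the traceback table with running counts can be illustrated with a minimal sequential sketch. It is not the authors' implementation (their code is in the repository listed under Data availability); it is the classical column-wise dynamic programming scheme for ASM, keeping only two columns and carrying, as an assumption about what "running count" could look like, a tuple of edit-operation counts in every cell. The function name `approx_matches`, the tuple layout, and the tie-breaking order are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Each cell holds (distance, substitutions, insertions, deletions).
# Edits are counted relative to turning the pattern into the matched text.
Cell = Tuple[int, int, int, int]


def approx_matches(pattern: str, text: str, k: int) -> List[Tuple[int, int, int, int, int]]:
    """Report (end_index, distance, subs, ins, dels) for every text position
    where some substring ending there is within edit distance k of pattern.

    Uses O(len(pattern)) space: only the previous and current DP columns.
    """
    m = len(pattern)
    # Column for the empty text prefix: i pattern characters are all deleted.
    prev: List[Cell] = [(i, 0, 0, i) for i in range(m + 1)]
    hits = []
    for j, ch in enumerate(text):
        # Row 0 is always zero: a match may start at any text position.
        cur: List[Cell] = [(0, 0, 0, 0)]
        for i in range(1, m + 1):
            d, s, ins, dele = prev[i - 1]
            if pattern[i - 1] == ch:
                diag = (d, s, ins, dele)          # match, no cost
            else:
                diag = (d + 1, s + 1, ins, dele)  # substitution
            d, s, ins, dele = prev[i]
            left = (d + 1, s, ins + 1, dele)      # extra text character
            d, s, ins, dele = cur[i - 1]
            up = (d + 1, s, ins, dele + 1)        # unmatched pattern character
            # Ties are broken in the order diag, left, up (an arbitrary choice).
            cur.append(min(diag, left, up, key=lambda c: c[0]))
        if cur[m][0] <= k:
            hits.append((j,) + cur[m])
        prev = cur
    return hits


if __name__ == "__main__":
    # Prints [(3, 1, 0, 0, 1), (4, 0, 0, 0, 0), (5, 1, 0, 1, 0)]
    print(approx_matches("abc", "xxabcxx", 1))
```

Because each cell already carries its operation counts, the edit breakdown of a reported match is available without revisiting earlier columns. The time cost remains proportional to the product of the pattern and text lengths, as in the conventional solution. When several alignments share the minimum distance, the counts reflect only the one selected by the tie-breaking rule.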

Data availability

All source code is publicly available on GitHub (https://github.com/sadiqumair/Space-efficient-Computation-of-Parallel-Approximate-String-Matching). The existing open-source datasets used in this article are available from: (1) NCBI (ftp.ncbi.nlm.nih.gov/), (2) Smart (https://github.com/smart-tool/smart).

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Contributions

MUS and MMY have contributed equally to the manuscript.

Corresponding author

Correspondence to Muhammad Umair Sadiq.

Ethics declarations

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Competing interests

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadiq, M.U., Yousaf, M.M. Space-efficient computation of parallel approximate string matching. J Supercomput 79, 9093–9126 (2023). https://doi.org/10.1007/s11227-022-05038-6

  • DOI: https://doi.org/10.1007/s11227-022-05038-6