Using Maximum Subarrays for Approximate String Matching

Annals of Data Science

Abstract

In this paper, we evaluate maximum subarrays for approximate string matching and alignment. Both the global alignment score and the local sub-alignments are indicators of a good alignment. After showing how maximum subarrays can be used for string matching, we provide several ways of using them: long, short, loose, strict, and top-k. While the long version extends the local sub-alignments, the short method avoids extensions that would not increase the alignment score. The loose method tries to achieve a high global score, whereas the strict method converts the output of the loose alignment by minimizing unnecessary gaps. The top-k method is used to find the top-k sub-alignments. The results are compared with a global and a local dynamic programming method (Needleman–Wunsch and Smith–Waterman), both of which use gap penalties, as well as with one state-of-the-art method. In our experiments, using maximum subarrays generated good overall alignments as well as good local sub-alignments without requiring gap penalties.
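
The basic idea of reading a sub-alignment off a maximum subarray can be illustrated with a minimal sketch. It is not the paper's long, short, loose, strict, or top-k method. It assumes two equal-length strings compared position by position without gaps, a hypothetical scoring of +1 for a match and -1 for a mismatch, and Kadane's linear-time maximum subarray algorithm. The function names `position_scores` and `max_subarray` are introduced here for illustration only.

```python
def position_scores(a, b, match=1, mismatch=-1):
    """Score two equal-length strings position by position (no gaps).

    The +1/-1 scoring is an assumption of this sketch; a substitution
    matrix could be used instead.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length strings")
    return [match if x == y else mismatch for x, y in zip(a, b)]


def max_subarray(scores):
    """Kadane's algorithm.

    Returns (best_sum, start, end) with `end` exclusive. The empty
    subarray (sum 0) is returned when every score is non-positive.
    """
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end


a = "GATTACAXX"
b = "CATTAGAXY"
score, start, end = max_subarray(position_scores(a, b))
print(score, a[start:end], b[start:end])
```

In this toy case the highest-scoring contiguous region tolerates one interior mismatch because the matches on either side outweigh it, which is the sense in which a maximum subarray acts as an approximate local match.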

Author information

Corresponding author

Correspondence to Ramazan S. Aygun.

Appendices

Appendix A: Detailed Experimental Results

Table 2 provides the experimental results of the pairwise alignments. The columns represent the methods used: long, short, loose, strict, top-2 (the second best sub-alignment), top-3 (the third best sub-alignment), Needleman–Wunsch (NW), and Smith–Waterman (SW). For each alignment, we provide its score and length. In Table 2, ‘scr’ gives the score of the overall alignment, whereas ‘\(scr_s\)’ gives the score of the best sub-alignment (except for top-2 and top-3).

Table 2 Experimental results

Appendix B: Sample Alignments

In this section, we provide sample alignments by the algorithms mentioned in the paper. The base protein that we use for sample alignment is \(>gi|391863660|gb|EIT72965.1|\) G-protein alpha subunit [Aspergillus oryzae 3.042]

\(>gi|635505190|gb|KDE77264.1|\) G-protein [Aspergillus oryzae 100-8]:

MESERGIKPTSHCPQLSLGHVSYQAAQPGYKSSQCDQSPEKHTRGTNLAKGPPSRLLEKRQVDG

SKVTVLESLSNHLRQLLEVGEGRYDTRDSVDEEYAVMRETSDSVAQLLESLYVLPVKQGQGRMA

EICTDMTDIWGYVQKCGFLDEVKLECIDDGAEYLLNSLDRITEPNYIPTLEDTMWCYTKSTGIT

MARYTNGPSEVIFCDASGSRGERKKWGRIFDGATKVLYFVDAGSYDQCLTEEHNANRLAEELTL

FNSVCSTERLNHVEIVLFIHKMDKLERKLKTVPFDSTEIENWGTFSGDPQSVDDVKDYLYNTFS

AIAQKSSRSISVTFTSLRRPEEFGKTILSYASSVVNMV

We compare it with \(>gi|511009576|gb|EPB90817.1|\) guanine nucleotide-binding protein subunit alpha [Mucor circinelloides f. circinelloides 1006PhL]

\(>gi|511009576|gb|EPB90817.1|\) guanine nucleotide-binding protein subunit alpha [Mucor circinelloides f. circinelloides 1006PhL]

MGCCASVEESDSVGKLRNEEIDGQLRMEKLNNKNEVKLLLLGAGESGKSTILKQMKLIHDGGFT

PEEKETYKEIIFSNSVQSIHVLLEAMETLDIPLDDASNQGYYDYIMDQYQKMDYFSMPPELVKA

IRMLWQDKGVQEAHSRRNEFQLNDSASYYFDSIDRIGDPNYLPTDQDVLRSRVKTTGIAESKFT

FGTLTYRMFDVGGQRSERKKWIHCFEDVTAIIFLVAISEYDQVLIEDESVNRMQEALTLFDSIC

NSRWFERTSTILFLNKTDLFKQKLPTSPLVDYFPDYKGANDDYEEASNYIMQRFISLNSSAEKQ

VYTHLTCATDTEQVKFVMAAVNDIVLQTNLRDVGLI

The alignment for our loose method is given below and the sub-alignment is highlighted in gray.

figure l

The alignment for our strict method is given below and the sub-alignment is highlighted in gray.

figure m

The alignment for Needleman–Wunsch (with highlighted sub-alignment) is provided below.

figure n

The Smith–Waterman algorithm’s sub-alignment is given below:

figure o

The alignment of AlignMe is given below with sub-alignment highlighted:

figure p

Cite this article

Aygun, R.S. Using Maximum Subarrays for Approximate String Matching. Ann. Data. Sci. 4, 503–531 (2017). https://doi.org/10.1007/s40745-017-0117-0
