Abstract
In this paper, we evaluate maximum subarrays for approximate string matching and alignment. Both the global alignment score and the local sub-alignments are indicators of a good alignment. After showing how maximum subarrays can be used for string matching, we provide several ways of using them: long, short, loose, strict, and top-k. While the long version extends the local sub-alignments, the short method avoids extensions that would not increase the alignment score. The loose method tries to achieve a high global score, whereas the strict method converts the output of the loose alignment by minimizing unnecessary gaps. The top-k method finds the top-k sub-alignments. The results are compared with two dynamic programming methods that use gap penalties, one global and one local, in addition to one of the state-of-the-art methods. In our experiments, using maximum subarrays generated good overall as well as local sub-alignments without requiring gap penalties.
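To make the core idea concrete, the following is a minimal sketch of how a maximum subarray can locate a high-scoring local region. It is not the paper's implementation. It assumes an ungapped, position-by-position comparison and an illustrative scoring of +1 for a match and -1 for a mismatch; the function names `position_scores` and `max_subarray` are hypothetical. The scan itself is the standard linear-time maximum subarray algorithm (Kadane's algorithm).

```python
from typing import List, Tuple


def position_scores(a: str, b: str, match: int = 1, mismatch: int = -1) -> List[int]:
    """Score each column of an ungapped comparison (assumed scoring scheme)."""
    n = min(len(a), len(b))
    return [match if a[i] == b[i] else mismatch for i in range(n)]


def max_subarray(scores: List[int]) -> Tuple[int, int, int]:
    """Kadane's algorithm.

    Returns (best_sum, start, end) with `end` exclusive, or (0, 0, 0)
    when no run has a positive sum.
    """
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            # A non-positive prefix can never help, so restart the run here.
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_sum, best_start, best_end


if __name__ == "__main__":
    scores = position_scores("GATTACAXX", "CATTACAYY")
    print(max_subarray(scores))
```

Because the scan needs no gap penalty parameter, the only tunable values in this sketch are the assumed match and mismatch scores; a substitution matrix could replace them for protein sequences.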
References
Altschul SF, Carroll RJ, Lipman DJ (1989) Weights for data related by a tree. J Mol Biol 207(4):647–653
Altschul SF, Erickson BW (1986) Optimal sequence alignment using affine gap costs. Bull Math Biol 48(5–6):603–616
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389
Anisimova M, Cannarozzi G, Liberles D (2010) Finding the balance between the mathematical and biological optima in multiple sequence alignment. Trends Evol Biol 2(1):7
Aygun RS (2007) S2s: structural-to-syntactic matching similar documents. Knowl Inf Syst 16(3):303–329
Beebe NL, Clark JG (2007) Digital forensic text string searching: improving information retrieval effectiveness by thematically clustering search results. Dig Investig 4(Supplement):49–54
Bentley JL (2000) Programming pearls. Addison-Wesley Professional, Reading
Bille P, Gørtz IL, Vildhøj HW, Wind DK (2010) String matching with variable length gaps. In: Chavez E, Lonardi S (eds) String processing and information retrieval, number 6393 in Lecture Notes in Computer Science. Springer, Berlin, pp 385–394. doi:10.1007/978-3-642-16321-0_40
Breimer E, Goldberg M (2002) Learning significant alignments: an alternative to normalized local alignment. Springer, Berlin, pp 37–45
Brudno M, Malde S, Poliakov A, Do Chuong B, Couronne O, Dubchak I, Batzoglou S (2003) Glocal alignment: finding rearrangements during alignment. Bioinformatics 19(Suppl 1):i54–i62
Choi Y (2012) A fast computation of pairwise sequence alignment scores between a protein and a set of single-locus variants of another protein. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedicine, BCB ’12. ACM, New York, pp 414–417
Choi Y, Chan AP (2015) PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics 31(16):2745
Clough P, Department Of Information Studies (2003) Old and new challenges in automatic plagiarism detection. In: National Plagiarism Advisory Service. http://ir.shef.ac.uk/cloughie/index.html, pp 391–407
Eddy SR (2004) Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 22(8):1035–1036
Feng X, Jin H, Zheng R, Zhu L, Dai W (2015) Accelerating Smith–Waterman alignment of species-based protein sequences on GPU. Int J Parallel Program 43(3):359–380
Gondro C, Kinghorn BP (2007) A simple genetic algorithm for multiple sequence alignment. Genet Mol Res 6(4):964–982
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 89(22):10915–10919
Huang W, Umbach DM, Li L (2006) Accurate anchoring alignment of divergent sequences. Bioinformatics 22(1):29–34
Jian Y, Xiu Y, Meng D (2010) Application of approximate string matching in video retrieval. In: 2010 3rd international conference on advanced computer theory and engineering (ICACTE), vol 4, pp V4-348–V4-351
Kandadi H, Aygun RS (2015) SEAL: a divide-and-conquer approach for sequence alignment. Netw Model Anal Health Inf Bioinform 4(1):1–11
Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5):473–483
Mount D (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press, New York
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14(5):407–422
Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. In: Methods in Enzymology, vol 183. Academic Press, London, pp 63–98
Peiravi A (2010) Application of string matching in Internet Security and Reliability. J Am Sci 6(1):25–33
Raad E, Chbeir R, Dipanda A (2010) User profile matching in social networks. In: 2010 13th international conference on network-based information systems (NBiS), pp 297–304
SaiKrishna V, Rasool A, Khare N (2012) String matching and its applications in diversified fields. Int J Comput Sci Issues 9(1):219–226
Söding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
Stamm M, Staritzbichler R, Khafizov K, Forrest LR (2014) AlignMe: a membrane protein sequence alignment web server. Nucleic Acids Res 42(W1):W246
Tang CL, Xie L, Koh IYY, Posy S, Alexov E, Honig B (2003) On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J Mol Biol 334(5):1043–1062
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673
Thompson JD, Higgins DG, Gibson TJ (1994) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Bioinformatics 10(1):19–29
Vingron M (1996) Near-optimal sequence alignment. Curr Opin Struct Biol 6(3):346–352
Zachariah MA, Crooks GE, Holbrook SR, Brenner SE (2005) A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins Struct Funct Bioinform 58(2):329–338
Zhao G, Ling C, Sun D (2015) Sparksw: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, pp 845–852
Appendices
Appendix A: Detailed Experimental Results
Table 2 provides the experimental results of the pairwise alignments. The columns represent the methods used: long, short, loose, strict, top2 (the second best sub-alignment), top3 (the third best sub-alignment), Needleman–Wunsch (NW), and Smith–Waterman (SW). For each alignment, we provide its score and length. In Table 2, ‘scr’ provides the score of the overall alignment whereas ‘\(scr_s\)’ indicates the score of the best alignment (except for top-2 and top-3).
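The top2 and top3 columns report lower-ranked sub-alignments. As an illustration only, the sketch below shows one simple way to obtain several non-overlapping high-scoring runs from a list of per-position scores: take the maximum subarray, then search again in the regions to its left and right. This greedy extraction is an assumption made for the example and is not necessarily the top-k method evaluated in the paper; the names `kadane` and `top_k_subarrays` are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (sum, start, end) with end exclusive


def kadane(scores: List[int], lo: int, hi: int) -> Run:
    """Maximum subarray of scores[lo:hi]; (0, lo, lo) if no positive run exists."""
    best: Run = (0, lo, lo)
    run_sum, run_start = 0, lo
    for i in range(lo, hi):
        if run_sum <= 0:
            run_sum, run_start = scores[i], i
        else:
            run_sum += scores[i]
        if run_sum > best[0]:
            best = (run_sum, run_start, i + 1)
    return best


def top_k_subarrays(scores: List[int], k: int) -> List[Run]:
    """Greedily extract up to k non-overlapping positive-sum runs, best first."""
    segments = [(0, len(scores))]  # regions not yet claimed by a run
    found: List[Run] = []
    for _ in range(k):
        candidates = [(kadane(scores, lo, hi), (lo, hi)) for lo, hi in segments]
        (total, start, end), seg = max(candidates, key=lambda c: c[0][0])
        if total <= 0:
            break  # nothing positive is left
        found.append((total, start, end))
        # Split the chosen region around the extracted run.
        segments.remove(seg)
        segments += [(seg[0], start), (end, seg[1])]
    return found


if __name__ == "__main__":
    print(top_k_subarrays([1, 1, 1, -2, 1, 1, -2, 1], 3))
```

Each round rescans the remaining regions, so the sketch costs O(k·n) time for n scores, which is adequate for small k such as the top-2 and top-3 cases reported here.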
Appendix B: Sample Alignments
In this section, we provide sample alignments produced by the algorithms mentioned in the paper. The base protein that we use for the sample alignment is >gi|391863660|gb|EIT72965.1| G-protein alpha subunit [Aspergillus oryzae 3.042]
>gi|635505190|gb|KDE77264.1| G-protein [Aspergillus oryzae 100-8]:
MESERGIKPTSHCPQLSLGHVSYQAAQPGYKSSQCDQSPEKHTRGTNLAKGPPSRLLEKRQVDG
SKVTVLESLSNHLRQLLEVGEGRYDTRDSVDEEYAVMRETSDSVAQLLESLYVLPVKQGQGRMA
EICTDMTDIWGYVQKCGFLDEVKLECIDDGAEYLLNSLDRITEPNYIPTLEDTMWCYTKSTGIT
MARYTNGPSEVIFCDASGSRGERKKWGRIFDGATKVLYFVDAGSYDQCLTEEHNANRLAEELTL
FNSVCSTERLNHVEIVLFIHKMDKLERKLKTVPFDSTEIENWGTFSGDPQSVDDVKDYLYNTFS
AIAQKSSRSISVTFTSLRRPEEFGKTILSYASSVVNMV
We compare it with >gi|511009576|gb|EPB90817.1| guanine nucleotide-binding protein subunit alpha [Mucor circinelloides f. circinelloides 1006PhL]
>gi|511009576|gb|EPB90817.1| guanine nucleotide-binding protein subunit alpha [Mucor circinelloides f. circinelloides 1006PhL]
MGCCASVEESDSVGKLRNEEIDGQLRMEKLNNKNEVKLLLLGAGESGKSTILKQMKLIHDGGFT
PEEKETYKEIIFSNSVQSIHVLLEAMETLDIPLDDASNQGYYDYIMDQYQKMDYFSMPPELVKA
IRMLWQDKGVQEAHSRRNEFQLNDSASYYFDSIDRIGDPNYLPTDQDVLRSRVKTTGIAESKFT
FGTLTYRMFDVGGQRSERKKWIHCFEDVTAIIFLVAISEYDQVLIEDESVNRMQEALTLFDSIC
NSRWFERTSTILFLNKTDLFKQKLPTSPLVDYFPDYKGANDDYEEASNYIMQRFISLNSSAEKQ
VYTHLTCATDTEQVKFVMAAVNDIVLQTNLRDVGLI
The alignment for our loose method is given below and the sub-alignment is highlighted in gray.
The alignment for our strict method is given below and the sub-alignment is highlighted in gray.
The alignment for Needleman–Wunsch (with highlighted sub-alignment) is provided below.
The Smith–Waterman algorithm’s sub-alignment is given below:
The alignment of AlignMe is given below with sub-alignment highlighted:
Cite this article
Aygun, R.S. Using Maximum Subarrays for Approximate String Matching. Ann. Data. Sci. 4, 503–531 (2017). https://doi.org/10.1007/s40745-017-0117-0