Using Maximum Subarrays for Approximate String Matching

Aygun, Ramazan S.

doi:10.1007/s40745-017-0117-0

Published: 19 July 2017

Volume 4, pages 503–531, (2017)

Annals of Data Science

Abstract

In this paper, we evaluate maximum subarrays for approximate string matching and alignment. The global alignment score as well as local sub-alignments are indicators of good alignment. After showing how maximum subarrays could be used for string matching, we provide several ways of using maximum subarrays: long, short, loose, strict, and top-k. While the long version extends the local sub-alignments, the short method avoids extensions that would not increase the alignment score. The loose method tries to achieve a high global score, whereas the strict method converts the output of the loose alignment by minimizing unnecessary gaps. The top-k method is used to find the top-k sub-alignments. The results are compared with two dynamic programming methods that use gap penalties, one global and one local, in addition to one of the state-of-the-art methods. In our experiments, using maximum subarrays generated good overall as well as local sub-alignments without requiring gap penalties.
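To make the core idea concrete, the following is a minimal sketch of how a maximum subarray could pick out the best-scoring local region of two strings compared position by position. It is not the paper's algorithm: the +1/-1 scoring values, the ungapped position-by-position comparison, and the function names `position_scores` and `max_subarray` are assumptions made for illustration, and the search itself is the standard Kadane scan.

```python
from typing import List, Tuple


def position_scores(a: str, b: str, match: int = 1, mismatch: int = -1) -> List[int]:
    """Score each column of two strings compared position by position (no gaps).

    The +1/-1 values are illustrative assumptions, not the paper's scoring scheme.
    """
    return [match if x == y else mismatch for x, y in zip(a, b)]


def max_subarray(scores: List[int]) -> Tuple[int, int, int]:
    """Kadane's algorithm: return (best_sum, start, end), with end exclusive.

    Returns (0, 0, 0) when no positive-scoring region exists.
    """
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the region here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end


a, b = "GATTACAXX", "GCTTACAYY"
score, start, end = max_subarray(position_scores(a, b))
print(score, a[start:end], b[start:end])
```

In this sketch the region is never extended across a prefix whose running sum is zero or negative, so no gap penalty is needed to decide where a local match starts or stops.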

References

Altschul SF, Carroll RJ, Lipman DJ (1989) Weights for data related by a tree. J Mol Biol 207(4):647–653
Altschul SF, Erickson BW (1986) Optimal sequence alignment using affine gap costs. Bull Math Biol 48(5–6):603–616
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389
Anisimova M, Cannarozzi G, Liberles D (2010) Finding the balance between the mathematical and biological optima in multiple sequence alignment. Trends Evol Biol 2(1):7
Aygun RS (2007) S2s: structural-to-syntactic matching similar documents. Knowl Inf Syst 16(3):303–329
Beebe NL, Clark JG (2007) Digital forensic text string searching: improving information retrieval effectiveness by thematically clustering search results. Dig Investig 4(Supplement):49–54
Bentley JL (2000) Programming pearls. Addison-Wesley Professional, Reading
Bille P, Gørtz IL, Vildhøj HW, Wind DK (2010) String matching with variable length gaps. In: Chavez E, Lonardi S (eds) String processing and information retrieval, number 6393 in Lecture Notes in Computer Science. Springer, Berlin, pp 385–394. doi:10.1007/978-3-642-16321-0_40
Breimer E, Goldberg M (2002) Learning significant alignments: an alternative to normalized local alignment. Springer, Berlin, pp 37–45
Brudno M, Malde S, Poliakov A, Do Chuong B, Couronne O, Dubchak I, Batzoglou S (2003) Glocal alignment: finding rearrangements during alignment. Bioinformatics 19(Suppl 1):i54–i62
Choi Y (2012) A fast computation of pairwise sequence alignment scores between a protein and a set of single-locus variants of another protein. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedicine, BCB ’12. ACM, New York, pp 414–417
Choi Y, Chan AP (2015) Provean web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics 31(16):2745
Clough P, Department Of Information Studies (2003) Old and new challenges in automatic plagiarism detection. In: National Plagiarism Advisory Service. http://ir.shef.ac.uk/cloughie/index.html, pp 391–407
Eddy SR (2004) Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 22(8):1035–1036
Feng X, Jin H, Zheng R, Zhu L, Dai W (2015) Accelerating Smith–Waterman alignment of species-based protein sequences on GPU. Int J Parallel Program 43(3):359–380
Gondro C, Kinghorn BP (2007) A simple genetic algorithm for multiple sequence alignment. Genet Mol Res 6(4):964–982
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 89(22):10915–10919
Huang W, Umbach DM, Li L (2006) Accurate anchoring alignment of divergent sequences. Bioinformatics 22(1):29–34
Jian Y, Xiu Y, Meng D (2010) Application of approximate string matching in video retrieval. In: 2010 3rd international conference on advanced computer theory and engineering (ICACTE), vol 4, pp V4–348–V4–351
Kandadi H, Aygun RS (2015) SEAL: a divide-and-conquer approach for sequence alignment. Netw Model Anal Health Inf Bioinform 4(1):1–11
Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5):473–483
Mount D (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor Laboratory Press, New York
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14(5):407–422
Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. In: Methods in enzymology, vol 183. Academic Press, London, pp 63–98
Peiravi A (2010) Application of string matching in Internet Security and Reliability. J Am Sci 6(1):25–33
Raad E, Chbeir R, Dipanda A (2010) User profile matching in social networks. In: 2010 13th international conference on network-based information systems (NBiS), pp 297–304
SaiKrishna V, Rasool A, Khare N (2012) String matching and its applications in diversified fields. Int J Comput Sci Issues 9(1):219–226
Söding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
Stamm M, Staritzbichler R, Khafizov K, Forrest LR (2014) AlignMe: a membrane protein sequence alignment web server. Nucleic Acids Res 42(W1):W246
Tang CL, Xie L, Koh IYY, Posy S, Alexov E, Honig B (2003) On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J Mol Biol 334(5):1043–1062
Thompson JD, Higgins DG, Gibson TJ (1994) Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673
Thompson JD, Higgins DG, Gibson TJ (1994) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Bioinformatics 10(1):19–29
Vingron M (1996) Near-optimal sequence alignment. Curr Opin Struct Biol 6(3):346–352
Zachariah MA, Crooks GE, Holbrook SR, Brenner SE (2005) A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins Struct Funct Bioinform 58(2):329–338
Zhao G, Ling C, Sun D (2015) Sparksw: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, pp 845–852

Author information

Authors and Affiliations

Computer Science Department, University of Alabama in Huntsville, Huntsville, AL, 35899, USA
Ramazan S. Aygun

Corresponding author

Correspondence to Ramazan S. Aygun.

Appendices

Appendix A: Detailed Experimental Results

Table 2 provides the experimental results of the pairwise alignments. The columns represent the methods used: long, short, loose, strict, top2 (the second best sub-alignment), top3 (the third best sub-alignment), Needleman–Wunsch (NW) and Smith–Waterman (SW). For each alignment, we provide its score and length. In Table 2, ‘scr’ provides the score of the overall alignment whereas ‘\(scr_s\)’ indicates the score of the best alignment (except for top-2 and top-3).

Table 2 Experimental results
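The top2 and top3 columns report the second and third best sub-alignments. One way such a ranking could be produced from a list of per-position scores is sketched below: take the maximum subarray, set it aside, and search the remaining stretches again. This greedy procedure, the names `best_segment` and `top_k_segments`, and the sample scores are assumptions for illustration only; the paper's top-k method may differ.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (score, start, end), with end exclusive


def best_segment(scores: List[int], lo: int, hi: int) -> Segment:
    """Kadane's algorithm restricted to scores[lo:hi]."""
    best: Segment = (0, lo, lo)
    cur_sum, cur_start = 0, lo
    for i in range(lo, hi):
        if cur_sum <= 0:
            cur_sum, cur_start = scores[i], i
        else:
            cur_sum += scores[i]
        if cur_sum > best[0]:
            best = (cur_sum, cur_start, i + 1)
    return best


def top_k_segments(scores: List[int], k: int) -> List[Segment]:
    """Greedily pick up to k disjoint positive-scoring segments, best first."""
    free = [(0, len(scores))]  # index ranges not yet claimed by a segment
    picked: List[Segment] = []
    while len(picked) < k:
        candidates = [(best_segment(scores, lo, hi), (lo, hi)) for lo, hi in free]
        candidates = [c for c in candidates if c[0][0] > 0]
        if not candidates:
            break  # nothing with a positive score is left
        seg, (lo, hi) = max(candidates, key=lambda c: c[0][0])
        picked.append(seg)
        # Split the claimed range into the parts left and right of the segment.
        free.remove((lo, hi))
        free.extend([(lo, seg[1]), (seg[2], hi)])
    return picked


print(top_k_segments([2, -3, 1, 1, -5, 3], 3))
```

Under these assumptions the first element plays the role of the best sub-alignment, and the second and third elements correspond to what the table labels top2 and top3.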

Appendix B: Sample Alignments

In this section, we provide sample alignments by the algorithms mentioned in the paper. The base protein that we use for the sample alignments is >gi|391863660|gb|EIT72965.1| G-protein alpha subunit [Aspergillus oryzae 3.042]

>gi|635505190|gb|KDE77264.1| G-protein [Aspergillus oryzae 100-8]:

MESERGIKPTSHCPQLSLGHVSYQAAQPGYKSSQCDQSPEKHTRGTNLAKGPPSRLLEKRQVDG

SKVTVLESLSNHLRQLLEVGEGRYDTRDSVDEEYAVMRETSDSVAQLLESLYVLPVKQGQGRMA

EICTDMTDIWGYVQKCGFLDEVKLECIDDGAEYLLNSLDRITEPNYIPTLEDTMWCYTKSTGIT

MARYTNGPSEVIFCDASGSRGERKKWGRIFDGATKVLYFVDAGSYDQCLTEEHNANRLAEELTL

FNSVCSTERLNHVEIVLFIHKMDKLERKLKTVPFDSTEIENWGTFSGDPQSVDDVKDYLYNTFS

AIAQKSSRSISVTFTSLRRPEEFGKTILSYASSVVNMV

We compare it with the following protein:

>gi|511009576|gb|EPB90817.1| guanine nucleotide-binding protein subunit alpha [Mucor circinelloides f. circinelloides 1006PhL]

MGCCASVEESDSVGKLRNEEIDGQLRMEKLNNKNEVKLLLLGAGESGKSTILKQMKLIHDGGFT

PEEKETYKEIIFSNSVQSIHVLLEAMETLDIPLDDASNQGYYDYIMDQYQKMDYFSMPPELVKA

IRMLWQDKGVQEAHSRRNEFQLNDSASYYFDSIDRIGDPNYLPTDQDVLRSRVKTTGIAESKFT

FGTLTYRMFDVGGQRSERKKWIHCFEDVTAIIFLVAISEYDQVLIEDESVNRMQEALTLFDSIC

NSRWFERTSTILFLNKTDLFKQKLPTSPLVDYFPDYKGANDDYEEASNYIMQRFISLNSSAEKQ

VYTHLTCATDTEQVKFVMAAVNDIVLQTNLRDVGLI

The alignment for our loose method is given below and the sub-alignment is highlighted in gray.

The alignment for our strict method is given below and the sub-alignment is highlighted in gray.

The alignment for Needleman–Wunsch (with highlighted sub-alignment) is provided below.

The Smith–Waterman algorithm’s sub-alignment is given below:

The alignment of AlignMe is given below with sub-alignment highlighted:

About this article

Cite this article

Aygun, R.S. Using Maximum Subarrays for Approximate String Matching. Ann. Data. Sci. 4, 503–531 (2017). https://doi.org/10.1007/s40745-017-0117-0

Received: 17 January 2017
Revised: 05 May 2017
Accepted: 13 July 2017
Published: 19 July 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s40745-017-0117-0
