Abstract
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets for constructing seeds with optimal sensitivity/selectivity trade-offs. We propose several design methods and use them to construct several alphabets. We then analyze seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.
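To make the subset-seed idea named in the abstract concrete, the following minimal Python sketch treats each seed letter as a set of amino-acid pairs that it accepts, and reports the positions at which a seed hits an ungapped alignment of two sequences. It is an illustration under stated assumptions, not the paper's implementation: the seed letters here are restricted to partitions of the amino acids, and the letter names (`#`, `@`, `_`), the residue groupings, and the function names are all hypothetical and are not the alphabets designed in the paper.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def partition_letter(groups: List[str]) -> Dict[str, int]:
    """Build one seed letter from a partition of the amino acids.

    Each string in `groups` is one class; residues not listed
    end up in singleton classes (they match only themselves).
    """
    table: Dict[str, int] = {}
    for gid, group in enumerate(groups):
        for aa in group:
            table[aa] = gid
    next_id = len(groups)
    for aa in AMINO_ACIDS:
        if aa not in table:
            table[aa] = next_id
            next_id += 1
    return table


# Hypothetical seed alphabet, for illustration only.
SEED_ALPHABET: Dict[str, Dict[str, int]] = {
    "#": partition_letter([]),                           # identical residues only
    "@": partition_letter(["ILMV", "FWY", "KR", "DE"]),  # assumed groupings
    "_": partition_letter([AMINO_ACIDS]),                # any pair ("don't care")
}


def seed_hits(seed: str, s: str, t: str) -> List[int]:
    """Return start positions where `seed` hits the ungapped alignment of s and t.

    The seed hits at position i if, for every seed position j, the aligned
    pair (s[i+j], t[i+j]) falls in the same class of seed letter j.
    """
    n = min(len(s), len(t))
    hits = []
    for i in range(n - len(seed) + 1):
        if all(
            SEED_ALPHABET[c][s[i + j]] == SEED_ALPHABET[c][t[i + j]]
            for j, c in enumerate(seed)
        ):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(seed_hits("#@@#", "MKVLA", "MRILA"))
    print(seed_hits("####", "MKVLA", "MRILA"))
```

Because each letter is just a lookup table from residue to class, a seed match costs one table lookup per residue and no score accumulation, which is one way to read the abstract's remark that subset seeds are less costly to implement than accumulative schemes. The sketch only handles the 20 standard residues; other symbols would need an explicit policy.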
References
Kucherov, G., Noé, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. JBCB 4(2), 553–570 (2006)
Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403–410 (1990)
Altschul, S., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
Brown, D.: Optimizing multiple seeds for protein homology search. IEEE/ACM TCBB 2(1), 29–38 (2004) (earlier version in WABI 2004)
Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)
Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. JBCB 2(3), 417–439 (2004) (earlier version in GIW 2003)
Brejova, B., Brown, D., Vinar, T.: Vector seeds: an extension to spaced seeds. Journal of Computer and System Sciences 70(3), 364–380 (2005)
Noé, L., Kucherov, G.: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res. 33, W540–W543 (2005)
Mak, D., Gelfand, Y., Benson, G.: Indel seeds for homology search. Bioinformatics 22(14), e341–e349 (2006)
Csürös, M., Ma, B.: Rapid homology search with neighbor seeds. Algorithmica 48(2), 187–202 (2007)
Zhou, L., Stanton, J., Florea, L.: Universal seeds for cDNA-to-genome comparison. BMC Bioinformatics 9(36) (2008)
Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. In: RECOMB, pp. 76–84 (2004)
Kucherov, G., Noé, L., Roytberg, M.: Multi-seed lossless filtration. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 297–310. Springer, Heidelberg (2004)
Yang, I.H., et al.: Efficient methods for generating optimal single and multiple spaced seeds. In: IEEE BIBE, pp. 411–416 (2004)
Xu, J., Brown, D., Li, M., Ma, B.: Optimizing multiple spaced seeds for homology search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 47–58. Springer, Heidelberg (2004)
Kisman, D., Li, M., Ma, B., Wang, L.: tPatternHunter: gapped, fast and sensitive translated homology search. Bioinformatics 21(4), 542–544 (2005)
Peterlongo, P., et al.: Protein similarity search with subset seeds on a dedicated reconfigurable hardware. In: PBC. LNCS, vol. 4967 (2007)
Noé, L., Kucherov, G.: Improved hit criteria for DNA local alignment. BMC Bioinformatics 5(149) (2004)
Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Applied Mathematics 138(3), 253–263 (2004) (earlier version in 2002)
Li, T., Fan, K., Wang, J., Wang, W.: Reduction of protein sequence complexity by residue grouping. Protein Engineering 16, 323–330 (2003)
Murphy, L., Wallqvist, A., Levy, R.: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Engineering 13, 149–152 (2000)
Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992)
Henikoff, S., Henikoff, J.: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19(23), 6565–6572 (1991)
Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: RECOMB, pp. 67–75 (2003)
Ilie, L., Ilie, S.: Long spaced seeds for finding similarities between biological sequences. In: BIOCOMP, pp. 3–8 (2007)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Roytberg, M. et al. (2008). Efficient Seeding Techniques for Protein Similarity Search. In: Elloumi, M., Küng, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C. (eds) Bioinformatics Research and Development. BIRD 2008. Communications in Computer and Information Science, vol 13. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70600-7_36
DOI: https://doi.org/10.1007/978-3-540-70600-7_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70598-7
Online ISBN: 978-3-540-70600-7
eBook Packages: Computer Science (R0)