Abstract
With genome sequencing projects producing huge amounts of sequence data, database sequence similarity search has become a central tool in bioinformatics for identifying potentially homologous sequences. It is thus widely used as an initial step in sequence characterization and annotation, phylogeny, genomics, transcriptomics, and proteomics studies. Database similarity search is based on the same sequence alignment methods used in pairwise sequence comparison. Sequence alignment can be global (whole sequence alignment) or local (partial sequence alignment), and there are algorithms that find the optimal alignment given particular comparison criteria. However, because a database search requires comparing the query sequence with every single sequence in the database, heuristic algorithms have been designed to reduce the time required to build an alignment that has a reasonable chance of being the best one. Such algorithms have been implemented as fast and efficient programs (Blast, FastA), available in several variants that address different kinds of problems. After searching the appropriate database, similarity search programs produce a list of similar sequences and local alignments. These results should be carefully examined before coming to any conclusion, as many traps await the similarity seeker: paralogues, multidomain proteins, pseudogenes, etc. This chapter presents points that should always be kept in mind when performing database similarity searches for various goals. It ends with a practical example of sequence characterization from a single protein database search using Blast.
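The distinction between global and local alignment can be illustrated with a minimal sketch of the two classic dynamic programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The function names, the match/mismatch scores, and the linear gap penalty below are illustrative assumptions, not values taken from the chapter; practical searches typically use substitution matrices such as PAM or BLOSUM and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch recurrence).

    Scoring values are illustrative assumptions; gaps are penalized linearly.
    """
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (Smith-Waterman recurrence).

    Same illustrative scoring; a cell never drops below 0, so an
    alignment may start and end anywhere in either sequence.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    query, subject = "GGGACGTGGG", "TTACGTTT"
    print("global:", global_score(query, subject))
    print("local: ", local_score(query, subject))
```

Both functions fill a table whose size is the product of the two sequence lengths. Repeating this for every database entry is the cost that heuristic programs are designed to avoid.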
© 2008 Humana Press, Totowa, NJ
About this protocol
Cite this protocol
Plewniak, F. (2008). Database Similarity Searches. In: Thompson, J.D., Ueffing, M., Schaeffer-Reiss, C. (eds) Functional Proteomics. Methods in Molecular Biology, vol 484. Humana Press. https://doi.org/10.1007/978-1-59745-398-1_24
DOI: https://doi.org/10.1007/978-1-59745-398-1_24
Publisher Name: Humana Press
Print ISBN: 978-1-58829-971-0
Online ISBN: 978-1-59745-398-1
eBook Packages: Springer Protocols