Aligning amino acid sequences: Comparison of commonly used methods
We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of “weighting” in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple scheme, UM (unitary matrix), in which only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on the structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical “jumbling test.” This was not universally the case, however, and in some instances the simple UM was as good as or better than the LOM. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the differing effectiveness of the four approaches in the two sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.
Key words: Amino acid sequence alignment, Tyrosine kinases, Globins, Homologies
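To make the procedure described in the abstract concrete, the following is a minimal sketch of a Needleman-Wunsch-style global alignment score using the unitary matrix (UM) scheme, together with a shuffling-based significance check in the spirit of the "jumbling test." It is an illustration under stated assumptions, not the authors' program: the linear gap penalty of -1, the number of shuffles, the use of a z-like score, and the names `unitary_score`, `nw_score`, and `jumbling_test` are all hypothetical choices made here. Another weighting scheme (GC, SG, or LOM) could be substituted by passing a different `score` function.

```python
import random
from statistics import mean, pstdev


def unitary_score(a, b):
    """UM scheme: score 1 for an identity, 0 for any mismatch."""
    return 1 if a == b else 0


def nw_score(seq1, seq2, score=unitary_score, gap=-1):
    """Global alignment score by dynamic programming.

    Assumption: a simple linear gap penalty; the abstract does not
    state the penalty actually used.
    """
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(seq2) + 1)]
    for i, a in enumerate(seq1, start=1):
        curr = [i * gap]
        for j, b in enumerate(seq2, start=1):
            curr.append(max(prev[j - 1] + score(a, b),  # align a with b
                            prev[j] + gap,              # a opposite a gap
                            curr[j - 1] + gap))         # b opposite a gap
        prev = curr
    return prev[-1]


def jumbling_test(seq1, seq2, trials=100, seed=0, **kwargs):
    """Compare the real score with scores for shuffled versions of seq2.

    Returns (real, mean_shuffled, sd_shuffled, z), where z is the number
    of standard deviations by which the real score exceeds the shuffled
    mean, or None if the shuffled scores show no spread.
    """
    rng = random.Random(seed)
    real = nw_score(seq1, seq2, **kwargs)
    letters = list(seq2)
    shuffled = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, randomized order
        shuffled.append(nw_score(seq1, "".join(letters), **kwargs))
    mu = mean(shuffled)
    sigma = pstdev(shuffled)
    z = (real - mu) / sigma if sigma > 0 else None
    return real, mu, sigma, z
```

A call such as `jumbling_test(seq_a, seq_b, trials=100)` returns the real score alongside the shuffled-sequence statistics. Because shuffling preserves amino acid composition, a large z suggests similarity beyond what composition alone would produce; any threshold applied to it would be a rule of thumb rather than a formal test.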
- Barker WC, Dayhoff MO (1982) Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase. Proc Natl Acad Sci USA 79:2836–2839
- Dayhoff MO (1972) A model of evolutionary change in proteins. Detecting distant relationships: computer methods and results. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, DC, pp 89–110
- Dayhoff MO (1978) A model of evolutionary change in proteins. Matrices for detecting distant relationships. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5, suppl 3. National Biomedical Research Foundation, Washington, DC, pp 345–358
- Doolittle RF (1979) Protein evolution. In: Neurath H, Hill RL (eds) The proteins, vol IV. Academic Press, New York, pp 1–118
- Fitch WM, Margoliash E (1967) Construction of phylogenetic trees. Science 155:279–284
- Fitch WM, Smith TF (1982) Implications of minimal length trees. Syst Zool 31:68–75
- Kernighan BW, Ritchie DM (1978) The C programming language. Prentice-Hall, Englewood Cliffs, New Jersey
- Ploegman JH, Drent G, Kalk KH, Hol WGJ, Heinrikson RL, Keim P, Weng L, Russell J (1978) The covalent and tertiary structure of bovine liver rhodanese. Nature 273:124–129
- Reddy EP, Smith MJ, Srinivasan A (1983) Nucleotide sequence of Abelson murine leukemia virus genome: structural similarity of its transforming gene product to other onc gene products with tyrosine-specific kinase activity. Proc Natl Acad Sci USA 80:3623–3627, Proc Natl Acad Sci USA 80:7372 (correction)
- Suzuki T, Takagi T, Gotoh T (1982) Amino acid sequence of the smallest polypeptide chain containing heme of extracellular hemoglobin from the polychaete Tylorrhynchus heterochaetus. Biochim Biophys Acta 708:253–258