Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences
Various measures of sequence dissimilarity have been evaluated by how well the additive least squares estimation of edges (branch lengths) of an unrooted evolutionary tree fit the observed pairwise dissimilarity measures and by how consistent the trees are for different data sets derived from the same set of sequences. This evaluation provided sensitive discrimination among dissimilarity measures and among possible trees. Dissimilarity measures not requiring prior sequence alignment did about as well as did the traditional mismatch counts requiring prior sequence alignment. Application of Jukes-Cantor correction to singlet mismatch counts worsened the results. Measures not requiring alignment had the advantage of being applicable to sequences too different to be critically alignable. Two different measures of pairwise dissimilarity not requiring alignment have been used: (1) multiplet distribution distance (MDD), the square of the Euclidean distance between vectors of the fractions of base singlets (or doublets, or triplets, or…) in the respective sequences, and (2) complements of long words (CLW), the count of bases not occurring in significantly long common words. MDD was applicable to sequences more different than was CLW (noncoding), but the latter often gave better results where both measures were available (coding). MDD results were improved by using longer multiplets and, if the sequences were coding, by using the larger amino acid and codon alphabets rather than the nucleotide alphabet. The additive least squares method could be used to provide a reasonable consensus of different trees for the same set of species (or related genes).
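As an illustration only, the MDD definition given above (the squared Euclidean distance between multiplet-fraction vectors) and the standard Jukes-Cantor correction of a mismatch fraction can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the use of overlapping multiplets, and the skipping of multiplets containing letters outside the alphabet are assumptions made here.

```python
from collections import Counter
from itertools import product
import math


def multiplet_fractions(seq, k, alphabet="ACGT"):
    """Fractions of each k-letter multiplet in seq, in a fixed order.

    Assumption: multiplets are counted in overlapping windows, and windows
    containing letters outside `alphabet` are skipped.
    """
    seq = seq.upper()
    allowed = set(alphabet)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= allowed)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid multiplets of length k")
    return [counts["".join(m)] / total for m in product(alphabet, repeat=k)]


def mdd(seq_a, seq_b, k=1, alphabet="ACGT"):
    """Multiplet distribution distance: squared Euclidean distance between
    the multiplet-fraction vectors of two (unaligned) sequences."""
    fa = multiplet_fractions(seq_a, k, alphabet)
    fb = multiplet_fractions(seq_b, k, alphabet)
    return sum((x - y) ** 2 for x, y in zip(fa, fb))


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a mismatch fraction p
    (fraction of aligned sites that differ); defined for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)
```

With singlets (k=1), two sequences of identical base composition have an MDD of zero whatever their order, for example `mdd("ACGT", "TGCA", k=1)`. Longer multiplets separate them, in line with the abstract's remark that longer multiplets improved the MDD results. Passing a 20-letter amino acid alphabet as `alphabet` gives the protein analogue for coding sequences.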
Key words: Evolutionary trees, Additive least squares, DNA, Greatly divergent sequences, Consensus trees
- Dayhoff MO (1979) Atlas of protein sequence and structure, vol 5, suppl 3. National Biomedical Research Foundation, Washington DC, p 8
- Dickerson RE, Geis I (1983) Hemoglobin: structure, function, evolution and pathology. Benjamin/Cummings, Menlo Park CA, p 93
- Felsenstein J (1986) PHYLIP - phylogeny inference package (version 2.9). University of Washington, Seattle
- Felsenstein J (1987) PHYLIP Newsletter, number 9, May 1987
- Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro NH (ed) Mammalian protein metabolism. Academic Press, New York, pp 22–123
- Karlin S, Ost F, Blaisdell BE (1989) Patterns in DNA and amino acid sequences and their statistical significance. In: Waterman MS (ed) Mathematical methods for DNA sequences. CRC Press, Boca Raton FL (in press)
- Kittur SD, Hoppener JWM, Antonarakis SE, Daniels JDJ, Meyers DA, Maestri NE, Maarten J, Korneluk RG, Nelkin BD, Kazazian HH (1985) Linkage map of the short arm of chromosome 11: location of the genes for catalase, calcitonin and insulin-like growth factor II. Proc Natl Acad Sci USA 82:5064–5067
- Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:408–425