Background

The positions of insertion/deletion mutations (indels) in molecular data sets can provide useful phylogenetic information [1–4], yet this information is rarely used, especially in large data sets with many indels. There are three main reasons for this. First, some workers believe that indels may be unreliable as characters [5]. However, numerous studies that compared indel characters with already established tree topologies have found indels to be reliable for constructing phylogenies [6–11]. Second, it can be very time-consuming to determine character states based on gaps and enter this information into a data matrix by hand. Third, there is disagreement as to the best method of defining homologous character states for indels. Several different methods for incorporating indels into phylogenetic analyses have been used. We discuss five of the most useful of these methods.

The computer program MALIGN uses the first of these methods of including indels in sequence alignment and phylogenetic analysis of sequences [12]. In this method, gap characters are considered to be a fifth character state for bases in DNA, as in Eernisse and Kluge [1]. Adjacent gap characters are therefore considered independently of their neighbors, although gap characters after the first may be weighted less heavily to reflect the possibility of longer indel regions [12, 13]. Essentially, each individual gap position is treated as if it were a separate indel event. This is not very realistic, because insertion or deletion events often consist of multiple bases [14–16]. Since many gap characters do not arise independently of one another, counting each gap character as a separate event causes a single indel event to be counted multiple times in determining phylogenetic relationships. This over-weights the indels and can distort phylogenies. Simmons and Ochoterena [16] also note a theoretical objection: because gaps are the product of the alignment procedure, and are not actually found in organisms or their sequences, sequences with gap characters do not have anything to compare with other sequences at the point where the gap occurs. For these reasons, gaps should not be considered as a fifth character state for nucleotide characters.
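The over-weighting described above can be made concrete with a minimal Python sketch that compares the number of gap positions in an aligned sequence with the number of contiguous gap runs. It assumes, purely for illustration, that '-' is the gap symbol and that each contiguous run corresponds to a single indel event; the function name and the toy sequence are hypothetical and are not taken from MALIGN or any other program.

```python
import re

def gap_positions_and_runs(aligned_seq, gap="-"):
    """Return (number of gap positions, number of contiguous gap runs)."""
    # Assumption: one contiguous run of gap symbols = one indel event.
    runs = re.findall(re.escape(gap) + "+", aligned_seq)
    return sum(len(run) for run in runs), len(runs)

positions, runs = gap_positions_and_runs("AC-----GTAC--GT")
print(positions, runs)  # 7 gap positions, but only 2 contiguous runs
```

Under fifth-state coding with equal weights, this toy sequence would contribute seven gap characters, whereas treating each run as one event would contribute two.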

The second method, optimization alignment, is implemented in the program POY [17]. POY performs a phylogenetic analysis, including indels as character state changes, without ever creating a multiple-sequence alignment. Although this avoids the major problems with MALIGN, it has a limitation. Indel changes may be weighted more heavily than substitutions, but the same weight is used both for determining the position of indels and for the phylogenetic analysis. For example, it is not possible to use a gap weight of 10 (an indel is equivalent to 10 substitutions), as is common in protein-coding regions, without also weighting that change 10 times as much as a substitution in the phylogenetic analysis.

The third method to be considered is the multistate gap region method [4, 18–20]. In this method, areas of overlapping indels, termed gap regions, are coded as individual characters. Different indels within each region are considered to be different states of the corresponding multistate gap region character [4]. Within the DNA sequences, gap characters are coded as missing data, and the gap region characters are then placed at the end of each sequence. This method is useful because it codes indels as separate characters and treats contiguous gap characters as related. However, the number of character states for each gap region can be quite large. Because there are so many possible states, these characters can be less informative about relationships than characters produced by other methods.

Simmons and Ochoterena have proposed a fourth method for coding indels [16], termed "simple indel coding". Like the third method, it codes indels as separate characters in a data matrix, which is then considered along with the DNA base characters in phylogenetic analysis. Each indel with different start and/or end positions is considered to be a separate character, which each of the taxa under consideration either has or lacks. If one of the indels completely overlaps an indel contained within another sequence, the sequences containing the longer indel are coded as being inapplicable for the shorter indel. This is done because it is impossible to determine whether or not the shorter indel is present in the sequences containing the longer one. Simple indel coding has the advantages of being conservative and easy to implement while still allowing indels to be highly informative in determining a correct phylogeny [16].
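The coding rule just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any published program: the function name and toy alignment are hypothetical, '-' is assumed to be the gap symbol, terminal gaps receive no special treatment, and the symbols '1', '0' and '-' (present, absent, inapplicable) follow the convention described for GapCoder in the Algorithm section.

```python
import re

def simple_indel_coding(alignment):
    """Code indels as presence/absence characters.

    alignment: dict mapping taxon name -> aligned sequence ('-' = gap).
    Returns (indels, codes): indels is a sorted list of 1-based inclusive
    (start, end) spans; codes maps taxon -> one symbol per indel.
    """
    # Gap runs per taxon, as 1-based inclusive (start, end) spans.
    gaps = {
        taxon: [(m.start() + 1, m.end()) for m in re.finditer(r"-+", seq)]
        for taxon, seq in alignment.items()
    }
    # Each distinct (start, end) pair becomes one indel character.
    indels = sorted({span for spans in gaps.values() for span in spans})

    codes = {}
    for taxon, spans in gaps.items():
        row = []
        for start, end in indels:
            if (start, end) in spans:
                row.append("1")  # this exact indel is present
            elif any(s <= start and end <= e for s, e in spans):
                row.append("-")  # completely inside a longer gap: inapplicable
            else:
                row.append("0")  # indel absent (includes partial overlaps)
        codes[taxon] = "".join(row)
    return indels, codes

# Hypothetical toy alignment, not the data shown in the figures.
alignment = {
    "TaxonA": "ACGTACGTAC",
    "TaxonB": "AC-----TAC",
    "TaxonC": "ACG--CGTAC",
}
indels, codes = simple_indel_coding(alignment)
print(indels)  # [(3, 7), (4, 5)]
print(codes)   # {'TaxonA': '00', 'TaxonB': '1-', 'TaxonC': '01'}
```

In this toy case TaxonB receives '-' for the 4–5 indel because its longer 3–7 gap completely covers that region, so the presence of the shorter indel cannot be assessed.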

The final method, complex indel coding, is also described by Simmons and Ochoterena [16]. This method attempts to better account for the fact that indels are evolutionarily related to one another, and that an indel region may be modified through additional insertion/deletion events to yield a different indel region in another sequence. Complex indel coding, like simple indel coding, codes indels with different start and end positions as individual characters. However, overlapping indels may represent an evolutionary transition sequence [16], and step matrices are constructed to accommodate this possibility. Complex indel coding utilizes more of the available information and never implies fewer steps than is biologically realistic. However, this method generates some multi-state characters and step matrices and is thus more complicated to program. The step matrices also slow down phylogenetic programs. For a more thorough discussion of indels and their purpose in phylogenetic analysis, see reference [16].

Algorithm

The GapCoder program

Simple indel coding [16] was chosen for implementation because it is a relatively simple algorithm and does not make as many assumptions as complex indel coding. As a result, GapCoder should be acceptable to a wide range of researchers with different views about the exact nature of indels. GapCoder considers homologous indels or gaps to be those with the same start and end positions in the nucleotide sequences. Indels of differing lengths are not considered homologous, because it would take additional mutations to transform one into another [16].

GapCoder takes a pre-aligned PIR-format or modified FASTA-format file as input and examines it to gather information about the positions of the indel regions. Figures 1 and 2 illustrate the two valid input file types. The PIR-format file (Figure 1) can be generated automatically by programs such as ClustalX. The modified FASTA-format file (Figure 2) differs from the standard FASTA format by the inclusion of two numbers at the top of the file. The first number is the number of taxa contained in the file, and the second is the number of bases in each of the taxa. The taxon names and sequences are placed below the numbers.

The output from GapCoder is a NEXUS-format file. The new characters created by the algorithm are placed at the end of the data. In addition, a table of correspondences between the indels and their codes is placed at the bottom of the file. If regions are excluded from an analysis, this table makes it easy to identify the corresponding indel characters for exclusion. An example output file corresponding to the input given in Figure 1 or Figure 2 is shown in Figure 3. The indel characters coded by the program are listed at the end of each sequence.

Each indel character can be in one of three states for each taxon: present ('1'), absent ('0') or inapplicable ('-'). When one or more indels are contained completely within a larger indel, all of the taxa that have the larger indel are coded as inapplicable ('-') for the smaller indels. For example, consider the first two indels listed in the correspondence table at the bottom of Figure 3. The table lists these indels as characters 17 and 18 at the ends of the sequences. The indel represented by character 17 spans positions 3–7 of the matrix. TaxonB and TaxonG have the indel in place of bases 3–7 and receive a '1' for character 17. The indel represented by character 18 spans positions 4–5. TaxonC and TaxonH clearly have this indel and are scored as '1' for character 18. However, since the first indel completely covers the region of the second indel, it is unclear whether TaxonB or TaxonG could have had the second indel. Therefore, these taxa are given a '-' for character 18.
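As an illustration of the modified FASTA-format input described above, the following Python sketch reads such a file and checks the two header numbers against the records. The exact layout is an assumption, since it is defined only by Figure 2: here the two numbers are taken to be the whitespace-separated tokens that precede the first record, and each taxon name is assumed to follow a '>' as in standard FASTA. The function name and sample data are hypothetical and are not part of GapCoder.

```python
def parse_modified_fasta(text):
    """Parse '<n_taxa> <n_chars>' header plus '>'-prefixed records.

    Assumed layout (not confirmed by the source): the two numbers are the
    only tokens before the first '>' line. Returns dict taxon -> sequence.
    """
    header, records, name = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            header.extend(line.split())  # tokens before the first record
        else:
            records[name] += line        # a sequence may span several lines
    if len(header) != 2:
        raise ValueError("expected two numbers before the first record")
    n_taxa, n_chars = int(header[0]), int(header[1])
    if len(records) != n_taxa:
        raise ValueError(f"header says {n_taxa} taxa, found {len(records)}")
    for taxon, seq in records.items():
        if len(seq) != n_chars:
            raise ValueError(f"{taxon}: expected {n_chars} characters, found {len(seq)}")
    return records

# Hypothetical sample, not the file shown in Figure 2.
sample = """3 10
>TaxonA
ACGTACGTAC
>TaxonB
AC-----TAC
>TaxonC
ACG--CGTAC
"""
records = parse_modified_fasta(sample)
print(records["TaxonB"])  # AC-----TAC
```

Validating the counts up front is a design choice of this sketch: it catches truncated or unaligned input before any indel positions are computed.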

Figure 1

Sample input file, PIR format

Figure 2

Sample input file, modified FASTA format

Figure 3

Sample output file. Output files are in the NEXUS format and ready to be input into PAUP or other programs that use this format. The indel characters have been added to the matrix, and a table of correspondences is appended in the form of a comment, showing each indel character and the position of the indel upon which it is based. The Equate command allows 0 and 1 to be used while maintaining the data type as 'DNA'. This allows one to perform maximum likelihood and other analyses that require this data type. If a model of DNA substitution is applied, however, it may be most appropriate to exclude the indel characters from the analysis, because indels probably do not evolve according to the same model as substitutions.

Discussion

GapCoder has the potential to be useful in phylogenetics, especially in non-protein-coding regions where indels can be as plentiful as substitutions. Whenever multiple phylogenetic analyses are performed, or greater resolution is required, GapCoder provides an efficient way to incorporate the phylogenetic information contained in the indels. For example, the output resulting from GapCoder may be used in exploratory analyses of optimal DNA sequence alignment. Such an analysis would likely include GapCoder as part of an objective method with four stages. In the first stage, several alignments would be created using a program such as ClustalX. GapCoder would then be used to code the indels into the data matrix. Next, a phylogenetic analysis of the data would be performed using software such as PAUP. Finally, the best alignment could be chosen using the desired optimality criterion. GapCoder is also useful when different character sets and/or taxon sets are being explored, such as when different combinations of outgroups are tried. This often requires re-aligning the data set for each taxon set; GapCoder allows the indel characters to be quickly added each time.