Findings

Background

Multiple sequence alignments (MSAs) are the basis of comparative analyses that rely on sequence homology [14]. Alignments of homologous sequences are used to characterize protein domains, predict protein function, detect motifs and describe gene families, as well as to infer evolutionary relationships between species. However, often there are sections in MSAs that can contain sequences that are erroneously aligned. These correspond to regions that are i) under a rapid evolutionary rate, ii) non-homologous because of the choice of different splicing variants in the comparison between species, iii) wrongly annotated intron-exon barriers, iv) local structural rearrangements in a single species, etc. It is difficult to classify portions of an alignment as either very divergent or non-homologous but it has been shown that phylogenetic results are improved after removing divergent and ambiguously aligned blocks from protein sequence alignments [5]. Sequences in a MSA should be neither so similar that they are devoid of variation among the sites nor so divergent that positions are saturated by multiple substitutions, especially for phylogenetic analyses [6, 7]. Some methods have been developed to automatically clean alignments, but they mostly work by removing complete sequences if determined to be unrelated [8] or by deleting complete columns of the MSA [5]. Other approaches should be taken into account, such as the one used by Guidance [9], which can detect problematic sections of individual sequences located within regions with high alignment uncertainty (e.g. Figure 1). Alternatively, manual adjustment can be performed to remove or mask potential non-homologous regions by removing a minimum amount of sequence information, but this leads to biases and irreproducibility of the results and is impractical for large-scale genomic analyses.

Figure 1
figure 1

Example of the windows identified by DivA. Outlier windows determined by DivA are shown in black boxes. MEGA5 [10] was used to display the alignment view (the option to toggle off sites with a conservation score of more than 80% was used for an easier visualization of the outlier amino acids).

We developed DivA, a method that uses a sliding window approach to detect sections of individual sequences that are very divergent to the rest of the alignment. DivA first calculates the distribution of four parameters for each sequence in each window based on sequence similarity using both sequence weighting with position specific counts and distance-based methods. The windows for each sequence are then classified as homologous or very divergent depending on automatically calculated thresholds based on the outlier values of each parameter determined using all windows from all alignments (Figure 2). The user can change the stringency of the thresholds by choosing a desired standard deviation of the parameter values. The output is a table containing the coordinates of the very divergent windows and the values of the parameters (Table 1). Also, the user has the option to output an alignment where the outlier segments are masked.

Figure 2
figure 2

DivA’s workflow. Using a sliding window approach, four parameters are calculated for every sequence in every window. If there are conserved sites at the edge of each window those are trimmed and the parameter values for the window are recalculated. The threshold values for each parameter are then calculated and used to classify each window in each sequence as very divergent (potentially non-homologous) or truly homologous. Sequences from overlapping windows classified as outlier are merged, and the final coordinates are provided in the output file. The user can also obtain a new alignment file where the outlier windows are masked.

Table 1 Example of output from DivA for the Test alignment

Datasets

To test the performance of DivA we used 200 MSAs with sets of orthologous proteins generated by a recent avian phylogenomics project [11] and corresponding manual annotations of highly divergent sequence segments [see Additional file 1]. The 200 MSAs were chosen randomly from the 8295 orthologous sets in Jarvis et al[12] (this paper presents whole genome data and the corresponding annotations for 48 bird species representing 36 orders of birds). Each MSA contains protein sequences from up to 48 birds and the corresponding orthologues from the more distantly related species Homo sapiens, Alligator mississippiensis, Chelonia mydas, and Anolis carolinensis (human, alligator, turtle, and lizard, respectively). DivA was run with the default program parameters on all the alignments, both including and excluding non-birds species. We also created subsets containing 50, 100 and 200 MSAs from these 200 MSAs in which we only keep the bird sequences and exclude the sequences of the distantly related species.

Parameter estimation

The first parameter A is based on the probability of observing the amino acids from a sequence in a given window of the alignment. The second parameter B is based on the smallest pairwise distance to another amino acid in that position calculated using blosum62 [13]. The other two parameters used are the Z-scores per sequence per window of the A and B parameters, Z A and Z B , respectively. A detailed description of parameters A and B is presented next.

1. The first parameter A is based on the probability of observing the amino acids (AA s) in a sequence in a given window of the alignment. The probability of observing amino acid a in position i in a sequence S in an alignment corresponds to the counts for that amino acid, c(a i S), divided by the number of sequences N:

P a i S = c a i S N
(1)

The parameter A for a window of sequence S corresponds to the sum of these probabilities for each position of the window divided by the length of the window L:

A S = i = 1 L P a i S L
(2)

2. The second parameter B is based on the smallest pairwise distance to another amino acid in that position calculated using blosum62 [13]. The use of a column-by-column score lowers the probability of an orthologous sequence that shares high sequence similarity with different orthologs in the different sites of the window to be labeled as very divergent. For each position i of the sequence S, a i S is compared to each of the amino acids on that position in the other sequences. The distance to amino acid b i in another sequence X that has the smallest dissimilarity to a i S corresponds to the highest pairwise blosum62 [13] distance:

d a i S =max D a i S , b i X : b a , X S
(3)

The B parameter corresponds to the average of those distances:

B S = i = 1 L d a i S L
(4)

For the calculation of the parameters, sequence segments were discarded when presenting more than 40% gaps or a P(a i S) <70%, indicative of weakly conserved sequences. In order to ensure that only the smallest required amount of sequence is discarded, if the probability of observing the first or last amino acid in a given window of a given sequence is higher than 0.9 that amino acid is removed and the parameters are re-calculated for the resized window.

Determination of parameter thresholds

We started by analyzing the distribution of the four parameters values calculated for the 200 bird-only alignments (Figure 3). The distribution of the values on the very divergent and homologous windows is not clearly differentiated, with some values of some parameters from very divergent windows overlapping values from homologous windows, thus posing difficulties for a straightforward thresholds definition. For the calculation of the parameter thresholds, sequence segments were discarded when presenting more than 40% gaps, indicative of very conserved sequences. The outliers were defined in terms of the Z scores for all four parameters: Z( A ) < 1, Z( B ) < 2, Z( Z A ) > 2, and Z( Z B ) > 2. We used the decision tree method from the R package ‘tree’ [14] to find the thresholds that define an outlier sequence segment according to the manual annotation (Additional file 2: Figure S1A).

Figure 3
figure 3

Distributions of the parameter values calculated for the 200 bird-only alignments. The upper ticks show the values of the homologous windows and the ticks on the lower X-axis show the values of the outlier windows. A) Distribution of the A parameter. B) Distribution of the B parameter. C) Distribution of the Z A parameter. D) Distribution of the Z B parameter.

We performed a 10-fold cross-validation with half of the dataset using the prune.tree function and the misclass method in R (package ‘tree’). This showed a low misclassification error rate (1.3e-05) and residual mean deviance (3.2e-04) (Additional file 2: Figure S1B and S2C). Furthermore, we used the other half of the dataset as validation set with the predict function, and classified it with the previously defined thresholds. In agreement with the previous result, we obtained a low misclassification rate of 1.3e-05 (Additional file 2: Figure S1D and S2E). The misclassification rate was calculated as the number of misclassifications divided by the total number of data points in the validation dataset.

Guidance performance comparison to DivA

Guidance [9] is a currently available program for detecting problems in single sequences in a MSA; it works by assigning a confidence value for every position of the alignment. We compared the output from Guidance and that from DivA using the dataset of 200-only bird alignment and the dataset of 200 alignments including the very divergent species. In Guidance two thresholds of confidence values were used; one considering the window as very divergent and potentially non-homologous if having a Guidance score equal or less than 0.4; and a second very relaxed one that considers the window as such if the threshold is less than 0.8.

Results and discussion

To test the impact of the size of the dataset, DivA was applied to the datasets of 50, 100, and 200 only-birds alignments. Efficiency tests were applied to the classified windows and the results show that the model performs best with big datasets (Table 2), as expected in a phylogenomics analysis where up to thousands of alignments are concatenated (Additional file 2: Figure S2). We further explored the impact of the divergence to the classification and showed that DivA has a very high sensitivity (81% TPR) for the bird-only alignments, compared to the alignments including other vertebrate species (47% TPR) (Table 2). This can be explained by the increasing difficulty in distinguishing between true divergence and error. We also examined the divergence in the alignment with the highest number of TP very divergent windows and the one with highest number of FP very divergent windows from the 200 only-birds alignments (Additional file 2: Table S1).

Table 2 Dataset impact on model accuracy

To our knowledge, the only other method currently available for detecting problems in single sequences in a MSA is Guidance [9], but it does so only when the sequence segment is located in a region of high alignment uncertainty (Additional file 2: Figure S3). The comparison of the performance of Guidance and DivA showed that DivA produced better efficiency test results for the two datasets used with the two score thresholds in Guidance (Table 3).

Table 3 Efficiency tests of Guidance and DivA

Conclusions

The present method was developed to solve the subjective and time-consuming problem of manual annotation and identification of incorrect gene annotation in genomic projects with phylogenomic studies. It uses a statistical framework that takes into account the next information: i) probability of an amino acids appearing in a position in the window alignment, ii) the smallest pairwise distance to another amino acid in that position in the window, and iii) the Z-score of i) and ii). That information is then integrated into a binary decision making model for the window to be classified as very divergent or truly homologous. It is easy to use; it does not require a manual annotation or input training set, and its parameter values are obtained automatically. The output is clear to interpret and allows the user to take more informed decisions for reducing the amount of sequence discarded but still finding the potentially erroneous and non-homologous regions.

Availability and requirements

  • Project name: DivA

  • Project home page: https://github.com/lisandracady/DivA

  • Operating system(s): Platform independent

  • Programming language: Python 2.7

  • Other requirements: Python packages numpy, re, os, sys, argparse, Bio

  • License: Lesser GPL 3 (LGPL 3)

  • Any restrictions to use by non-academics: None.

Availability of supporting data

The data sets supporting the results of this article are available in the “The avian phylogenomic project” data repository, http://gigadb.org/dataset/101000. In particular, the ones used in this article are in https://github.com/lisandracady/DivA/tree/master/MUSCLEalns and https://github.com/lisandracady/DivA/tree/master/MUSCLE_birdsOnly.